
Intel® 64 and IA-32 Architectures

Software Developer’s Manual

Documentation Changes

October 2024

Notice: The Intel® 64 and IA-32 architectures may contain design defects or errors known as errata
that may cause the product to deviate from published specifications. Current characterized errata are
documented in the specification updates.

Document Number: 252046-077


Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.


No product or component can be absolutely secure.
Your costs and results may vary.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis
concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any
patent claim thereafter drafted which includes subject matter disclosed herein.
All product plans and roadmaps are subject to change without notice.
The products described may contain design defects or errors known as errata which may cause the product to
deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from
course of performance, course of dealing, or usage in trade.
Code names are used by Intel to identify products, technologies, or services that are in development and not
publicly available. These are not “commercial” names and not intended to function as trademarks.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this
document, with the sole exception that a) you may publish an unmodified copy and b) code included in this
document is licensed subject to the Zero-Clause BSD open source license (0BSD), https://opensource.org/
licenses/0BSD. You may create software implementations based on this document and in compliance with the
foregoing that are intended to execute on the Intel product(s) referenced in this document. No rights are granted
to create modifications or derivatives of this document.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries. Other names and brands may be claimed as the property of others.



Contents

Revision History
Preface
Summary Tables of Changes
Documentation Changes


Revision History

Revision  Date              Description
-001      November 2002     Initial release.
-002      December 2002     Added Documentation Changes 1-10. Removed old Documentation Changes items that have already been incorporated in the published Software Developer's Manual.
-003      February 2003     Added Documentation Changes 9-17. Removed Documentation Change #6 (references to bits Gen and Len deleted) and Documentation Change #4 (VIF information added to CLI discussion).
-004      June 2003         Removed Documentation Changes 1-17. Added Documentation Changes 1-24.
-005      September 2003    Removed Documentation Changes 1-24. Added Documentation Changes 1-15.
-006      November 2003     Added Documentation Changes 16-34.
-007      January 2004      Updated Documentation Changes 14, 16, 17, and 28. Added Documentation Changes 35-45.
-008      March 2004        Removed Documentation Changes 1-45. Added Documentation Changes 1-5.
-009      May 2004          Added Documentation Changes 7-27.
-010      August 2004       Removed Documentation Changes 1-27. Added Documentation Change 1.
-011      November 2004     Added Documentation Changes 2-28.
-012      March 2005        Removed Documentation Changes 1-28. Added Documentation Changes 1-16.
-013      July 2005         Updated title. There are no Documentation Changes for this revision of the document.
-014      September 2005    Added Documentation Changes 1-21.
-015      March 9, 2006     Removed Documentation Changes 1-21. Added Documentation Changes 1-20.
-016      March 27, 2006    Added Documentation Changes 21-23.
-017      September 2006    Removed Documentation Changes 1-23. Added Documentation Changes 1-36.
-018      October 2006      Added Documentation Changes 37-42.
-019      March 2007        Removed Documentation Changes 1-42. Added Documentation Changes 1-19.
-020      May 2007          Added Documentation Changes 20-27.
-021      November 2007     Removed Documentation Changes 1-27. Added Documentation Changes 1-6.
-022      August 2008       Removed Documentation Changes 1-6. Added Documentation Changes 1-6.
-023      March 2009        Removed Documentation Changes 1-6. Added Documentation Changes 1-21.
-024      June 2009         Removed Documentation Changes 1-21. Added Documentation Changes 1-16.
-025      September 2009    Removed Documentation Changes 1-16. Added Documentation Changes 1-18.
-026      December 2009     Removed Documentation Changes 1-18. Added Documentation Changes 1-15.
-027      March 2010        Removed Documentation Changes 1-15. Added Documentation Changes 1-24.
-028      June 2010         Removed Documentation Changes 1-24. Added Documentation Changes 1-29.
-029      September 2010    Removed Documentation Changes 1-29. Added Documentation Changes 1-29.
-030      January 2011      Removed Documentation Changes 1-29. Added Documentation Changes 1-29.
-031      April 2011        Removed Documentation Changes 1-29. Added Documentation Changes 1-29.
-032      May 2011          Removed Documentation Changes 1-29. Added Documentation Changes 1-14.
-033      October 2011      Removed Documentation Changes 1-14. Added Documentation Changes 1-38.
-034      December 2011     Removed Documentation Changes 1-38. Added Documentation Changes 1-16.
-035      March 2012        Removed Documentation Changes 1-16. Added Documentation Changes 1-18.
-036      May 2012          Removed Documentation Changes 1-18. Added Documentation Changes 1-17.
-037      August 2012       Removed Documentation Changes 1-17. Added Documentation Changes 1-28.
-038      January 2013      Removed Documentation Changes 1-28. Added Documentation Changes 1-22.
-039      June 2013         Removed Documentation Changes 1-22. Added Documentation Changes 1-17.
-040      September 2013    Removed Documentation Changes 1-17. Added Documentation Changes 1-24.
-041      February 2014     Removed Documentation Changes 1-24. Added Documentation Changes 1-20.
-042      February 2014     Removed Documentation Changes 1-20. Added Documentation Changes 1-8.
-043      June 2014         Removed Documentation Changes 1-8. Added Documentation Changes 1-43.
-044      September 2014    Removed Documentation Changes 1-43. Added Documentation Changes 1-12.
-045      January 2015      Removed Documentation Changes 1-12. Added Documentation Changes 1-22.
-046      April 2015        Removed Documentation Changes 1-22. Added Documentation Changes 1-25.
-047      June 2015         Removed Documentation Changes 1-25. Added Documentation Changes 1-19.
-048      September 2015    Removed Documentation Changes 1-19. Added Documentation Changes 1-33.
-049      December 2015     Removed Documentation Changes 1-33. Added Documentation Changes 1-33.
-050      April 2016        Removed Documentation Changes 1-33. Added Documentation Changes 1-9.
-051      June 2016         Removed Documentation Changes 1-9. Added Documentation Changes 1-20.
-052      September 2016    Removed Documentation Changes 1-20. Added Documentation Changes 1-22.
-053      December 2016     Removed Documentation Changes 1-22. Added Documentation Changes 1-26.
-054      March 2017        Removed Documentation Changes 1-26. Added Documentation Changes 1-20.
-055      July 2017         Removed Documentation Changes 1-20. Added Documentation Changes 1-28.
-056      October 2017      Removed Documentation Changes 1-28. Added Documentation Changes 1-18.
-057      December 2017     Removed Documentation Changes 1-18. Added Documentation Changes 1-29.
-058      March 2018        Removed Documentation Changes 1-29. Added Documentation Changes 1-17.
-059      May 2018          Removed Documentation Changes 1-17. Added Documentation Changes 1-24.
-060      November 2018     Removed Documentation Changes 1-24. Added Documentation Changes 1-23.
-061      January 2019      Removed Documentation Changes 1-23. Added Documentation Changes 1-21.
-062      May 2019          Removed Documentation Changes 1-21. Added Documentation Changes 1-28.
-063      October 2019      Removed Documentation Changes 1-28. Added Documentation Changes 1-34.
-064      May 2020          Removed Documentation Changes 1-34. Added Documentation Changes 1-36.
-065      November 2020     Removed Documentation Changes 1-36. Added Documentation Changes 1-31.
-066      April 2021        Removed Documentation Changes 1-31. Added Documentation Changes 1-24.
-067      June 2021         Removed Documentation Changes 1-24. Added Documentation Changes 1-30.
-068      December 2021     Removed Documentation Changes 1-30. Added Documentation Changes 1-29.
-069      April 2022        Removed Documentation Changes 1-29. Added Documentation Changes 1-18.
-070      December 2022     Removed Documentation Changes 1-18. Added Documentation Changes 1-41.
-071      March 2023        Removed Documentation Changes 1-41. Added Documentation Changes 1-23.
-072      June 2023         Removed Documentation Changes 1-23. Added Documentation Changes 1-19.
-073      September 2023    Removed Documentation Changes 1-19. Added Documentation Changes 1-19.
-074      December 2023     Removed Documentation Changes 1-19. Added Documentation Changes 1-20.
-075      March 2024        Removed Documentation Changes 1-20. Added Documentation Changes 1-20.
-076      June 2024         Removed Documentation Changes 1-20. Added Documentation Changes 1-8.
-077      October 2024      Removed Documentation Changes 1-8. Added Documentation Changes 1-27.


Preface
This document is an update to the specifications contained in the Affected Documents table below. It is a
compilation of device and documentation errata, specification clarifications, and changes, and is intended for
hardware system manufacturers and software developers of applications, operating systems, or tools.

Affected Documents

Document Title                                                                                              Document Number/Location

Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture                 253665
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-L    253666
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B: Instruction Set Reference, M-U    253667
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2C: Instruction Set Reference, V      326018
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2D: Instruction Set Reference, W-Z    334569
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1  253668
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2  253669
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C: System Programming Guide, Part 3  326019
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3D: System Programming Guide, Part 4  332831
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers           335592

Nomenclature
Documentation Changes include typos, errors, or omissions from the current published specifications. These
will be incorporated in any new release of the specification.



Summary Tables of Changes
The following table indicates the documentation changes that apply to the Intel® 64 and IA-32 architectures. This
table uses the following notation:

Codes Used in Summary Tables


A violet change bar to the left of a table row indicates that the erratum is either new or modified from the previous
version of the document.

Documentation Changes (Sheet 1 of 2)
No. DOCUMENTATION CHANGES

1 Updates to Chapter 1, Volume 1


2 Updates to Chapter 2, Volume 1
3 Updates to Chapter 5, Volume 1
4 Updates to Chapter 16, Volume 1
5 Updates to Chapter 2, Volume 2A
6 Updates to Chapter 3, Volume 2A
7 Updates to Chapter 4, Volume 2B
8 Updates to Chapter 5, Volume 2C
9 Updates to Chapter 6, Volume 2D
10 Updates to Chapter 1, Volume 3A
11 Updates to Chapter 2, Volume 3A
12 Updates to Chapter 3, Volume 3A
13 New Chapter 4, Volume 3A
14 Updates to Chapter 5, Volume 3A
15 Updates to Chapter 7, Volume 3A
16 Updates to Chapter 8, Volume 3A
17 Updates to Chapter 10, Volume 3A
18 Updates to Chapter 18, Volume 3B
19 Updates to Chapter 20, Volume 3B
20 Updates to Chapter 21, Volume 3B
21 Updates to Chapter 26, Volume 3C
22 Updates to Chapter 27, Volume 3C
23 Updates to Chapter 28, Volume 3C
24 Updates to Chapter 29, Volume 3C



Documentation Changes (Sheet 2 of 2)
No. DOCUMENTATION CHANGES

25 Updates to Chapter 34, Volume 3C


26 Updates to Appendix C, Volume 3D
27 Updates to Chapter 2, Volume 4



Documentation Changes
Changes to the Intel® 64 and IA-32 Architectures Software Developer’s Manual volumes follow, and are listed
by chapter. Only chapters with changes are included in this document.



1. Updates to Chapter 1, Volume 1
Change bars and violet text show changes to Chapter 1 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added Intel® Xeon® 6 E-core, Intel® Xeon® 6 P-core, and Intel® Series 2 Core™ Ultra processor information
to Section 1.1, “Intel® 64 and IA-32 Processors Covered in this Manual.”
• Updated Section 1.2, “Overview of Volume 1: Basic Architecture,” with the newly added Chapter 16, and
renumbered the remaining chapters in the volume.



CHAPTER 1
ABOUT THIS MANUAL

The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665) is part of a set that describes the architecture and programming environment of Intel® 64 and IA-32
architecture processors. Other volumes in this set are:
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018, and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D: System
Programming Guide (order numbers 253668, 253669, 326019, and 332831).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers (order
number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C, & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C, & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B, addresses the programming environment for classes of software that host operating systems. The
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, describes the model-specific registers
of Intel 64 and IA-32 processors.

1.1 INTEL® 64 AND IA-32 PROCESSORS COVERED IN THIS MANUAL


This manual set includes information pertaining primarily to the most recent Intel 64 and IA-32 processors, which
include:
• Pentium® processors
• P6 family processors
• Pentium® 4 processors
• Pentium® M processors
• Intel® Xeon® processors
• Pentium® D processors
• Pentium® processor Extreme Editions
• 64-bit Intel® Xeon® processors
• Intel® Core™ Duo processor
• Intel® Core™ Solo processor
• Dual-Core Intel® Xeon® processor LV
• Intel® Core™ 2 Duo processor
• Intel® Core™ 2 Quad processor Q6000 series
• Intel® Xeon® processor 3000, 3200 series
• Intel® Xeon® processor 5000 series
• Intel® Xeon® processor 5100, 5300 series
• Intel® Core™ 2 Extreme processor X7000 and X6800 series
• Intel® Core™ 2 Extreme processor QX6000 series
• Intel® Xeon® processor 7100 series


• Intel® Pentium® Dual-Core processor


• Intel® Xeon® processor 7200, 7300 series
• Intel® Xeon® processor 5200, 5400, 7400 series
• Intel® Core™ 2 Extreme processor QX9000 and X9000 series
• Intel® Core™ 2 Quad processor Q9000 series
• Intel® Core™ 2 Duo processor E8000, T9000 series
• Intel Atom® processor family
• Intel Atom® processors 200, 300, D400, D500, D2000, N200, N400, N2000, E2000, Z500, Z600, Z2000,
C1000 series are built from 45 nm and 32 nm processes
• Intel® Core™ i7 processor
• Intel® Core™ i5 processor
• Intel® Xeon® processor E7-8800/4800/2800 product families
• Intel® Core™ i7-3930K processor
• 2nd generation Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series
• Intel® Xeon® processor E3-1200 product family
• Intel® Xeon® processor E5-2400/1400 product family
• Intel® Xeon® processor E5-4600/2600/1600 product family
• 3rd generation Intel® Core™ processors
• Intel® Xeon® processor E3-1200 v2 product family
• Intel® Xeon® processor E5-2400/1400 v2 product families
• Intel® Xeon® processor E5-4600/2600/1600 v2 product families
• Intel® Xeon® processor E7-8800/4800/2800 v2 product families
• 4th generation Intel® Core™ processors
• The Intel® Core™ M processor family
• Intel® Core™ i7-59xx Processor Extreme Edition
• Intel® Core™ i7-49xx Processor Extreme Edition
• Intel® Xeon® processor E3-1200 v3 product family
• Intel® Xeon® processor E5-2600/1600 v3 product families
• 5th generation Intel® Core™ processors
• Intel® Xeon® processor D-1500 product family
• Intel® Xeon® processor E5 v4 family
• Intel Atom® processor X7-Z8000 and X5-Z8000 series
• Intel Atom® processor Z3400 series
• Intel Atom® processor Z3500 series
• 6th generation Intel® Core™ processors
• Intel® Xeon® processor E3-1500m v5 product family
• 7th generation Intel® Core™ processors
• Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series
• Intel® Xeon® Scalable Processor Family
• 8th generation Intel® Core™ processors
• Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series
• Intel® Xeon® E processors
• 9th generation Intel® Core™ processors
• 2nd generation Intel® Xeon® Scalable Processor Family


• 10th generation Intel® Core™ processors


• 11th generation Intel® Core™ processors
• 3rd generation Intel® Xeon® Scalable Processor Family
• 12th generation Intel® Core™ processors
• 13th generation Intel® Core™ processors
• 4th generation Intel® Xeon® Scalable Processor Family
• 5th generation Intel® Xeon® Scalable Processor Family
• Intel® Core™ Ultra 7 processors
• Intel® Xeon® 6 E-Core processors
• Intel® Xeon® 6 P-Core processors
• Intel® Series 2 Core™ Ultra processors
P6 family processors are IA-32 processors based on the P6 family microarchitecture. This includes the Pentium®
Pro, Pentium® II, Pentium® III, and Pentium® III Xeon® processors.
The Pentium® 4, Pentium® D, and Pentium® processor Extreme Editions are based on the Intel NetBurst® microar-
chitecture. Most early Intel® Xeon® processors are based on the Intel NetBurst® microarchitecture. Intel Xeon
processor 5000, 7100 series are based on the Intel NetBurst® microarchitecture.
The Intel® Core™ Duo, Intel® Core™ Solo and dual-core Intel® Xeon® processor LV are based on an improved
Pentium® M processor microarchitecture.
The Intel® Xeon® processor 3000, 3200, 5100, 5300, 7200, and 7300 series, Intel® Pentium® dual-core, Intel®
Core™ 2 Duo, Intel® Core™ 2 Quad, and Intel® Core™ 2 Extreme processors are based on Intel® Core™ microar-
chitecture.
The Intel® Xeon® processor 5200, 5400, 7400 series, Intel® Core™ 2 Quad processor Q9000 series, and Intel®
Core™ 2 Extreme processors QX9000, X9000 series, Intel® Core™ 2 processor E8000 series are based on
Enhanced Intel® Core™ microarchitecture.
The Intel Atom® processors 200, 300, D400, D500, D2000, N200, N400, N2000, E2000, Z500, Z600, Z2000, and
C1000 series are based on the Intel Atom® microarchitecture and support Intel 64 architecture.
P6 family, Pentium® M, Intel® Core™ Solo, and Intel® Core™ Duo processors, the dual-core Intel® Xeon® processor
LV, and early generations of Pentium 4 and Intel Xeon processors support IA-32 architecture. The Intel® Atom™
processor Z5xx series supports IA-32 architecture.
The Intel® Xeon® processor 3000, 3200, 5000, 5100, 5200, 5300, 5400, 7100, 7200, 7300, 7400 series, Intel®
Core™ 2 Duo, Intel® Core™ 2 Extreme, Intel® Core™ 2 Quad processors, Pentium® D processors, Pentium® Dual-
Core processor, newer generations of Pentium 4 and Intel Xeon processor family support Intel® 64 architecture.
The Intel® Core™ i7 processor and Intel® Xeon® processor 3400, 5500, 7500 series are based on 45 nm Nehalem
microarchitecture. Westmere microarchitecture is a 32 nm version of the Nehalem microarchitecture. Intel®
Xeon® processor 5600 series, Intel Xeon processor E7 and various Intel Core i7, i5, i3 processors are based on the
Westmere microarchitecture. These processors support Intel 64 architecture.
The Intel® Xeon® processor E5 family, Intel® Xeon® processor E3-1200 family, Intel® Xeon® processor E7-
8800/4800/2800 product families, Intel® Core™ i7-3930K processor, and 2nd generation Intel® Core™ i7-2xxx,
Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series are based on the Sandy Bridge microarchitecture and
support Intel 64 architecture.
The Intel® Xeon® processor E7-8800/4800/2800 v2 product families, Intel® Xeon® processor E3-1200 v2 product
family and 3rd generation Intel® Core™ processors are based on the Ivy Bridge microarchitecture and support
Intel 64 architecture.
The Intel® Xeon® processor E5-4600/2600/1600 v2 product families, Intel® Xeon® processor E5-2400/1400 v2
product families and Intel® Core™ i7-49xx Processor Extreme Edition are based on the Ivy Bridge-E microarchitec-
ture and support Intel 64 architecture.
The Intel® Xeon® processor E3-1200 v3 product family and 4th Generation Intel® Core™ processors are based on
the Haswell microarchitecture and support Intel 64 architecture.


The Intel® Xeon® processor E5-2600/1600 v3 product families and the Intel® Core™ i7-59xx Processor Extreme
Edition are based on the Haswell-E microarchitecture and support Intel 64 architecture.
The Intel Atom® processor Z8000 series is based on the Airmont microarchitecture.
The Intel Atom® processor Z3400 series and the Intel Atom® processor Z3500 series are based on the Silvermont
microarchitecture.
The Intel® Core™ M processor family, 5th generation Intel® Core™ processors, Intel® Xeon® processor D-1500
product family and the Intel® Xeon® processor E5 v4 family are based on the Broadwell microarchitecture and
support Intel 64 architecture.
The Intel® Xeon® Scalable Processor Family, Intel® Xeon® processor E3-1500m v5 product family and 6th gener-
ation Intel® Core™ processors are based on the Skylake microarchitecture and support Intel 64 architecture.
The 7th generation Intel® Core™ processors are based on the Kaby Lake microarchitecture and support Intel 64
architecture.
The Intel Atom® processor C series, the Intel Atom® processor X series, the Intel® Pentium® processor J series,
the Intel® Celeron® processor J series, and the Intel® Celeron® processor N series are based on the Goldmont
microarchitecture.
The Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series is based on the Knights Landing microarchitecture and
supports Intel 64 architecture.
The Intel® Pentium® Silver processor series, the Intel® Celeron® processor J series, and the Intel® Celeron®
processor N series are based on the Goldmont Plus microarchitecture.
The 8th generation Intel® Core™ processors, 9th generation Intel® Core™ processors, and Intel® Xeon® E proces-
sors are based on the Coffee Lake microarchitecture and support Intel 64 architecture.
The Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series is based on the Knights Mill microarchitecture and
supports Intel 64 architecture.
The 2nd generation Intel® Xeon® Scalable Processor Family is based on the Cascade Lake product and supports
Intel 64 architecture.
Some 10th generation Intel® Core™ processors are based on the Ice Lake microarchitecture, and some are based
on the Comet Lake microarchitecture; both support Intel 64 architecture.
Some 11th generation Intel® Core™ processors are based on the Tiger Lake microarchitecture, and some are
based on the Rocket Lake microarchitecture; both support Intel 64 architecture.
Some 3rd generation Intel® Xeon® Scalable Processor Family processors are based on the Cooper Lake product,
and some are based on the Ice Lake microarchitecture; both support Intel 64 architecture.
The 12th generation Intel® Core™ processors are based on the Alder Lake performance hybrid architecture and
support Intel 64 architecture.
The 13th generation Intel® Core™ processors are based on the Raptor Lake performance hybrid architecture and
support Intel 64 architecture.
The 4th generation Intel® Xeon® Scalable Processor Family is based on Sapphire Rapids microarchitecture and
supports Intel 64 architecture.
The 5th generation Intel® Xeon® Scalable Processor Family is based on Emerald Rapids microarchitecture and
supports Intel 64 architecture.
The Intel® Core™ Ultra 7 processor is based on Meteor Lake performance hybrid architecture and supports Intel 64
architecture.
The Intel® Xeon® 6 E-core processor is based on Sierra Forest microarchitecture and supports Intel 64 architec-
ture.
The Intel® Xeon® 6 P-core processor is based on Granite Rapids microarchitecture and supports Intel 64 architec-
ture.
The Intel® Series 2 Core™ Ultra processor is based on Lunar Lake performance hybrid architecture and supports
Intel 64 architecture.


IA-32 architecture is the instruction set architecture and programming environment for Intel's 32-bit microproces-
sors. Intel® 64 architecture is the instruction set architecture and programming environment which is the superset
of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.

1.2 OVERVIEW OF VOLUME 1: BASIC ARCHITECTURE


A description of this manual’s content follows:
Chapter 1 — About This Manual. Gives an overview of all volumes of the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual. It also describes the notational conventions in these manuals and lists related Intel
manuals and documentation of interest to programmers and hardware designers.
Chapter 2 — Intel® 64 and IA-32 Architectures. Introduces the Intel 64 and IA-32 architectures along with
the families of Intel processors that are based on these architectures. It also gives an overview of the common
features found in these processors and brief history of the Intel 64 and IA-32 architectures.
Chapter 3 — Basic Execution Environment. Introduces the models of memory organization and describes the
register set used by applications.
Chapter 4 — Data Types. Describes the data types and addressing modes recognized by the processor; provides
an overview of real numbers and floating-point formats and of floating-point exceptions.
Chapter 5 — Instruction Set Summary. Lists all Intel 64 and IA-32 instructions, divided into technology groups.
Chapter 6 — Procedure Calls, Interrupts, and Exceptions. Describes the procedure stack and mechanisms
provided for making procedure calls and for servicing interrupts and exceptions.
Chapter 7 — Programming with General-Purpose Instructions. Describes basic load and store, program
control, arithmetic, and string instructions that operate on basic data types, general-purpose and segment regis-
ters; also describes system instructions that are executed in protected mode.
Chapter 8 — Programming with the x87 FPU. Describes the x87 floating-point unit (FPU), including floating-
point registers and data types; gives an overview of the floating-point instruction set and describes the processor's
floating-point exception conditions.
Chapter 9 — Programming with Intel® MMX™ Technology. Describes Intel MMX technology, including MMX
registers and data types; also provides an overview of the MMX instruction set.
Chapter 10 — Programming with Intel® Streaming SIMD Extensions (Intel® SSE). Describes SSE exten-
sions, including XMM registers, the MXCSR register, and packed single precision floating-point data types; provides
an overview of the SSE instruction set and gives guidelines for writing code that accesses the SSE extensions.
Chapter 11 — Programming with Intel® Streaming SIMD Extensions 2 (Intel® SSE2). Describes SSE2
extensions, including XMM registers and packed double precision floating-point data types; provides an overview
of the SSE2 instruction set and gives guidelines for writing code that accesses SSE2 extensions. This chapter also
describes SIMD floating-point exceptions that can be generated with SSE and SSE2 instructions. It also provides
general guidelines for incorporating support for SSE and SSE2 extensions into operating system and applications
code.
Chapter 12 — Programming with Intel® Streaming SIMD Extensions 3 (Intel® SSE3), Supplemental
Streaming SIMD Extensions 3 (SSSE3), Intel® Streaming SIMD Extensions 4 (Intel® SSE4) and Intel®
AES New Instructions (Intel® AES-NI). Provides an overview of the SSE3 instruction set, Supplemental SSE3,
SSE4, AESNI instructions, and guidelines for writing code that access these extensions.
Chapter 13 — Managing State Using the XSAVE Feature Set. Describes the XSAVE feature set instructions
and explains how software can enable the XSAVE feature set and XSAVE-enabled features.
Chapter 14 — Programming with Intel® AVX, FMA, and Intel® AVX2. Provides an overview of the Intel® AVX
instruction set, FMA, and Intel® AVX2 extensions and gives guidelines for writing code that access these exten-
sions.
Chapter 15 — Programming with Intel® AVX-512. Provides an overview of the Intel® AVX-512 instruction set
extensions and gives guidelines for writing code that access these extensions.
Chapter 16 — Programming with Intel® AVX10. Provides an overview of the Intel® AVX10 instruction set
extensions and gives guidelines for writing code that access these extensions.


Chapter 17 — Programming with Intel® Transactional Synchronization Extensions. Describes the instruc-
tion extensions that support lock elision techniques to improve the performance of multi-threaded software with
contended locks.
Chapter 18 — Control-flow Enforcement Technology. Provides an overview of the Control-flow Enforcement
Technology (CET) and gives guidelines for writing code that access these extensions.
Chapter 19 — Programming with Intel® Advanced Matrix Extensions. Provides an overview of the Intel®
Advanced Matrix Extensions and gives guidelines for writing code that access these extensions.
Chapter 20 — Input/Output. Describes the processor’s I/O mechanism, including I/O port addressing, I/O
instructions, and I/O protection mechanisms.
Chapter 21 — Processor Identification and Feature Determination. Describes how to determine the CPU
type and features available in the processor.
Appendix A — EFLAGS Cross-Reference. Summarizes how the IA-32 instructions affect the flags in the EFLAGS
register.
Appendix B — EFLAGS Condition Codes. Summarizes how conditional jump, move, and ‘byte set on condition
code’ instructions use condition code flags (OF, CF, ZF, SF, and PF) in the EFLAGS register.
Appendix C — Floating-Point Exceptions Summary. Summarizes exceptions raised by the x87 FPU floating-
point and SSE/SSE2/SSE3 floating-point instructions.
Appendix D — Guidelines for Writing SIMD Floating-Point Exception Handlers. Gives guidelines for writing
exception handlers for exceptions generated by SSE/SSE2/SSE3 floating-point instructions.
Appendix E — Intel® Memory Protection Extensions. Provides an overview of the Intel® Memory Protection
Extensions, a feature that has been deprecated and will not be available on future processors.

1.3 NOTATIONAL CONVENTIONS


This manual uses specific notation for data-structure formats, for symbolic representation of instructions, and for
hexadecimal and binary numbers. This notation is described below.

1.3.1 Bit and Byte Order


In illustrations of data structures in memory, smaller addresses appear toward the bottom of the figure; addresses
increase toward the top. Bit positions are numbered from right to left. The numerical value of a set bit is equal to
two raised to the power of the bit position. Intel 64 and IA-32 processors are “little endian” machines; this means
the bytes of a word are numbered starting from the least significant byte. See Figure 1-1.

[Figure: a 32-bit-wide data structure drawn with the lowest address at the bottom and the highest address at the
top; byte offsets increase upward in 4-byte rows (0, 4, 8, 12, 16, 20, 24, 28), bit offsets run 0 to 31 from right to
left, and Byte 0 (bits 7:0) through Byte 3 (bits 31:24) occupy the doubleword at byte offset 0.]

Figure 1-1. Bit and Byte Order
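To make the byte ordering concrete, the following minimal C sketch (an illustrative addition, not part of the
manual) prints the bytes of a 32-bit word starting from the lowest address; on an Intel 64 or IA-32 processor the
least significant byte prints first.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t word = 0x0A0B0C0DU;   /* bytes 0A 0B 0C 0D, most significant first */
    const uint8_t *p = (const uint8_t *)&word;

    /* On a little-endian machine this prints 0D 0C 0B 0A: the least
       significant byte (byte 0) occupies the lowest address. */
    for (int i = 0; i < 4; i++)
        printf("byte offset %d: %02X\n", i, p[i]);
    return 0;
}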


1.3.2 Reserved Bits and Software Compatibility


In many register and memory layout descriptions, certain bits are marked as reserved. When bits are marked as
reserved, it is essential for compatibility with future processors that software treat these bits as having a future,
though unknown, effect. The behavior of reserved bits should be regarded as not only undefined, but unpredict-
able.
Software should follow these guidelines in dealing with reserved bits:
• Do not depend on the states of any reserved bits when testing the values of registers that contain such bits.
Mask out the reserved bits before testing.
• Do not depend on the states of any reserved bits when storing to memory or to a register.
• Do not depend on the ability to retain information written into any reserved bits.
• When loading a register, always load the reserved bits with the values indicated in the documentation, if any,
or reload them with values previously read from the same register.

NOTE
Avoid any software dependence upon the state of reserved bits in Intel 64 and IA-32 registers.
Depending upon the values of reserved register bits will make software dependent upon the
unspecified manner in which the processor handles these bits. Programs that depend upon
reserved values risk incompatibility with future processors.
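The guidelines above amount to a mask-and-merge pattern. The C sketch below illustrates it, assuming a
hypothetical 32-bit register with a defined field in bits 3:0 and reserved bits everywhere else; read_reg() and
write_reg() are placeholders for the actual register access.

#include <stdint.h>

#define FIELD_MASK 0x0000000FU   /* hypothetical defined field, bits 3:0 */

extern uint32_t read_reg(void);          /* placeholder register read  */
extern void write_reg(uint32_t value);   /* placeholder register write */

void set_field(uint32_t new_field)
{
    uint32_t value = read_reg();

    /* Test only the defined field; mask out the reserved bits first. */
    if ((value & FIELD_MASK) == (new_field & FIELD_MASK))
        return;

    /* Merge the new field value while writing back the reserved bits
       exactly as they were read. */
    value = (value & ~FIELD_MASK) | (new_field & FIELD_MASK);
    write_reg(value);
}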

1.3.2.1 Instruction Operands


When instructions are represented symbolically, a subset of the IA-32 assembly language is used. In this subset,
an instruction has the following format:

label: mnemonic argument1, argument2, argument3


where:
• A label is an identifier which is followed by a colon.
• A mnemonic is a reserved name for a class of instruction opcodes which have the same function.
• The operands argument1, argument2, and argument3 are optional. There may be from zero to three
operands, depending on the opcode. When present, they take the form of either literals or identifiers for data
items. Operand identifiers are either reserved names of registers or are assumed to be assigned to data items
declared in another part of the program (which may not be shown in the example).
When two operands are present in an arithmetic or logical instruction, the right operand is the source and the left
operand is the destination.
For example:

LOADREG: MOV EAX, SUBTOTAL


In this example, LOADREG is a label, MOV is the mnemonic identifier of an opcode, EAX is the destination operand,
and SUBTOTAL is the source operand. Some assembly languages put the source and destination in reverse order.

1.3.3 Hexadecimal and Binary Numbers


Base 16 (hexadecimal) numbers are represented by a string of hexadecimal digits followed by the character H (for
example, 0F82EH). A hexadecimal digit is a character from the following set: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D,
E, and F.
Base 2 (binary) numbers are represented by a string of 1s and 0s, sometimes followed by the character B (for
example, 1010B). The “B” designation is only used in situations where confusion as to the type of number might
arise.
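For readers moving between the manual's notation and C source, an illustrative mapping (not part of the original
text):

#include <assert.h>

int main(void)
{
    unsigned int h = 0xF82E;   /* 0F82EH in the manual's notation */
    unsigned int b = 0xA;      /* 1010B is binary 1010, decimal 10 */

    assert(h == 63534U);
    assert(b == 10U);
    return 0;
}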


1.3.4 Segmented Addressing


The processor uses byte addressing. This means memory is organized and accessed as a sequence of bytes.
Whether one or more bytes are being accessed, a byte address is used to locate the byte or bytes in memory. The
range of memory that can be addressed is called an address space.
The processor also supports segmented addressing. This is a form of addressing where a program may have many
independent address spaces, called segments. For example, a program can keep its code (instructions) and stack
in separate segments. Code addresses would always refer to the code space, and stack addresses would always
refer to the stack space. The following notation is used to specify a byte address within a segment:

Segment-register:Byte-address
For example, the following segment address identifies the byte at address FF79H in the segment pointed to by the
DS register:

DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS register points to the
code segment and the EIP register contains the address of the instruction.

CS:EIP
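As background (the computation below is not stated in this section): in real-address mode a
Segment-register:Byte-address pair maps to a linear address as segment * 16 + offset, while in protected mode
the segment register instead selects a descriptor that supplies the base address. A minimal C sketch of the
real-address-mode case:

#include <stdint.h>
#include <stdio.h>

/* Real-address-mode translation: linear = segment * 16 + offset. */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* With DS = 1234H, DS:FF79H names linear address 222B9H. */
    printf("%05X\n", real_mode_linear(0x1234, 0xFF79));
    return 0;
}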

1.3.5 A Syntax for CPUID, CR, and MSR Values


Obtain feature flags, status, and system information by using the CPUID instruction, by checking control register
bits, and by reading model-specific registers. See Figure 1-2 for details on the syntax that represents this informa-
tion.


The three forms of the syntax are annotated in Figure 1-2:

• CPUID input and output: CPUID.01H:EDX.SSE[bit 25] = 1. Here 01H is the input value for the EAX register;
EDX.SSE[bit 25] gives the output register and the feature flag or field name with its bit position(s); and = 1 gives
the value (or range) of the output.
• Control register values: CR4.OSFXSR[bit 9] = 1. Here CR4 is the example CR name; OSFXSR[bit 9] is the
feature flag or field name with its bit position(s); and = 1 gives the value (or range) of the output.
• Model-specific register values: IA32_MISC_ENABLE.ENABLEFOPCODE[bit 2] = 1. Here IA32_MISC_ENABLE is
the example MSR name; ENABLEFOPCODE[bit 2] is the feature flag or field name with its bit position(s); and = 1
gives the value (or range) of the output.

Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation
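A minimal C sketch that tests the first example above, CPUID.01H:EDX.SSE[bit 25]; it assumes a GCC- or
Clang-compatible toolchain providing <cpuid.h>.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Execute CPUID with the input value 01H in EAX, then test the
       SSE feature flag in output register EDX at bit position 25. */
    if (__get_cpuid(0x01, &eax, &ebx, &ecx, &edx) && (edx & (1U << 25)))
        printf("CPUID.01H:EDX.SSE[bit 25] = 1: SSE is supported\n");
    else
        printf("CPUID.01H:EDX.SSE[bit 25] = 0: SSE not reported\n");
    return 0;
}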

1.3.6 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to
divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other condi-
tions. Some types of exceptions may provide error codes. An error code reports additional information about the
error. An example of the notation used to show an exception and error code is shown below:

#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a type of fault is
reported. Under some conditions, exceptions that produce error codes may not be able to report an accurate code.
In this case, the error code is zero, as shown below for a general-protection exception:

#GP(0)


1.4 RELATED LITERATURE


Literature related to Intel 64 and IA-32 processors is listed and viewable on-line at:
https://software.intel.com/en-us/articles/intel-sdm
See also:
• The latest security information on Intel® products:
https://www.intel.com/content/www/us/en/security-center/default.html
• Software developer resources, guidance, and insights for security advisories:
https://software.intel.com/security-software-guidance/
• The data sheet for a particular Intel 64 or IA-32 processor
• The specification update for a particular Intel 64 or IA-32 processor
• Intel® C++ Compiler documentation and online help:
http://software.intel.com/en-us/articles/intel-compilers/
• Intel® Fortran Compiler documentation and online help:
http://software.intel.com/en-us/articles/intel-compilers/
• Intel® Software Development Tools:
https://software.intel.com/en-us/intel-sdp-home
• Intel® 64 and IA-32 Architectures Software Developer’s Manual (in one, four or ten volumes):
https://software.intel.com/en-us/articles/intel-sdm
• Intel® 64 and IA-32 Architectures Optimization Reference Manual:
https://software.intel.com/en-us/articles/intel-sdm#optimization
• Intel® Trusted Execution Technology Measured Launched Environment Programming Guide:
http://www.intel.com/content/www/us/en/software-developers/intel-txt-software-development-guide.html
• Intel® Software Guard Extensions (Intel® SGX) Information:
https://software.intel.com/en-us/isa-extensions/intel-sgx
• Developing Multi-threaded Applications: A Platform Consistent Approach:
https://software.intel.com/sites/default/files/article/147714/51534-developing-multithreaded-applications.pdf
• Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon® Processor:
https://software.intel.com/sites/default/files/22/30/25602
• Performance Monitoring Unit Sharing Guide:
http://software.intel.com/file/30388
Literature related to select features in future Intel processors are available at:
• Intel® Architecture Instruction Set Extensions Programming Reference:
https://software.intel.com/en-us/isa-extensions
More relevant links are:
• Intel® Developer Zone:
https://software.intel.com/en-us
• Developer centers:
http://www.intel.com/content/www/us/en/hardware-developers/developer-centers.html
• Processor support general link:
http://www.intel.com/support/processors/
• Intel® Hyper-Threading Technology (Intel® HT Technology):
http://www.intel.com/technology/platform-technology/hyper-threading/index.htm

2. Updates to Chapter 2, Volume 1
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added Sub-Page Permissions to Table 2-4 in Section 2.4, “Planned Removal of Intel® Instruction Set Archi-
tecture and Features from Upcoming Products.”



CHAPTER 2
INTEL® 64 AND IA-32 ARCHITECTURES

2.1 BRIEF HISTORY OF INTEL® 64 AND IA-32 ARCHITECTURES


The following sections provide a summary of the major technical evolutions from IA-32 to Intel 64 architecture:
starting from the Intel 8086 processor to the latest Intel® Core™ 2 Duo, Core 2 Quad, and Intel Xeon processor
5300 and 7300 series. Object code created for processors released as early as 1978 still executes on the latest
processors in the Intel 64 and IA-32 architecture families.

2.1.1 16-Bit Processors and Segmentation (1978)


The IA-32 architecture family was preceded by 16-bit processors, the 8086 and 8088. The 8086 has 16-bit regis-
ters and a 16-bit external data bus, with 20-bit addressing giving a 1-MByte address space. The 8088 is similar to
the 8086 except it has an 8-bit external data bus.
The 8086/8088 introduced segmentation to the IA-32 architecture. With segmentation, a 16-bit segment register
contains a pointer to a memory segment of up to 64 KBytes. Using four segment registers at a time, 8086/8088
processors are able to address up to 256 KBytes without switching between segments. The 20-bit addresses that
can be formed using a segment register and an additional 16-bit pointer provide a total address range of 1 MByte.

2.1.2 The Intel® 286 Processor (1982)


The Intel 286 processor introduced protected mode operation into the IA-32 architecture. Protected mode uses the
segment register content as selectors or pointers into descriptor tables. Descriptors provide 24-bit base addresses
with a physical memory size of up to 16 MBytes, support for virtual memory management on a segment swapping
basis, and a number of protection mechanisms. These mechanisms include:
• Segment limit checking.
• Read-only and execute-only segment options.
• Four privilege levels.

2.1.3 The Intel386™ Processor (1985)


The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. It introduced 32-bit registers
for use both to hold operands and for addressing. The lower half of each 32-bit Intel386 register retains the prop-
erties of the 16-bit registers of earlier generations, permitting backward compatibility. The processor also provides
a virtual-8086 mode that allows for even greater efficiency when executing programs created for 8086/8088
processors.
In addition, the Intel386 processor has support for:
• A 32-bit address bus that supports up to 4-GBytes of physical memory.
• A segmented-memory model and a flat memory model.
• Paging, with a fixed 4-KByte page size providing a method for virtual memory management.
• Support for parallel stages.

2.1.4 The Intel486™ Processor (1989)


The Intel486™ processor added more parallel execution capability by expanding the Intel386 processor’s instruc-
tion decode and execution units into five pipelined stages. Each stage operates in parallel with the others on up to
five instructions in different stages of execution.


In addition, the processor added:


• An 8-KByte on-chip first-level cache that increased the percent of instructions that could execute at the scalar
rate of one per clock.
• An integrated x87 FPU.
• Power saving and system management capabilities.

2.1.5 The Intel® Pentium® Processor (1993)


The introduction of the Intel Pentium processor added a second execution pipeline to achieve superscalar perfor-
mance (two pipelines, known as u and v, together can execute two instructions per clock). The on-chip first-level
cache doubled, with 8 KBytes devoted to code and another 8 KBytes devoted to data. The data cache uses the MESI
protocol to support more efficient write-back cache in addition to the write-through cache previously used by the
Intel486 processor. Branch prediction with an on-chip branch table was added to increase performance in looping
constructs.
In addition, the processor added:
• Extensions to make the virtual-8086 mode more efficient and allow for 4-MByte as well as 4-KByte pages.
• Internal data paths of 128 and 256 bits add speed to internal data transfers.
• Burstable external data bus was increased to 64 bits.
• An APIC to support systems with multiple processors.
• A dual processor mode to support glueless two processor systems.
A subsequent stepping of the Pentium family introduced Intel MMX technology (the Pentium Processor with MMX
technology). Intel MMX technology uses the single-instruction, multiple-data (SIMD) execution model to perform
parallel computations on packed integer data contained in 64-bit registers.
See Section 2.2.7, “SIMD Instructions.”

2.1.6 The P6 Family of Processors (1995—1999)


The P6 family of processors was based on a superscalar microarchitecture that set new performance standards; see
also Section 2.2.1, “P6 Family Microarchitecture.” One of the goals in the design of the P6 family microarchitecture
was to exceed the performance of the Pentium processor significantly while using the same 0.6-micrometer, four-
layer, metal BICMOS manufacturing process. Members of this family include the following:
• The Intel Pentium Pro processor is three-way superscalar. Using parallel processing techniques, the
processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per
clock cycle. The Pentium Pro introduced the dynamic execution (micro-data flow analysis, out-of-order
execution, superior branch prediction, and speculative execution) in a superscalar implementation. The
processor was further enhanced by its caches. It has the same two on-chip 8-KByte 1st-Level caches as the
Pentium processor and an additional 256-KByte Level 2 cache in the same package as the processor.
• The Intel Pentium II processor added Intel MMX technology to the P6 family processors along with new
packaging and several hardware enhancements. The processor core is packaged in the single edge contact
cartridge (SECC). The Level l data and instruction caches were enlarged to 16 KBytes each, and Level 2 cache
sizes of 256 KBytes, 512 KBytes, and 1 MBytes are supported. A half-frequency backside bus connects the
Level 2 cache to the processor. Multiple low-power states such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep
are supported to conserve power when idling.
• The Pentium II Xeon processor combined the premium characteristics of previous generations of Intel
processors. This includes: 4-way, 8-way (and up) scalability and a 2 MBytes 2nd-Level cache running on a full-
frequency backside bus.
• The Intel Celeron processor family focused on the value PC market segment. Its introduction offers an
integrated 128 KBytes of Level 2 cache and a plastic pin grid array (P.P.G.A.) form factor to lower system design
cost.
• The Intel Pentium III processor introduced the Streaming SIMD Extensions (SSE) to the IA-32 architecture.
SSE extensions expand the SIMD execution model introduced with the Intel MMX technology by providing a


new set of 128-bit registers and the ability to perform SIMD operations on packed single precision floating-
point values. See Section 2.2.7, “SIMD Instructions.”
• The Pentium III Xeon processor extended the performance levels of the IA-32 processors with the
enhancement of a full-speed, on-die, and Advanced Transfer Cache.

2.1.7 The Intel® Pentium® 4 Processor Family (2000—2006)


The Intel Pentium 4 processor family is based on Intel NetBurst microarchitecture; see Section 2.2.2, “Intel
NetBurst® Microarchitecture.”
The Intel Pentium 4 processor introduced Streaming SIMD Extensions 2 (SSE2); see Section 2.2.7, “SIMD Instruc-
tions.” The Intel Pentium 4 processor 3.40 GHz, supporting Hyper-Threading Technology introduced Streaming
SIMD Extensions 3 (SSE3); see Section 2.2.7, “SIMD Instructions.”
Intel 64 architecture was introduced in the Intel Pentium 4 Processor Extreme Edition supporting Hyper-Threading
Technology and in the Intel Pentium 4 Processor 6xx and 5xx sequences.
Intel® Virtualization Technology (Intel® VT) was introduced in the Intel Pentium 4 processor 672 and 662.

2.1.8 The Intel® Xeon® Processor (2001—2007)


Intel Xeon processors (with the exception of the dual-core Intel Xeon processor LV and the Intel Xeon processor 5100 series) are
based on the Intel NetBurst microarchitecture; see Section 2.2.2, “Intel NetBurst® Microarchitecture.” As a family,
this group of IA-32 processors (more recently Intel 64 processors) is designed for use in multi-processor server
systems and high-performance workstations.
The Intel Xeon processor MP introduced support for Intel® Hyper-Threading Technology; see Section 2.2.8, “Intel®
Hyper-Threading Technology.”
The 64-bit Intel Xeon processor 3.60 GHz (with an 800 MHz System Bus) was used to introduce Intel 64 architec-
ture. The Dual-Core Intel Xeon processor includes dual core technology. The Intel Xeon processor 70xx series
includes Intel Virtualization Technology.
The Intel Xeon processor 5100 series introduces power-efficient, high performance Intel Core microarchitecture.
This processor is based on Intel 64 architecture; it includes Intel Virtualization Technology and dual-core tech-
nology. The Intel Xeon processor 3000 series are also based on Intel Core microarchitecture. The Intel Xeon
processor 5300 series introduces four processor cores in a physical package, they are also based on Intel Core
microarchitecture.

2.1.9 The Intel® Pentium® M Processor (2003—2006)


The Intel Pentium M processor family is a high performance, low power mobile processor family with microarchitec-
tural enhancements over previous generations of IA-32 Intel mobile processors. This family is designed for
extending battery life and seamless integration with platform innovations that enable new usage models (such as
extended mobility, ultra thin form-factors, and integrated wireless networking).
Its enhanced microarchitecture includes:
• Support for Intel Architecture with Dynamic Execution.
• A high performance, low-power core manufactured using Intel’s advanced process technology with copper
interconnect.
• On-die, primary 32-KByte instruction cache and 32-KByte write-back data cache.
• On-die, second-level cache (up to 2 MByte) with Advanced Transfer Cache Architecture.
• Advanced Branch Prediction and Data Prefetch Logic.
• Support for MMX technology, Streaming SIMD instructions, and the SSE2 instruction set.
• A 400 or 533 MHz, Source-Synchronous Processor System Bus.
• Advanced power management using Enhanced Intel SpeedStep® technology.


2.1.10 The Intel® Pentium® Processor Extreme Edition (2005)


The Intel Pentium processor Extreme Edition introduced dual-core technology. This technology provides advanced
hardware multi-threading support. The processor is based on Intel NetBurst microarchitecture and supports Intel
SSE, SSE2, SSE3, Intel Hyper-Threading Technology, and Intel 64 architecture.
See also:
• Section 2.2.2, “Intel NetBurst® Microarchitecture.”
• Section 2.2.3, “Intel® Core™ Microarchitecture.”
• Section 2.2.7, “SIMD Instructions.”
• Section 2.2.8, “Intel® Hyper-Threading Technology.”
• Section 2.2.9, “Multi-Core Technology.”
• Section 2.2.10, “Intel® 64 Architecture.”

2.1.11 The Intel® Core™ Duo and Intel® Core™ Solo Processors (2006—2007)
The Intel Core Duo processor offers power-efficient, dual-core performance with a low-power design that extends
battery life. This family and the single-core Intel Core Solo processor offer microarchitectural enhancements over
Pentium M processor family.
Its enhanced microarchitecture includes:
• Intel® Smart Cache which allows for efficient data sharing between two processor cores.
• Improved decoding and SIMD execution.
• Intel® Dynamic Power Coordination and Enhanced Intel® Deeper Sleep to reduce power consumption.
• Intel® Advanced Thermal Manager which features digital thermal sensor interfaces.
• Support for power-optimized 667 MHz bus.
The dual-core Intel Xeon processor LV is based on the same microarchitecture as Intel Core Duo processor, and
supports IA-32 architecture.

2.1.12 The Intel® Xeon® Processor 5100, 5300 Series, and Intel® Core™ 2 Processor Family
(2006)
The Intel Xeon processor 3000, 3200, 5100, 5300, and 7300 series, Intel Pentium Dual-Core, Intel Core 2 Extreme,
Intel Core 2 Quad processors, and Intel Core 2 Duo processor family support Intel 64 architecture; they are based
on the high-performance, power-efficient Intel® Core microarchitecture built on 65 nm process technology. The
Intel Core microarchitecture includes the following innovative features:
• Intel® Wide Dynamic Execution to increase performance and execution throughput.
• Intel® Intelligent Power Capability to reduce power consumption.
• Intel® Advanced Smart Cache which allows for efficient data sharing between two processor cores.
• Intel® Smart Memory Access to increase data bandwidth and hide latency of memory accesses.
• Intel® Advanced Digital Media Boost which improves application performance using multiple generations of
Streaming SIMD extensions.
The Intel Xeon processor 5300 series, Intel Core 2 Extreme processor QX6800 series, and Intel Core 2 Quad
processors support Intel quad-core technology.

2.1.13 The Intel® Xeon® Processor 5200, 5400, 7400 Series, and Intel® Core™ 2 Processor
Family (2007)
The Intel Xeon processor 5200, 5400, and 7400 series, Intel Core 2 Quad processor Q9000 Series, and Intel Core
2 Duo processor E8000 series support Intel 64 architecture; they are based on the Enhanced Intel® Core
microarchitecture using 45 nm process technology. The Enhanced Intel Core microarchitecture provides the
following improved features:
• A radix-16 divider and faster OS primitives that further increase the performance of Intel® Wide Dynamic
Execution.
• Improved Intel® Advanced Smart Cache with up to a 50% larger level-two cache and up to a 50% increase in
way-set associativity.
• A 128-bit shuffler engine that significantly improves the performance of Intel® Advanced Digital Media Boost
and SSE4.
The Intel Xeon processor 5400 series and the Intel Core 2 Quad processor Q9000 Series support Intel quad-core
technology. The Intel Xeon processor 7400 series offers up to six processor cores and an L3 cache up to 16 MBytes.

2.1.14 The Intel Atom® Processor Family (2008)


The first generation of Intel Atom® processors are built on 45 nm process technology. They are based on a new
microarchitecture, Intel Atom® microarchitecture, which is optimized for ultra low power devices. The Intel Atom®
microarchitecture features two in-order execution pipelines that minimize power consumption, increase battery
life, and enable ultra-small form factors. The initial Intel Atom processor family and subsequent generations,
including the Intel Atom processor D2000, N2000, E2000, Z2000, and C1000 series, provide the following features:
• Enhanced Intel® SpeedStep® Technology.
• Intel® Hyper-Threading Technology.
• Deep Power Down Technology with Dynamic Cache Sizing.
• Support for instruction set extensions up to and including Supplemental Streaming SIMD Extensions 3
(SSSE3).
• Support for Intel® Virtualization Technology.
• Support for Intel® 64 Architecture (excluding Intel Atom processor Z5xx Series).

2.1.15 The Intel Atom® Processor Family Based on Silvermont Microarchitecture (2013)
The Intel Atom processor C2xxx, E3xxx, and S1xxx series are based on the Silvermont microarchitecture.
Processors based on the Silvermont microarchitecture support instruction set extensions up to and including
SSE4.2, AESNI, and PCLMULQDQ.

2.1.16 The Intel® Core™ i7 Processor Family (2008)


The Intel Core i7 processor 900 series supports Intel 64 architecture, and is based on Nehalem microarchitecture
using 45 nm process technology. The Intel Core i7 processor and Intel Xeon processor 5500 series include the
following features:
• Intel® Turbo Boost Technology converts thermal headroom into higher performance.
• Intel® Hyper-Threading Technology in conjunction with quad-core technology to provide four cores and eight threads.
• Dedicated power control unit to reduce active and idle power consumption.
• Integrated memory controller on the processor supporting three channels of DDR3 memory.
• 8 MB inclusive Intel® Smart Cache.
• Intel® QuickPath Interconnect (QPI) providing a point-to-point link to the chipset.
• Support for SSE4.2 and SSE4.1 instruction sets.
• Second generation Intel Virtualization Technology.

2.1.17 The Intel® Xeon® Processor 7500 Series (2010)


The Intel Xeon processor 7500 and 6500 series are based on Nehalem microarchitecture using 45 nm process tech-
nology. These processors support the same features described in Section 2.1.16, plus the following features:


• Up to eight cores per physical processor package.


• Up to 24 MB inclusive Intel® Smart Cache.
• Provides Intel® Scalable Memory Interconnect (Intel® SMI) channels with Intel® 7500 Scalable Memory Buffer
to connect to system memory.
• Advanced RAS supporting software recoverable machine check architecture.

2.1.18 2010 Intel® Core™ Processor Family (2010)


The 2010 Intel Core processor family spans Intel Core i7, i5, and i3 processors. These processors are based on
Westmere microarchitecture using 32 nm process technology. The features can include:
• Smart performance using Intel Hyper-Threading Technology plus Intel Turbo Boost Technology.
• Enhanced Intel Smart Cache and integrated memory controller.
• Intelligent power gating.
• Repartitioned platform with on-die integration of 45 nm integrated graphics.
• Range of instruction set support up to AESNI, PCLMULQDQ, SSE4.2 and SSE4.1.

2.1.19 The Intel® Xeon® Processor 5600 Series (2010)


The Intel Xeon processor 5600 series are based on Westmere microarchitecture using 32 nm process technology.
They support the same features described in Section 2.1.16, plus the following features:
• Up to six cores per physical processor package.
• Up to 12 MB enhanced Intel® Smart Cache.
• Support for AESNI, PCLMULQDQ, SSE4.2 and SSE4.1 instruction sets.
• Flexible Intel Virtualization Technologies across processor and I/O.

2.1.20 The Second Generation Intel® Core™ Processor Family (2011)


The Second Generation Intel Core processor family spans Intel Core i7, i5, and i3 processors based on the Sandy
Bridge microarchitecture. These processors are built from 32 nm process technology and have features including:
• Intel Turbo Boost Technology for Intel Core i5 and i7 processors.
• Intel Hyper-Threading Technology.
• Enhanced Intel Smart Cache and integrated memory controller.
• Processor graphics and built-in visual features such as Intel® Quick Sync Video and Intel® Insider™.
• Range of instruction set support up to AVX, AESNI, PCLMULQDQ, SSE4.2 and SSE4.1.
The Intel Xeon processor E3-1200 product family is also based on the Sandy Bridge microarchitecture.
The Intel Xeon processor E5-2400/1400 product families are based on the Sandy Bridge-EP microarchitecture.
The Intel Xeon processor E5-4600/2600/1600 product families are based on the Sandy Bridge-EP microarchitec-
ture and provide support for multiple sockets.

2.1.21 The Third Generation Intel® Core™ Processor Family (2012)


The Third Generation Intel Core processor family spans Intel Core i7, i5, and i3 processors based on the Ivy Bridge
microarchitecture. The Intel Xeon processor E7-8800/4800/2800 v2 product families and Intel Xeon processor E3-
1200 v2 product family are also based on the Ivy Bridge microarchitecture.
The Intel Xeon processor E5-2400/1400 v2 product families are based on the Ivy Bridge-EP microarchitecture.
The Intel Xeon processor E5-4600/2600/1600 v2 product families are based on the Ivy Bridge-EP microarchitec-
ture and provide support for multiple sockets.


2.1.22 The Fourth Generation Intel® Core™ Processor Family (2013)


The Fourth Generation Intel Core processor family spans Intel Core i7, i5, and i3 processors based on the Haswell
microarchitecture. Intel Xeon processor E3-1200 v3 product family is also based on the Haswell microarchitecture.

2.2 MORE ON SPECIFIC ADVANCES


The following sections provide more information on major innovations.

2.2.1 P6 Family Microarchitecture


The Pentium Pro processor introduced a new microarchitecture commonly referred to as P6 processor microarchi-
tecture. The P6 processor microarchitecture was later enhanced with an on-die, Level 2 cache, called Advanced
Transfer Cache.
The microarchitecture is a three-way superscalar, pipelined architecture. Three-way superscalar means that by
using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution
of (retire) three instructions per clock cycle. To handle this level of instruction throughput, the P6 processor family
uses a decoupled, 12-stage superpipeline that supports out-of-order instruction execution.
Figure 2-1 shows a conceptual view of the P6 processor microarchitecture pipeline with the Advanced Transfer
Cache enhancement.

[Block diagram: the bus unit connects the system bus to the on-die, 8-way second level cache and the low-latency,
4-way first level cache; the front end (fetch/decode with instruction cache and microcode ROM) feeds the
out-of-order execution core and retirement unit; BTBs/branch prediction logic receives branch history updates.
Frequently used paths are distinguished from less frequently used paths.]

Figure 2-1. The P6 Processor Microarchitecture with Advanced Transfer Cache Enhancement

To ensure a steady supply of instructions and data for the instruction execution pipeline, the P6 processor microar-
chitecture incorporates two cache levels. The Level 1 cache provides an 8-KByte instruction cache and an 8-KByte
data cache, both closely coupled to the pipeline. The Level 2 cache provides 256-KByte, 512-KByte, or 1-MByte
static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus.
The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic
execution. Dynamic execution incorporates three data-processing concepts:


• Deep branch prediction allows the processor to decode instructions beyond branches to keep the instruction
pipeline full. The P6 processor family implements highly optimized branch prediction algorithms to predict the
direction of the instruction flow.
• Dynamic data flow analysis requires real-time analysis of the flow of data through the processor to
determine dependencies and to detect opportunities for out-of-order instruction execution. The out-of-order
execution core can monitor many instructions and execute these instructions in the order that best optimizes
the use of the processor’s multiple execution units while maintaining data integrity.
• Speculative execution refers to the processor’s ability to execute instructions that lie beyond a conditional
branch that has not yet been resolved, and ultimately to commit the results in the order of the original
instruction stream. To make speculative execution possible, the P6 processor microarchitecture decouples the
dispatch and execution of instructions from the commitment of results. The processor’s out-of-order execution
core uses data-flow analysis to execute all available instructions in the instruction pool and store the results in
temporary registers. The retirement unit then linearly searches the instruction pool for completed instructions
that no longer have data dependencies with other instructions or unresolved branch predictions. When
completed instructions are found, the retirement unit commits the results of these instructions to memory
and/or the IA-32 registers (the processor’s eight general-purpose registers and eight x87 FPU data registers)
in the order they were originally issued and retires the instructions from the instruction pool.

2.2.2 Intel NetBurst® Microarchitecture


The Intel NetBurst microarchitecture provides:
• The Rapid Execution Engine.
— Arithmetic Logic Units (ALUs) run at twice the processor frequency.
— Basic integer operations can dispatch in 1/2 processor clock tick.
• Hyper-Pipelined Technology.
— Deep pipeline to enable industry-leading clock rates for desktop PCs and servers.
— Frequency headroom and scalability to continue leadership into the future.
• Advanced Dynamic Execution.
— Deep, out-of-order, speculative execution engine.
• Up to 126 instructions in flight.
• Up to 48 loads and 24 stores in pipeline (Intel 64 and IA-32 processors based on the Intel NetBurst
microarchitecture at 90 nm process can handle more than 24 stores in flight).
— Enhanced branch prediction capability.
• Reduces the misprediction penalty associated with deeper pipelines.
• Advanced branch prediction algorithm.
• 4K-entry branch target array.
• New cache subsystem.
— First level caches.
• Advanced Execution Trace Cache stores decoded instructions.
• Execution Trace Cache removes decoder latency from main execution loops.
• Execution Trace Cache integrates path of program execution flow into a single line.
• Low latency data cache.
— Second level cache.
• Full-speed, unified 8-way Level 2 on-die Advance Transfer Cache.
• Bandwidth and performance increases with processor frequency.


• High-performance, quad-pumped bus interface to the Intel NetBurst microarchitecture system bus.
— Supports quad-pumped, scalable bus clock to achieve up to 4X effective speed.
— Capable of delivering up to 8.5 GBytes per second of bandwidth.
• Superscalar issue to enable parallelism.
• Expanded hardware registers with renaming to avoid register name space limitations.
• 64-byte cache line size (transfers data up to two lines per sector).
Figure 2-2 is an overview of the Intel NetBurst microarchitecture. This microarchitecture pipeline is made up of
three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the retirement unit.

[Block diagram: the bus unit connects the system bus to an optional third level cache, the 8-way second level
cache, and the 4-way first level cache; the front end (fetch/decode with execution trace cache and microcode ROM)
feeds the out-of-order core and retirement unit; BTBs/branch prediction logic receives branch history updates.
Frequently used paths are distinguished from less frequently used paths.]

Figure 2-2. The Intel NetBurst® Microarchitecture

2.2.2.1 The Front End Pipeline


The front end supplies instructions in program order to the out-of-order execution core. It performs a number of
functions:
• Prefetches instructions that are likely to be executed.
• Fetches instructions that have not already been prefetched.
• Decodes instructions into micro-operations.
• Generates microcode for complex instructions and special-purpose code.
• Delivers decoded instructions from the execution trace cache.
• Predicts branches using a highly advanced algorithm.
The pipeline is designed to address common problems in high-speed, pipelined microprocessors. Two of these
problems contribute to major sources of delays:
• Time to decode instructions fetched from the target.


• Wasted decode bandwidth due to branches or branch targets in the middle of cache lines.
The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and
decoded by the translation engine (part of the fetch/decode logic) and built into sequences of micro-ops called
traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The
trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first
instruction in a pre-fetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the
pre-fetched branch becomes the new source of instructions (see Figure 2-2).
The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are
predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.

2.2.2.2 Out-Of-Order Execution Core


The out-of-order execution core’s ability to execute instructions out of order is a key factor in enabling parallelism.
This feature enables the processor to reorder instructions so that if one micro-op is delayed, other micro-ops may
proceed around it. The processor employs several buffers to smooth the flow of micro-ops.
The core is designed to facilitate parallel execution. It can dispatch up to six micro-ops per cycle (this exceeds trace
cache and retirement micro-op bandwidth). Most pipelines can start executing a new micro-op every cycle, so
several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instruc-
tions can start at two per cycle; many floating-point instructions can start once every two cycles.

2.2.2.3 Retirement Unit


The retirement unit receives the results of the executed micro-ops from the out-of-order execution core and
processes the results so that the architectural state updates according to the original program order.
When a micro-op completes and writes its result, it is retired. Up to three micro-ops may be retired per cycle. The
Reorder Buffer (ROB) is the unit in the processor which buffers completed micro-ops, updates the architectural
state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and
sends updated branch target information to the BTB. The BTB then purges pre-fetched traces that are no longer
needed.

2.2.3 Intel® Core™ Microarchitecture


Intel Core microarchitecture introduces the following features that enable high performance and power efficiency
for single-threaded as well as multi-threaded workloads:
• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, and execute at high bandwidth
to support retirement of up to four instructions per cycle.
— Fourteen-stage efficient pipeline.
— Three arithmetic logical units.
— Four decoders to decode up to five instructions per cycle.
— Macro-fusion and micro-fusion to improve front-end throughput.
— Peak issue rate of dispatching up to six micro-ops per cycle.
— Peak retirement bandwidth of up to 4 micro-ops per cycle.
— Advanced branch prediction.
— Stack pointer tracker to improve efficiency of executing function/procedure entries and exits.
• Intel® Advanced Smart Cache delivers higher bandwidth from the second level cache to the core, and
optimal performance and flexibility for single-threaded and multi-threaded applications.
— Large second level cache up to 4 MB and 16-way associativity.
— Optimized for multicore and single-threaded execution environments.
— 256-bit internal data path to improve bandwidth from L2 to first-level data cache.


• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces
cache-miss exposure of out-of-order execution.
— Hardware prefetchers to reduce effective latency of second-level cache misses.
— Hardware prefetchers to reduce effective latency of first-level data cache misses.
— Memory disambiguation to improve efficiency of speculative execution engine.
• Intel® Advanced Digital Media Boost accelerates 128-bit SIMD and floating-point operations:
— Single-cycle throughput of most 128-bit SIMD instructions.
— Up to eight floating-point operations per cycle.
— Three issue ports available to dispatching SIMD instructions for execution.
Intel Core 2 Extreme and Intel Core 2 Duo processors and the Intel Xeon processor 5100 series implement two
processor cores based on the Intel Core microarchitecture; the functionality of the subsystems in each core is
depicted in Figure 2-3.

[Block diagram: instruction fetch and predecode feed an instruction queue and decode unit (backed by a microcode
ROM); rename/alloc and the retirement unit (re-order buffer) surround a scheduler that dispatches to the execution
units (ALU/branch, ALU/FAdd, ALU/FMul, MMX/SSE/FP move, load, and store), which are backed by the L1 data cache
and DTLB; a shared L2 cache connects to the FSB at up to 10.7 GB/s.]

Figure 2-3. The Intel® Core™ Microarchitecture Pipeline Functionality

2.2.3.1 The Front End


The front end of Intel Core microarchitecture provides several enhancements to feed the Intel Wide Dynamic
Execution engine:
• The instruction fetch unit prefetches instructions into an instruction queue to maintain a steady supply of
instructions to the decode units.
• The four-wide decode unit can decode 4 instructions per cycle, or 5 instructions per cycle with macrofusion.
• Macrofusion fuses common sequences of two instructions into one decoded instruction (micro-op) to increase
decoding throughput.
• Microfusion fuses common sequences of two micro-ops into one micro-op to improve retirement throughput.


• Instruction queue provides caching of short loops to improve efficiency.


• Stack pointer tracker improves efficiency of executing procedure/function entries and exits.
• Branch prediction unit employs dedicated hardware to handle different types of branches for improved branch
prediction.
• Advanced branch prediction algorithm directs instruction fetch unit to fetch instructions likely in the architec-
tural code path for decoding.

2.2.3.2 Execution Core


The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order to
increase the overall rate of instructions executed per cycle (IPC). The execution core employs the following
features to improve execution throughput and efficiency:
• Up to six micro-ops can be dispatched to execute per cycle.
• Up to four instructions can be retired per cycle.
• Three full arithmetic logical units.
• SIMD instructions can be dispatched through three issue ports.
• Most SIMD instructions have 1-cycle throughput (including 128-bit SIMD instructions).
• Up to eight floating-point operations per cycle.
• Many long-latency computation operations are pipelined in hardware to increase overall throughput.
• Reduced exposure to data access delays using Intel Smart Memory Access.

2.2.4 Intel Atom® Microarchitecture


Intel Atom microarchitecture maximizes power-efficient performance for single-threaded and multi-threaded
workloads by providing:
• Advanced Micro-Ops Execution
— Single-micro-op instruction execution from decode to retirement, including instructions with register-only,
load, and store semantics.
— Sixteen-stage, in-order pipeline optimized for throughput and reduced power consumption.
— Dual pipelines to enable decode, issue, execution, and retirement of two instructions per cycle.
— Advanced stack pointer to improve efficiency of executing function entry/returns.
• Intel® Smart Cache
— A 512-KByte second level cache with 8-way associativity.
— Optimized for multi-threaded and single-threaded execution environments.
— A 256-bit internal data path between the L2 and L1 data caches improves bandwidth.
• Efficient Memory Access
— Efficient hardware prefetchers to L1 and L2 speculatively load data likely to be requested by the processor
to reduce cache miss impact.
• Intel® Digital Media Boost
— Two issue ports for dispatching SIMD instructions to execution units.
— Single-cycle throughput for most 128-bit integer SIMD instructions.
— Up to six floating-point operations per cycle.
— Up to two 128-bit SIMD integer operations per cycle.
— Safe Instruction Recognition (SIR) to allow long-latency floating-point operations to retire out of order with
respect to integer instructions.


2.2.5 Nehalem Microarchitecture


Nehalem microarchitecture provides the foundation for many features of Intel Core i7 processors. It builds on the
success of 45 nm Intel Core microarchitecture and provides the following feature enhancements:
• Enhanced processor core
— Improved branch prediction and recovery from misprediction.
— Enhanced loop streaming to improve front end performance and reduce power consumption.
— Deeper buffering in out-of-order engine to extract parallelism.
— Enhanced execution units to provide acceleration in CRC, string/text processing and data shuffling.
• Smart Memory Access
— Integrated memory controller provides low-latency access to system memory and scalable memory
bandwidth.
— New cache hierarchy organization with shared, inclusive L3 to reduce snoop traffic.
— Two level TLBs and increased TLB size.
— Fast unaligned memory access.
• Hyper-Threading Technology
— Provides two hardware threads (logical processors) per core.
— Takes advantage of the 4-wide execution engine, large L3, and massive memory bandwidth.
• Dedicated power management innovations
— Integrated microcontroller with optimized embedded firmware to manage power consumption.
— Embedded real-time sensors for temperature, current, and power.
— Integrated power gate to turn per-core power consumption off and on.
— Versatility to reduce power consumption of the memory and link subsystems.

2.2.6 Sandy Bridge Microarchitecture


Sandy Bridge microarchitecture builds on the successes of Intel® Core™ microarchitecture and Nehalem microar-
chitecture. It offers the following features:
• Intel Advanced Vector Extensions (Intel AVX).
— 256-bit floating-point instruction set extensions to the 128-bit Intel Streaming SIMD Extensions, providing
up to 2X performance benefits relative to 128-bit code.
— Non-destructive destination encoding offers more flexible coding techniques.
— Supports flexible migration and co-existence between 256-bit AVX code, 128-bit AVX code and legacy 128-
bit SSE code.
• Enhanced front-end and execution engine.
— New decoded Icache component that improves front-end bandwidth and reduces branch misprediction
penalty.
— Advanced branch prediction.
— Additional macro-fusion support.
— Larger dynamic execution window.
— Multi-precision integer arithmetic enhancements (ADC/SBB, MUL/IMUL).
— LEA bandwidth improvement.
— Reduction of general execution stalls (read ports, writeback conflicts, bypass latency, partial stalls).
— Fast floating-point exception handling.


— XSAVE/XRSTOR performance improvements and the new XSAVEOPT instruction.


• Cache hierarchy improvements for wider data path.
— Doubling of bandwidth enabled by two symmetric ports for memory operation.
— Simultaneous handling of more in-flight loads and stores enabled by increased buffers.
— Internal bandwidth of two loads and one store each cycle.
— Improved prefetching.
— High bandwidth low latency LLC architecture.
— High bandwidth ring architecture of on-die interconnect.
For additional information on Intel® Advanced Vector Extensions (AVX), see Section 5.13, “Intel® Advanced Vector
Extensions (Intel® AVX)” and Chapter 14, “Programming with Intel® AVX, FMA, and Intel® AVX2” in the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 1.

2.2.7 SIMD Instructions


Beginning with the Pentium II and Pentium with Intel MMX technology processor families, six extensions have been
introduced into the Intel 64 and IA-32 architectures to perform single-instruction multiple-data (SIMD) operations.
These extensions include the MMX technology, SSE extensions, SSE2 extensions, SSE3 extensions, Supplemental
Streaming SIMD Extensions 3, and SSE4. Each of these extensions provides a group of instructions that perform
SIMD operations on packed integer and/or packed floating-point data elements.
SIMD integer operations can use the 64-bit MMX or the 128-bit XMM registers. SIMD floating-point operations use
128-bit XMM registers. Figure 2-4 shows a summary of the various SIMD extensions (MMX technology, Intel SSE,
Intel SSE2, Intel SSE3, SSSE3, and Intel SSE4), the data types they operate on, and how the data types are packed
into MMX and XMM registers.
The Intel MMX technology was introduced in the Pentium II and Pentium with MMX technology processor families.
MMX instructions perform SIMD operations on packed byte, word, or doubleword integers located in MMX registers.
These instructions are useful in applications that operate on integer arrays and streams of integer data that lend
themselves to SIMD processing.
Intel SSE was introduced in the Pentium III processor family. Intel SSE instructions operate on packed single preci-
sion floating-point values contained in XMM registers and on packed integers contained in MMX registers. Several
Intel SSE instructions provide state management, cache control, and memory ordering operations. Other Intel SSE
instructions are targeted at applications that operate on arrays of single precision floating-point data elements (3-
D geometry, 3-D rendering, and video encoding and decoding applications).
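For illustration, a single Intel SSE instruction can operate on four single precision values at once. The
following minimal sketch uses compiler intrinsics; the intrinsics layer and the function and variable names are
assumptions of this example, not part of the architecture definition:

#include <xmmintrin.h>   /* Intel SSE intrinsics */

/* Add four pairs of packed single precision floating-point values
   with a single ADDPS operation. */
void add4_ps(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);    /* load 4 packed floats (unaligned) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}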
Intel SSE2 was introduced in the Pentium 4 and Intel Xeon processors. Intel SSE2 instructions operate on packed
double precision floating-point values contained in XMM registers and on packed integers contained in MMX and
XMM registers. Intel SSE2 integer instructions extend IA-32 SIMD operations by adding new 128-bit SIMD integer
operations and by expanding existing 64-bit SIMD integer operations to 128-bit XMM capability. Intel SSE2 instruc-
tions also provide new cache control and memory ordering operations.
Intel SSE3 was introduced with the Pentium 4 processor supporting Hyper-Threading Technology (built on 90 nm
process technology). Intel SSE3 offers 13 instructions that accelerate performance of Streaming SIMD Extensions
technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities.
SSSE3 was introduced with the Intel Xeon processor 5100 series and Intel Core 2 processor family. SSSE3 offers 32
instructions to accelerate processing of SIMD integer data.
Intel SSE4 offers 54 instructions. 47 of them are referred to as Intel SSE4.1 instructions. Intel SSE4.1 was intro-
duced with the Intel Xeon processor 5400 series and Intel Core 2 Extreme processor QX9650. The other seven Intel
SSE4 instructions are referred to as Intel SSE4.2 instructions.
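One of the Intel SSE4.2 instructions, CRC32, accelerates computation of the CRC-32C (Castagnoli) checksum. A
minimal sketch using compiler intrinsics, assuming a compiler and processor with SSE4.2 support (the function
name and byte-at-a-time buffer handling are illustrative):

#include <nmmintrin.h>   /* Intel SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Accumulate a CRC-32C checksum one byte at a time using the CRC32 instruction. */
uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;              /* conventional initial value */
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]);     /* CRC32 r32, r8 */
    return crc ^ 0xFFFFFFFFu;                /* conventional final inversion */
}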
Intel AES-NI and PCLMULQDQ introduced seven new instructions. Six of them are primitives for accelerating
algorithms based on the AES encryption/decryption standard and are referred to as Intel AES-NI.
The seventh, the PCLMULQDQ instruction, performs carry-less multiplication of two binary numbers up to 64 bits
wide and accelerates general-purpose block encryption.
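As a sketch of how PCLMULQDQ is reached from C, assuming a compiler that exposes the PCLMULQDQ intrinsic (the
wrapper function is illustrative):

#include <wmmintrin.h>   /* Intel AES-NI and PCLMULQDQ intrinsics */
#include <stdint.h>

/* Carry-less multiply of two 64-bit operands; the immediate byte selects
   which quadword of each 128-bit source participates (0x00: low/low). */
__m128i clmul_lo(uint64_t a, uint64_t b)
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    return _mm_clmulepi64_si128(va, vb, 0x00);   /* 128-bit product, no carries */
}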


Intel 64 architecture allows four generations of 128-bit SIMD extensions to access up to 16 XMM registers. IA-32
architecture provides eight XMM registers.
Intel® Advanced Vector Extensions offers comprehensive architectural enhancements over previous generations of
Streaming SIMD Extensions. Intel AVX introduces the following architectural enhancements:
• Support for 256-bit wide vectors and SIMD register set.
• 256-bit floating-point instruction set enhancement with up to 2X performance gain relative to 128-bit
Streaming SIMD extensions.
• Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility
and efficient encoding of new instruction extensions.
• Enhancement of legacy 128-bit SIMD instruction extensions to support three operand syntax and to simplify
compiler vectorization of high-level language expressions.
• Support for flexible deployment of 256-bit AVX code, 128-bit AVX code, legacy 128-bit code, and scalar code.
In addition to performance considerations, programmers should also be cognizant of how VEX-encoded AVX
instructions interact with the expectations of system software components that manage the processor state
components enabled by XCR0. For additional information, see Section 2.3.10.1, “Vector Length Transition and
Programming Considerations” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
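The non-destructive, three-operand behavior is visible even at the intrinsics level: the destination of a 256-bit
operation is distinct from both sources, so neither input is overwritten. A minimal sketch, assuming a compiler
with Intel AVX intrinsics support (function and variable names are illustrative):

#include <immintrin.h>   /* Intel AVX intrinsics */

/* VADDPS ymm1, ymm2, ymm3: eight single precision additions in one
   instruction, with a destination separate from both source operands. */
void add8_ps(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);   /* 8 packed single precision values */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}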
See also:
• Section 5.4, “MMX Instructions,” and Chapter 9, “Programming with Intel® MMX™ Technology.”
• Section 5.5, “Intel® SSE Instructions,” and Chapter 10, “Programming with Intel® Streaming SIMD Extensions
(Intel® SSE).”
• Section 5.6, “Intel® SSE2 Instructions,” and Chapter 11, “Programming with Intel® Streaming SIMD
Extensions 2 (Intel® SSE2).”
• Section 5.7, “Intel® SSE3 Instructions,” Section 5.8, “Supplemental Streaming SIMD Extensions 3 (SSSE3)
Instructions,” Section 5.9, “Intel® SSE4 Instructions,” and Chapter 12, “Programming with Intel® SSE3,
SSSE3, Intel® SSE4, and Intel® AES-NI.”


[Figure content summarized: each SIMD extension, its register file, and the data types it packs.]
• MMX technology through SSSE3 (MMX registers): 8 packed byte integers; 4 packed word integers; 2 packed
doubleword integers; quadword.
• Intel SSE through Intel AVX (XMM registers): 4 packed single precision floating-point values; 2 packed double
precision floating-point values; 16 packed byte integers; 8 packed word integers; 4 packed doubleword integers;
2 quadword integers; double quadword.
• Intel AVX (YMM registers): 8 packed single precision floating-point values; 4 packed double precision
floating-point values; two 128-bit data elements.

Figure 2-4. SIMD Extensions, Register Layouts, and Data Types

2.2.8 Intel® Hyper-Threading Technology


Intel Hyper-Threading Technology (Intel HT Technology) was developed to improve the performance of IA-32
processors when executing multi-threaded operating system and application code or single-threaded applications
under multi-tasking environments. The technology enables a single physical processor to execute two or more
separate code streams (threads) concurrently using shared execution resources.
Intel HT Technology is one form of hardware multi-threading capability in IA-32 processor families. It differs
from multi-processor capability, which uses separate, physically distinct packages, each mated with a physical
socket. Intel HT Technology provides hardware multi-threading capability within a single physical package by
using shared execution resources in a processor core.
Architecturally, an IA-32 processor that supports Intel HT Technology consists of two or more logical processors,
each of which has its own IA-32 architectural state. Each logical processor consists of a full set of IA-32 data regis-
ters, segment registers, control registers, debug registers, and most of the MSRs. Each also has its own advanced
programmable interrupt controller (APIC).
Figure 2-5 shows a comparison of a processor that supports Intel HT Technology (implemented with two logical
processors) and a traditional dual processor system.


[Block diagram: an IA-32 processor supporting Intel Hyper-Threading Technology holds two architectural states
(AS) that share a single processor core; in a traditional multiple processor (MP) system, each IA-32 processor is
a separate physical package with its own architectural state.]

Figure 2-5. Comparison of an IA-32 Processor Supporting Intel® Hyper-Threading Technology and a Traditional
Dual Processor System
Unlike a traditional MP system configuration that uses two or more separate physical IA-32 processors, the logical
processors in an IA-32 processor supporting Intel HT Technology share the core resources of the physical
processor. This includes the execution engine and the system bus interface. After power up and initialization, each
logical processor can be independently directed to execute a specified thread, interrupted, or halted.
Intel HT Technology leverages the process and thread-level parallelism found in contemporary operating systems
and high-performance applications by providing two or more logical processors on a single chip. This configuration
allows two or more threads1 to be executed simultaneously on each physical processor. Each logical processor
executes instructions from an application thread using the resources in the processor core. The core executes
these threads concurrently, using out-of-order instruction scheduling to maximize the use of execution units during
each clock cycle.

2.2.8.1 Some Implementation Notes


All Intel HT Technology configurations require:
• A processor that supports Intel HT Technology.
• A chipset and BIOS that utilize the technology.
• Operating system optimizations.
See http://www.intel.com/products/ht/hyperthreading_more.htm for information.
At the firmware (BIOS) level, the basic procedures to initialize the logical processors in a processor supporting Intel
HT Technology are the same as those for a traditional DP or MP platform. The mechanisms that are described in the
Multiprocessor Specification, Version 1.4, to power-up and initialize physical processors in an MP system also apply
to logical processors in a processor that supports Intel HT Technology.
An operating system designed to run on a traditional DP or MP platform may use CPUID to determine the presence
of the hardware multi-threading support feature and the number of logical processors provided.
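As a sketch of such a check, CPUID leaf 01H reports the hardware multi-threading feature flag in EDX[28] and,
when that flag is set, the maximum number of addressable logical-processor IDs in EBX[23:16]. The example below
assumes a GCC-compatible compiler that provides <cpuid.h>; the helper function belongs to the compiler, not to
the architecture:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        int htt     = (edx >> 28) & 1;     /* CPUID.1:EDX[28]: multi-threading flag */
        int max_ids = (ebx >> 16) & 0xFF;  /* CPUID.1:EBX[23:16]: max logical IDs */
        printf("HTT flag: %d, max addressable logical processor IDs: %d\n",
               htt, max_ids);
    }
    return 0;
}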
Although existing operating system and application code should run correctly on a processor that supports Intel HT
Technology, some code modifications are recommended to get the optimum benefit. These modifications are
discussed in Chapter 7, “Multiple-Processor Management,” Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.

1. In the remainder of this document, the term “thread” will be used as a general term for the terms “process” and “thread.”


2.2.9 Multi-Core Technology


Multi-core technology is another form of hardware multi-threading capability in IA-32 processor families. Multi-core
technology enhances hardware multi-threading capability by providing two or more execution cores in a physical
package.
The Intel Pentium processor Extreme Edition is the first member in the IA-32 processor family to introduce multi-
core technology. The processor provides hardware multi-threading support with both two processor cores and Intel
Hyper-Threading Technology. This means that the Intel Pentium processor Extreme Edition provides four logical
processors in a physical package (two logical processors for each processor core). The Dual-Core Intel Xeon
processor features multi-core technology and Intel Hyper-Threading Technology, and supports multi-processor platforms.
The Intel Pentium D processor also features multi-core technology. This processor provides hardware multi-
threading support with two processor cores but does not offer Intel Hyper-Threading Technology. This means that
the Intel Pentium D processor provides two logical processors in a physical package, with each logical processor
owning the complete execution resources of a processor core.
The Intel Core 2 processor family, Intel Xeon processor 3000 series, Intel Xeon processor 5100 series, and Intel
Core Duo processor offer power-efficient multi-core technology. The processor contains two cores that share a
smart second level cache. The Level 2 cache enables efficient data sharing between two cores to reduce memory
traffic to the system bus.

[Block diagram: the Intel Core Duo, Intel Core 2 Duo, and Intel Pentium dual-core processors implement two cores,
each with its own architectural state, execution engine, and local APIC, sharing a second level cache and a bus
interface to the system bus; the Pentium D processor implements two cores, each with its own bus interface; the
Pentium processor Extreme Edition implements two cores, each presenting two architectural states via
Hyper-Threading Technology.]

Figure 2-6. Intel 64 and IA-32 Processors that Support Dual-Core

The Pentium® dual-core processor is based on the same technology as the Intel Core 2 Duo processor family.
The Intel Xeon processor 7300, 5300, and 3200 series, Intel Core 2 Extreme Quad-Core processor, and Intel Core
2 Quad processors support Intel quad-core technology. The quad-core Intel Xeon processors and the quad-core
Intel Core 2 processor family are also shown in Figure 2-7.


[Block diagram: the Intel Core 2 Extreme quad-core processor, Intel Core 2 Quad processor, and Intel Xeon
processor 3200 and 5300 series implement four cores, each with its own architectural state, execution engine,
and local APIC, with each pair of cores sharing a second level cache and a bus interface to the system bus.]

Figure 2-7. Intel® 64 Processors that Support Quad-Core

Intel Core i7 processors support Intel quad-core technology and Intel Hyper-Threading Technology, provide an
Intel QuickPath Interconnect link to the chipset, and have an integrated memory controller supporting three
channels of DDR3 memory.

[Block diagram: the Intel Core i7 processor implements four cores, each presenting two logical processors and
having its own execution engine and L1 and L2 caches; the cores share a third level cache and connect through
the Intel QuickPath Interconnect (QPI) interface to the chipset and through the integrated memory controller
(IMC) to DDR3 memory.]

Figure 2-8. Intel® Core™ i7 Processor


2.2.10 Intel® 64 Architecture


Intel 64 architecture increases the linear address space for software to 64 bits and supports physical address space
up to 52 bits. The technology also introduces a new operating mode referred to as IA-32e mode.
IA-32e mode operates in one of two sub-modes: (1) compatibility mode enables a 64-bit operating system to run
most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run applications
written to access 64-bit address space.
In 64-bit mode, applications may access:
• 64-bit flat linear addressing.
• 8 additional general-purpose registers (GPRs).
• 8 additional registers for streaming SIMD extensions (Intel SSE, SSE2, SSE3, and SSSE3).
• 64-bit-wide GPRs and instruction pointers.
• Uniform byte-register addressing.
• Fast interrupt-prioritization mechanism.
• A new instruction-pointer relative-addressing mode.
An Intel 64 architecture processor supports existing IA-32 software because it is able to run all non-64-bit legacy
modes supported by IA-32 architecture. Most existing IA-32 applications also run in compatibility mode.
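Software can verify that a processor implements Intel 64 architecture by testing the long-mode feature bit,
CPUID.80000001H:EDX[29]. A minimal sketch, assuming a GCC-compatible compiler that provides <cpuid.h>:

#include <cpuid.h>

/* Returns nonzero if the processor supports Intel 64 (IA-32e mode). */
int supports_intel64(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 29) & 1;   /* CPUID.80000001H:EDX[29]: long mode */
}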

2.2.11 Intel® Virtualization Technology (Intel® VT)


Intel® Virtualization Technology for Intel 64 and IA-32 architectures provides extensions that support virtualization.
The extensions are referred to as Virtual Machine Extensions (VMX). An Intel 64 or IA-32 platform with VMX can
function as multiple virtual systems (or virtual machines). Each virtual machine can run operating systems and
applications in separate partitions.
VMX also provides a programming interface for a new layer of system software (called the Virtual Machine Monitor
(VMM)) used to manage the operation of virtual machines. Information on VMX and on the programming of VMMs
is in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C.
Intel Core i7 processor provides the following enhancements to Intel Virtualization Technology:
• Virtual processor IDs (VPIDs) to reduce the cost to the VMM of managing transitions.
• Extended page table (EPT) to reduce the number of transitions for VMM to manage memory virtualization.
• Reduced latency of VM transitions.
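Processor support for VMX is enumerated by CPUID.1:ECX[5]. A minimal sketch of the check, assuming a
GCC-compatible compiler that provides <cpuid.h> (note that actually entering VMX operation additionally requires
system software to configure the IA32_FEATURE_CONTROL MSR):

#include <cpuid.h>

/* Returns nonzero if the processor enumerates Virtual Machine Extensions. */
int supports_vmx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 5) & 1;   /* CPUID.1:ECX[5]: VMX */
}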

2.3 INTEL® 64 AND IA-32 PROCESSOR GENERATIONS


In the mid-1960s, Intel co-founder and Chairman Emeritus Gordon Moore made this observation: “... the number of
transistors that would be incorporated on a silicon die would double every 18 months for the next several years.”
Over the past three and a half decades, this prediction, known as “Moore's Law,” has continued to hold true.
The computing power and the complexity (or roughly, the number of transistors per processor) of Intel architecture
processors has grown in close relation to Moore's law. By taking advantage of new process technology and new
microarchitecture designs, each new generation of IA-32 processors has demonstrated frequency-scaling head-
room and new performance levels over the previous generation processors.


The key features of recent IA-32 processors are shown in Table 2-1, key features of recent Intel 64 processors
are shown in Table 2-2, and older generations of IA-32 processors are shown in Table 2-3.
Table 2-1. Key Features of Most Recent IA-32 Processors

• Intel Pentium M Processor 755³ (introduced 2004): Intel Pentium M processor microarchitecture; top-bin clock
frequency at introduction 2.00 GHz; 140 M transistors; register sizes¹ GP: 32, FPU: 80, MMX: 64, XMM: 128;
system bus bandwidth 3.2 GB/s; max. external address space 4 GB; on-die caches² L1: 64 KB, L2: 2 MB.
• Intel Core Duo Processor T2600³ (introduced 2006): improved Intel Pentium M processor microarchitecture; dual
core; Intel Smart Cache; Advanced Thermal Manager; top-bin clock frequency at introduction 2.16 GHz; 152 M
transistors; register sizes GP: 32, FPU: 80, MMX: 64, XMM: 128; system bus bandwidth 5.3 GB/s; max. external
address space 4 GB; on-die caches L1: 64 KB, L2: 2 MB (2 MB total).
• Intel Atom Processor Z5xx series (introduced 2008): Intel Atom microarchitecture; Intel Virtualization
Technology; top-bin clock frequency at introduction 1.86 GHz to 800 MHz; 47 M transistors; register sizes GP: 32,
FPU: 80, MMX: 64, XMM: 128; system bus bandwidth up to 4.2 GB/s; max. external address space 4 GB; on-die caches
L1: 56 KB⁴, L2: 512 KB.

NOTES:
1. The register size and external data bus size are given in bits.
2. First level cache is denoted using the abbreviation L1; second level cache is denoted as L2. The size of L1
includes the first-level data cache and the instruction cache where applicable, but does not include the trace
cache.
3. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each
processor family, not across different processor families. See http://www.intel.com/products/processor_number
for details.
4. In the Intel Atom processor, the L1 instruction cache is 32 KBytes and the L1 data cache is 24 KBytes.

Table 2-2. Key Features of Most Recent Intel® 64 Processors

Each entry lists, in order: microarchitecture and features; highest processor base frequency at introduction;
transistor count; system bus/QPI link speed; maximum external address space; and on-die caches. Unless noted
otherwise, register sizes are GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 (given in bits).

• 64-bit Intel Xeon Processor with 800 MHz System Bus (2004): Intel NetBurst microarchitecture; Intel
Hyper-Threading Technology; Intel 64 architecture; 3.60 GHz; 125 M transistors; 6.4 GB/s; 64 GB; 12K µop
execution trace cache, 16 KB L1, 1 MB L2.
• 64-bit Intel Xeon Processor MP with 8 MB L3 (2005): Intel NetBurst microarchitecture; Intel Hyper-Threading
Technology; Intel 64 architecture; 3.33 GHz; 675 M transistors; 5.3 GB/s¹; 1024 GB (1 TB); 12K µop execution
trace cache, 16 KB L1, 1 MB L2, 8 MB L3.
• Intel Pentium 4 Processor Extreme Edition Supporting Hyper-Threading Technology (2005): Intel NetBurst
microarchitecture; Intel Hyper-Threading Technology; Intel 64 architecture; 3.73 GHz; 164 M transistors;
8.5 GB/s; 64 GB; 12K µop execution trace cache, 16 KB L1, 2 MB L2.
• Intel Pentium Processor Extreme Edition 840 (2005): Intel NetBurst microarchitecture; Intel Hyper-Threading
Technology; Intel 64 architecture; dual-core²; 3.20 GHz; 230 M transistors; 6.4 GB/s; 64 GB; 12K µop execution
trace cache, 16 KB L1, 1 MB L2 (2 MB total).
• Dual-Core Intel Xeon Processor 7041 (2005): Intel NetBurst microarchitecture; Intel Hyper-Threading Technology;
Intel 64 architecture; dual-core³; 3.00 GHz; 321 M transistors; 6.4 GB/s; 64 GB; 12K µop execution trace cache,
16 KB L1, 2 MB L2 (4 MB total).
• Intel Pentium 4 Processor 672 (2005): Intel NetBurst microarchitecture; Intel Hyper-Threading Technology; Intel
64 architecture; Intel Virtualization Technology; 3.80 GHz; 164 M transistors; 6.4 GB/s; 64 GB; 12K µop execution
trace cache, 16 KB L1, 2 MB L2.
• Intel Pentium Processor Extreme Edition 955 (2006): Intel NetBurst microarchitecture; Intel 64 architecture;
dual core; Intel Virtualization Technology; 3.46 GHz; 376 M transistors; 8.5 GB/s; 64 GB; 12K µop execution trace
cache, 16 KB L1, 2 MB L2 (4 MB total).
• Intel Core 2 Extreme Processor X6800 (2006): Intel Core microarchitecture; dual core; Intel 64 architecture;
Intel Virtualization Technology; 2.93 GHz; 291 M transistors; 8.5 GB/s; 64 GB; L1: 64 KB, L2: 4 MB (4 MB total).
• Intel Xeon Processor 5160 (2006): Intel Core microarchitecture; dual core; Intel 64 architecture; Intel
Virtualization Technology; 3.00 GHz; 291 M transistors; 10.6 GB/s; 64 GB; L1: 64 KB, L2: 4 MB (4 MB total).
• Intel Xeon Processor 7140 (2006): Intel NetBurst microarchitecture; dual core; Intel 64 architecture; Intel
Virtualization Technology; 3.40 GHz; 1.3 B transistors; 12.8 GB/s; 64 GB; L1: 64 KB, L2: 1 MB (2 MB total),
L3: 16 MB (16 MB total).
• Intel Core 2 Extreme Processor QX6700 (2006): Intel Core microarchitecture; quad core; Intel 64 architecture;
Intel Virtualization Technology; 2.66 GHz; 582 M transistors; 8.5 GB/s; 64 GB; L1: 64 KB, L2: 4 MB (4 MB total).
• Quad-core Intel Xeon Processor 5355 (2006): Intel Core microarchitecture; quad core; Intel 64 architecture;
Intel Virtualization Technology; 2.66 GHz; 582 M transistors; 10.6 GB/s; 256 GB; L1: 64 KB, L2: 4 MB (8 MB total).
• Intel Core 2 Duo Processor E6850 (2007): Intel Core microarchitecture; dual core; Intel 64 architecture; Intel
Virtualization Technology; Intel Trusted Execution Technology; 3.00 GHz; 291 M transistors; 10.6 GB/s; 64 GB;
L1: 64 KB, L2: 4 MB (4 MB total).
• Intel Xeon Processor 7350 (2007): Intel Core microarchitecture; quad core; Intel 64 architecture; Intel
Virtualization Technology; 2.93 GHz; 582 M transistors; 8.5 GB/s; 1024 GB; L1: 64 KB, L2: 4 MB (8 MB total).
• Intel Xeon Processor 5472 (2007): Enhanced Intel Core microarchitecture; quad core; Intel 64 architecture;
Intel Virtualization Technology; 3.00 GHz; 820 M transistors; 12.8 GB/s; 256 GB; L1: 64 KB, L2: 6 MB (12 MB
total).
• Intel Atom Processor (2008): Intel Atom microarchitecture; Intel 64 architecture; Intel Virtualization
Technology; 2.0 to 1.60 GHz; 47 M transistors; up to 4.2 GB/s; up to 64 GB; L1: 56 KB⁴, L2: 512 KB.
• Intel Xeon Processor 7460 (2008): Enhanced Intel Core microarchitecture; six cores; Intel 64 architecture;
Intel Virtualization Technology; 2.67 GHz; 1.9 B transistors; 8.5 GB/s; 1024 GB; L1: 64 KB, L2: 3 MB (9 MB
total), L3: 16 MB.
• Intel Atom Processor 330 (2008): Intel Atom microarchitecture; Intel 64 architecture; dual core; Intel
Virtualization Technology; 1.60 GHz; 94 M transistors; up to 4.2 GB/s; up to 64 GB; L1: 56 KB⁴, L2: 512 KB
(1 MB total).
• Intel Core i7-965 Processor Extreme Edition (2008): Nehalem microarchitecture; quad core; Hyper-Threading
Technology; Intel QPI; Intel 64 architecture; Intel Virtualization Technology; 3.20 GHz; 731 M transistors;
QPI: 6.4 GT/s, memory: 25 GB/s; 64 GB; L1: 64 KB, L2: 256 KB, L3: 8 MB.
• Intel Core i7-620M Processor (2010): Intel Turbo Boost Technology; Westmere microarchitecture; dual core;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; integrated graphics;
2.66 GHz; 383 M transistors; 64 GB; L1: 64 KB, L2: 256 KB, L3: 4 MB.
• Intel Xeon Processor 5680 (2010): Intel Turbo Boost Technology; Westmere microarchitecture; six cores;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; 3.33 GHz; 1.1 B transistors;
QPI: 6.4 GT/s, 32 GB/s; 1 TB; L1: 64 KB, L2: 256 KB, L3: 12 MB.
• Intel Xeon Processor 7560 (2010): Intel Turbo Boost Technology; Nehalem microarchitecture; eight cores;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; 2.26 GHz; 2.3 B transistors;
QPI: 6.4 GT/s, memory: 76 GB/s; 16 TB; L1: 64 KB, L2: 256 KB, L3: 24 MB.
• Intel Core i7-2600K Processor (2011): Intel Turbo Boost Technology; Sandy Bridge microarchitecture; four cores;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; processor graphics; Intel
Quick Sync Video; 3.40 GHz; 995 M transistors; DMI: 5 GT/s, memory: 21 GB/s; 64 GB; registers also include
YMM: 256; L1: 64 KB, L2: 256 KB, L3: 8 MB.
• Intel Xeon Processor E3-1280 (2011): Intel Turbo Boost Technology; Sandy Bridge microarchitecture; four cores;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; 3.50 GHz; DMI: 5 GT/s,
memory: 21 GB/s; 1 TB; registers also include YMM: 256; L1: 64 KB, L2: 256 KB, L3: 8 MB.
• Intel Xeon Processor E7-8870 (2011): Intel Turbo Boost Technology; Westmere microarchitecture; ten cores;
Hyper-Threading Technology; Intel 64 architecture; Intel Virtualization Technology; 2.40 GHz; 2.2 B transistors;
QPI: 6.4 GT/s, memory: 102 GB/s; 16 TB; L1: 64 KB, L2: 256 KB, L3: 30 MB.

NOTES:
1. The 64-bit Intel Xeon processor MP with an 8-MByte L3 supports a multi-processor platform with a dual system
bus; this creates a platform bandwidth of 10.6 GBytes/s.
2. In the Intel Pentium processor Extreme Edition 840, the size of on-die cache is listed for each core. The
total size of L2 in the physical package is 2 MBytes.
3. In the Dual-Core Intel Xeon processor 7041, the size of on-die cache is listed for each core. The total size
of L2 in the physical package is 4 MBytes.
4. In the Intel Atom processor, the L1 instruction cache is 32 KBytes and the L1 data cache is 24 KBytes.


Table 2-3. Key Features of Previous Generations of IA-32 Processors

Each entry lists, in order: maximum clock frequency/technology at introduction; transistor count; register
sizes¹; external data bus size²; maximum external address space; and caches.

• 8086 (1978): 8 MHz; 29 K transistors; 16 GP; 16-bit bus; 1 MB; no on-die caches.
• Intel 286 (1982): 12.5 MHz; 134 K transistors; 16 GP; 16-bit bus; 16 MB; Note 3.
• Intel386 DX Processor (1985): 20 MHz; 275 K transistors; 32 GP; 32-bit bus; 4 GB; Note 3.
• Intel486 DX Processor (1989): 25 MHz; 1.2 M transistors; 32 GP, 80 FPU; 32-bit bus; 4 GB; L1: 8 KB.
• Pentium Processor (1993): 60 MHz; 3.1 M transistors; 32 GP, 80 FPU; 64-bit bus; 4 GB; L1: 16 KB.
• Pentium Pro Processor (1995): 200 MHz; 5.5 M transistors; 32 GP, 80 FPU; 64-bit bus; 64 GB; L1: 16 KB,
L2: 256 KB or 512 KB.
• Pentium II Processor (1997): 266 MHz; 7 M transistors; 32 GP, 80 FPU, 64 MMX; 64-bit bus; 64 GB; L1: 32 KB,
L2: 256 KB or 512 KB.
• Pentium III Processor (1999): 500 MHz; 8.2 M transistors; 32 GP, 80 FPU, 64 MMX, 128 XMM; 64-bit bus; 64 GB;
L1: 32 KB, L2: 512 KB.
• Pentium III and Pentium III Xeon Processors (1999): 700 MHz; 28 M transistors; 32 GP, 80 FPU, 64 MMX, 128 XMM;
64-bit bus; 64 GB; L1: 32 KB, L2: 256 KB.
• Pentium 4 Processor (2000): 1.50 GHz, Intel NetBurst microarchitecture; 42 M transistors; 32 GP, 80 FPU,
64 MMX, 128 XMM; 64-bit bus; 64 GB; 12K µop execution trace cache, L1: 8 KB, L2: 256 KB.
• Intel Xeon Processor (2001): 1.70 GHz, Intel NetBurst microarchitecture; 42 M transistors; 32 GP, 80 FPU,
64 MMX, 128 XMM; 64-bit bus; 64 GB; 12K µop execution trace cache, L1: 8 KB, L2: 512 KB.
• Intel Xeon Processor (2002): 2.20 GHz, Intel NetBurst microarchitecture, Hyper-Threading Technology; 55 M
transistors; 32 GP, 80 FPU, 64 MMX, 128 XMM; 64-bit bus; 64 GB; 12K µop execution trace cache, L1: 8 KB,
L2: 512 KB.
• Pentium M Processor (2003): 1.60 GHz, Intel Pentium M processor microarchitecture; 77 M transistors; 32 GP,
80 FPU, 64 MMX, 128 XMM; 64-bit bus; 4 GB; L1: 64 KB, L2: 1 MB.
• Intel Pentium 4 Processor Supporting Hyper-Threading Technology at 90 nm process (2004): 3.40 GHz, Intel
NetBurst microarchitecture, Hyper-Threading Technology; 125 M transistors; 32 GP, 80 FPU, 64 MMX, 128 XMM;
64-bit bus; 64 GB; 12K µop execution trace cache, L1: 16 KB, L2: 1 MB.

NOTES:
1. The register size and external data bus size are given in bits. Note also that each 32-bit general-purpose
(GP) register can be addressed as an 8- or a 16-bit data register in all of these processors.
2. Internal data paths are 2 to 4 times wider than the external data bus for each processor.

2.4 PLANNED REMOVAL OF INTEL® INSTRUCTION SET ARCHITECTURE AND FEATURES FROM UPCOMING PRODUCTS

This section lists Intel Instruction Set Architecture (ISA) and features that Intel plans to remove from select
products starting from a specific year.

Table 2-4. Planned Intel® ISA and Features Removal List

Intel ISA/Feature | Year of Removal
Sub-page write permissions for EPT | 2024 onwards
xAPIC mode | 2025 onwards
Uncore PMI: IA32_DEBUGCTL MSR, bit 13 (MSR address 1D9H) | 2026 onwards

2.5 INTEL® INSTRUCTION SET ARCHITECTURE AND FEATURES REMOVED

This section lists Intel ISA and features that Intel has already removed from select upcoming products. All sections relevant to the removed features will be identified as such and may be moved to an archived section in future Intel® 64 and IA-32 Architectures Software Developer's Manual releases.

Table 2-5. Intel® ISA and Features Removal List

Intel ISA/Feature | Year of Removal
Intel® Memory Protection Extensions (Intel® MPX) | 2019 onwards
MSR_TEST_CTRL, bit 31 (MSR address 33H) | 2019 onwards
Hardware Lock Elision (HLE) | 2019 onwards
VP2INTERSECT | 2023 onwards

3. Updates to Chapter 5, Volume 1
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Table 5-2, “Instruction Set Extensions Introduction in Intel® 64 and IA-32 Processors,” with the ISA
features that have moved into the Intel® 64 and IA-32 Architectures Software Developer’s Manuals in this
release.
• Added Section 5.31, “Intel® AVX10.1 Instructions.”


CHAPTER 5
INSTRUCTION SET SUMMARY

This chapter provides an abridged overview of Intel 64 and IA-32 instructions. Instructions are divided into the
following groups:
• Section 5.1, “General-Purpose Instructions.”
• Section 5.2, “x87 FPU Instructions.”
• Section 5.3, “x87 FPU AND SIMD State Management Instructions.”
• Section 5.4, “MMX Instructions.”
• Section 5.5, “Intel® SSE Instructions.”
• Section 5.6, “Intel® SSE2 Instructions.”
• Section 5.7, “Intel® SSE3 Instructions.”
• Section 5.8, “Supplemental Streaming SIMD Extensions 3 (SSSE3) Instructions.”
• Section 5.9, “Intel® SSE4 Instructions.”
• Section 5.10, “Intel® SSE4.1 Instructions.”
• Section 5.11, “Intel® SSE4.2 Instruction Set.”
• Section 5.12, “Intel® AES-NI and PCLMULQDQ.”
• Section 5.13, “Intel® Advanced Vector Extensions (Intel® AVX).”
• Section 5.14, “16-bit Floating-Point Conversion.”
• Section 5.15, “Fused-Multiply-ADD (FMA).”
• Section 5.16, “Intel® Advanced Vector Extensions 2 (Intel® AVX2).”
• Section 5.17, “Intel® Transactional Synchronization Extensions (Intel® TSX).”
• Section 5.18, “Intel® SHA Extensions.”
• Section 5.19, “Intel® Advanced Vector Extensions 512 (Intel® AVX-512).”
• Section 5.20, “System Instructions.”
• Section 5.21, “64-Bit Mode Instructions.”
• Section 5.22, “Virtual-Machine Extensions.”
• Section 5.23, “Safer Mode Extensions.”
• Section 5.24, “Intel® Memory Protection Extensions.”
• Section 5.25, “Intel® Software Guard Extensions.”
• Section 5.26, “Shadow Stack Management Instructions.”
• Section 5.27, “Control Transfer Terminating Instructions.”
• Section 5.28, “Intel® AMX Instructions.”
• Section 5.29, “User Interrupt Instructions.”
• Section 5.30, “Enqueue Store Instructions.”
• Section 5.31, “Intel® Advanced Vector Extensions 10 Version 1 Instructions.”
Table 5-1 lists the groups and IA-32 processors that support each group. More recent instruction set extensions are
listed in Table 5-2. Within these groups, most instructions are collected into functional subgroups.

Table 5-1. Instruction Groups in Intel® 64 and IA-32 Processors


Instruction Set Architecture | Intel 64 and IA-32 Processor Support
General Purpose | All Intel 64 and IA-32 processors.
X87 FPU | Intel486, Pentium, Pentium with MMX Technology, Celeron, Pentium Pro, Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Atom processors.
X87 FPU and SIMD State Management | Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Atom processors.
MMX Technology | Pentium with MMX Technology, Celeron, Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Atom processors.
SSE Extensions | Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Atom processors.
SSE2 Extensions | Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Atom processors.
SSE3 Extensions | Pentium 4 supporting HT Technology (built on 90 nm process technology), Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors, Intel Xeon processor 3xxx, 5xxx, 7xxx Series, Intel Atom processors.
SSSE3 Extensions | Intel Xeon processor 3xxx, 5100, 5200, 5300, 5400, 5500, 5600, 7300, 7400, 7500 series, Intel Core 2 Extreme processors QX6000 series, Intel Core 2 Duo, Intel Core 2 Quad processors, Intel Pentium Dual-Core processors, Intel Atom processors.
IA-32e mode: 64-bit mode instructions | Intel 64 processors.
System Instructions | Intel 64 and IA-32 processors.
VMX Instructions | Intel 64 and IA-32 processors supporting Intel Virtualization Technology.
SMX Instructions | Intel Core 2 Duo processor E6x50, E8xxx; Intel Core 2 Quad processor Q9xxx.

Table 5-2. Instruction Set Extensions Introduction in Intel® 64 and IA-32 Processors

Instruction Set Architecture | Processor Generation Introduction
SSE4.1 Extensions | Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400, 7500 series, Intel® Core™ 2 Extreme processors QX9000 series, Intel® Core™ 2 Quad processor Q9000 series, Intel® Core™ 2 Duo processors 8000 series and T9000 series, Intel Atom® processor based on Silvermont microarchitecture.
SSE4.2 Extensions, CRC32, POPCNT | Intel® Core™ i7 965 processor, Intel® Xeon® processors X3400, X3500, X5500, X6500, X7500 series, Intel Atom processor based on Silvermont microarchitecture.
Intel® AES-NI, PCLMULQDQ | Intel® Xeon® processor E7 series, Intel® Xeon® processors X3600 and X5600, Intel® Core™ i7 980X processor, Intel Atom processor based on Silvermont microarchitecture. Use CPUID to verify presence of Intel AES-NI and PCLMULQDQ across Intel® Core™ processor families.
Intel® AVX | Intel® Xeon® processor E3 and E5 families, 2nd Generation Intel® Core™ i7, i5, i3 processor 2xxx families.
F16C | 3rd Generation Intel® Core™ processors, Intel® Xeon® processor E3-1200 v2 product family, Intel® Xeon® processor E5 v2 and E7 v2 families.
RDRAND | 3rd Generation Intel Core processors, Intel Xeon processor E3-1200 v2 product family, Intel Xeon processor E5 v2 and E7 v2 families, Intel Atom processor based on Silvermont microarchitecture.
FS/GS base access | 3rd Generation Intel Core processors, Intel Xeon processor E3-1200 v2 product family, Intel Xeon processor E5 v2 and E7 v2 families, Intel Atom® processor based on Goldmont microarchitecture.

FMA, AVX2, BMI1, BMI2, INVPCID, LZCNT, Intel® TSX | Intel® Xeon® processor E3/E5/E7 v3 product families, 4th Generation Intel® Core™ processor family.
MOVBE | Intel Xeon processor E3/E5/E7 v3 product families, 4th Generation Intel Core processor family, Intel Atom processors.
PREFETCHW | Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family, Intel Atom processor based on Silvermont microarchitecture.
ADX | Intel Core M processor family, 5th Generation Intel Core processor family.
RDSEED, CLAC, STAC | Intel Core M processor family, 5th Generation Intel Core processor family, Intel Atom processor based on Goldmont microarchitecture.
AVX512ER, AVX512PF, PREFETCHWT1 | Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series.
AVX512F, AVX512CD | Intel Xeon Phi Processor 3200, 5200, 7200 Series, Intel® Xeon® Scalable Processor Family, Intel® Core™ i3-8121U processor.
CLFLUSHOPT, XSAVEC, XSAVES, Intel® MPX | Intel Xeon Scalable Processor Family, 6th Generation Intel® Core™ processor family, Intel Atom processor based on Goldmont microarchitecture.
SGX1 | 6th Generation Intel Core processor family, Intel Atom® processor based on Goldmont Plus microarchitecture.
AVX512DQ, AVX512BW, AVX512VL | Intel Xeon Scalable Processor Family, Intel Core i3-8121U processor based on Cannon Lake microarchitecture.
CLWB | Intel Xeon Scalable Processor Family, Intel Atom® processor based on Tremont microarchitecture, 11th Generation Intel Core processor family based on Tiger Lake microarchitecture.
PKU | Intel Xeon Scalable Processor Family, 10th generation Intel® Core™ processors based on Comet Lake microarchitecture.
AVX512_IFMA, AVX512_VBMI | Intel Core i3-8121U processor based on Cannon Lake microarchitecture.
Intel® SHA Extensions | Intel Core i3-8121U processor based on Cannon Lake microarchitecture, Intel Atom processor based on Goldmont microarchitecture, 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
UMIP | Intel Core i3-8121U processor based on Cannon Lake microarchitecture, Intel Atom processor based on Goldmont Plus microarchitecture.
PTWRITE | Intel Atom processor based on Goldmont Plus microarchitecture, 12th generation Intel® Core™ processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
RDPID | 10th Generation Intel® Core™ processor family based on Ice Lake microarchitecture, Intel Atom processor based on Goldmont Plus microarchitecture.
AVX512_4FMAPS, AVX512_4VNNIW | Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series.
AVX512_VNNI | 2nd Generation Intel® Xeon® Scalable Processor Family, 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
AVX512_VPOPCNTDQ | Intel Xeon Phi Processor 7215, 7285, 7295 Series, 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
Fast Short REP MOV | 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
GFNI (SSE) | 10th Generation Intel Core processor family based on Ice Lake microarchitecture, Intel Atom processor based on Tremont microarchitecture.

VAES, GFNI (AVX/AVX512), AVX512_VBMI2, VPCLMULQDQ, AVX512_BITALG | 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
ENCLV | Future processors.
Split Lock Detection | 10th Generation Intel Core processor family based on Ice Lake microarchitecture, Intel Atom processor based on Tremont microarchitecture.
CLDEMOTE | Intel Atom processor based on Tremont microarchitecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Direct stores: MOVDIRI, MOVDIR64B | Intel Atom processor based on Tremont microarchitecture, 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
User wait: TPAUSE, UMONITOR, UMWAIT | Intel Atom processor based on Tremont microarchitecture, 12th generation Intel Core processor based on Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
AVX512_BF16 | 3rd Generation Intel® Xeon® Scalable Processor Family based on Cooper Lake product, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
AVX512_VP2INTERSECT | 11th Generation Intel Core processor family based on Tiger Lake microarchitecture. (Not currently supported in any other processors.)
Key Locker (Note 1) | 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture.
Control-flow Enforcement Technology (CET) | 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
TME-MK (Note 2), PCONFIG | 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
WBNOINVD | 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
LBRs (architectural) | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® Virtualization Technology - Redirect Protection (Intel® VT-rp) and HLAT | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
AVX-VNNI | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture (Note 3), 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
SERIALIZE | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® Thread Director and HRESET | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture.
Fast zero-length REP MOVSB, fast short REP STOSB | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Fast Short REP CMPSB, fast short REP SCASB | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.

Supervisor Memory Protection Keys (PKS) | 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Attestation Services for Intel® SGX | 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
Enqueue Stores: ENQCMD and ENQCMDS | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® TSX Suspend Load Address Tracking (TSXLDTRK) | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Intel® Advanced Matrix Extensions (Intel® AMX), including CPUID Leaf 1EH, “TMUL Information Main Leaf,” and CPUID bits AMX-BF16, AMX-TILE, and AMX-INT8 | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
User Interrupts (UINTR) | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
IPI Virtualization | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
AVX512-FP16, for the FP16 Data Type | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Virtualization of guest accesses to IA32_SPEC_CTRL | 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Linear Address Masking (LAM) | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
Linear Address Space Separation (LASS) | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
PREFETCHIT0/1 | Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
AMX-FP16 | Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
CMPCCXADD | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
AVX-IFMA | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
AVX-NE-CONVERT | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
AVX-VNNI-INT8 | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
AVX-VNNI-INT16 | Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SHA512 | Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SM3 | Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SM4 | Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.

RDMSRLIST, WRMSRLIST, and WRMSRNS | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
UC Lock Disable Causes #AC | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
LBR Event Logging | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
UIRET flexibly updates UIF | Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
Intel® Advanced Vector Extensions 10 Version 1 (Intel® AVX10.1) | Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
NOTES:
1. Details on Key Locker can be found in the Intel Key Locker Specification here:
https://software.intel.com/content/www/us/en/develop/download/intel-key-locker-specification.html.
2. Further details on TME-MK usage can be found here:
https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf.
3. Alder Lake performance hybrid architecture does not support Intel® AVX-512. ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2,
and UMONITOR/UMWAIT/TPAUSE are supported.

The following sections list instructions in each major group and subgroup. Each instruction is given with its mnemonic and descriptive name. When two or more mnemonics are given (for example, CMOVA/CMOVNBE), they
represent different mnemonics for the same instruction opcode. Assemblers support redundant mnemonics for
some instructions to make it easier to read code listings. For instance, CMOVA (Conditional move if above) and
CMOVNBE (Conditional move if not below or equal) represent the same condition. For detailed information about
specific instructions, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C,
& 2D.

5.1 GENERAL-PURPOSE INSTRUCTIONS


The general-purpose instructions perform basic data movement, arithmetic, logic, program flow, and string opera-
tions that programmers commonly use to write application and system software to run on Intel 64 and IA-32
processors. They operate on data contained in memory, in the general-purpose registers (EAX, EBX, ECX, EDX,
EDI, ESI, EBP, and ESP) and in the EFLAGS register. They also operate on address information contained in
memory, the general-purpose registers, and the segment registers (CS, DS, SS, ES, FS, and GS).
This group of instructions includes the data transfer, binary integer arithmetic, decimal arithmetic, logic operations,
shift and rotate, bit and byte operations, program control, string, flag control, segment register operations, and
miscellaneous subgroups. The sections that follow introduce each subgroup.
For more detailed information on general-purpose instructions, see Chapter 7, “Programming With General-
Purpose Instructions.”

5.1.1 Data Transfer Instructions


The data transfer instructions move data between memory and the general-purpose and segment registers. They
also perform specific operations such as conditional moves, stack access, and data conversion.
MOV Move data between general-purpose registers; move data between memory and general-
purpose or segment registers; move immediates to general-purpose registers.
CMOVE/CMOVZ Conditional move if equal/Conditional move if zero.
CMOVNE/CMOVNZ Conditional move if not equal/Conditional move if not zero.
CMOVA/CMOVNBE Conditional move if above/Conditional move if not below or equal.
CMOVAE/CMOVNB Conditional move if above or equal/Conditional move if not below.
CMOVB/CMOVNAE Conditional move if below/Conditional move if not above or equal.
CMOVBE/CMOVNA Conditional move if below or equal/Conditional move if not above.
CMOVG/CMOVNLE Conditional move if greater/Conditional move if not less or equal.
CMOVGE/CMOVNL Conditional move if greater or equal/Conditional move if not less.
CMOVL/CMOVNGE Conditional move if less/Conditional move if not greater or equal.
CMOVLE/CMOVNG Conditional move if less or equal/Conditional move if not greater.
CMOVC Conditional move if carry.
CMOVNC Conditional move if not carry.
CMOVO Conditional move if overflow.
CMOVNO Conditional move if not overflow.
CMOVS Conditional move if sign (negative).
CMOVNS Conditional move if not sign (non-negative).
CMOVP/CMOVPE Conditional move if parity/Conditional move if parity even.
CMOVNP/CMOVPO Conditional move if not parity/Conditional move if parity odd.
XCHG Exchange.
BSWAP Byte swap.
XADD Exchange and add.
CMPXCHG Compare and exchange.
CMPXCHG8B Compare and exchange 8 bytes.
PUSH Push onto stack.
POP Pop off of stack.
PUSHA/PUSHAD Push general-purpose registers onto stack.
POPA/POPAD Pop general-purpose registers from stack.
CWD/CDQ Convert word to doubleword/Convert doubleword to quadword.
CBW/CWDE Convert byte to word/Convert word to doubleword in EAX register.
MOVSX Move and sign extend.
MOVZX Move and zero extend.
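Several of these instructions are reachable from portable C without hand-written assembly. As an illustration only (assuming a C11 compiler targeting x86; the function name is illustrative, not part of the architecture), a compare-and-swap retry loop of the kind compilers typically lower to a MOV load followed by LOCK CMPXCHG:

    #include <stdatomic.h>

    /* Atomically increment *ctr; the compare-exchange below is
       typically emitted as LOCK CMPXCHG on x86 processors. */
    static void increment(_Atomic unsigned *ctr)
    {
        unsigned old = atomic_load(ctr);
        /* Retry until no other thread changed *ctr between the
           load and the exchange; on failure, 'old' is reloaded
           with the current value. */
        while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
            ;
    }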

5.1.2 Binary Arithmetic Instructions


The binary arithmetic instructions perform basic binary integer computations on byte, word, and doubleword inte-
gers located in memory and/or the general purpose registers.
ADCX Unsigned integer add with carry.
ADOX Unsigned integer add with overflow.
ADD Integer add.
ADC Add with carry.
SUB Subtract.
SBB Subtract with borrow.
IMUL Signed multiply.
MUL Unsigned multiply.
IDIV Signed divide.
DIV Unsigned divide.
INC Increment.
DEC Decrement.
NEG Negate.
CMP Compare.
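ADD/ADC pairs implement multi-precision arithmetic by propagating the carry flag. A minimal sketch using the _addcarry_u64 intrinsic from <immintrin.h> (available on GCC, Clang, and MSVC; the helper name add128 is illustrative):

    #include <immintrin.h>
    #include <stdint.h>

    /* 128-bit add from two 64-bit halves: the carry out of the
       low halves feeds the high halves, as ADD followed by ADC
       would. */
    static void add128(const uint64_t a[2], const uint64_t b[2], uint64_t sum[2])
    {
        unsigned char carry =
            _addcarry_u64(0, a[0], b[0], (unsigned long long *)&sum[0]);
        _addcarry_u64(carry, a[1], b[1], (unsigned long long *)&sum[1]);
    }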

5.1.3 Decimal Arithmetic Instructions


The decimal arithmetic instructions perform decimal arithmetic on binary coded decimal (BCD) data.
DAA Decimal adjust after addition.
DAS Decimal adjust after subtraction.
AAA ASCII adjust after addition.
AAS ASCII adjust after subtraction.
AAM ASCII adjust after multiplication.
AAD ASCII adjust before division.

5.1.4 Logical Instructions


The logical instructions perform basic AND, OR, XOR, and NOT logical operations on byte, word, and doubleword
values.
AND Perform bitwise logical AND.
OR Perform bitwise logical OR.
XOR Perform bitwise logical exclusive OR.
NOT Perform bitwise logical NOT.

5.1.5 Shift and Rotate Instructions


The shift and rotate instructions shift and rotate the bits in word and doubleword operands.
SAR Shift arithmetic right.
SHR Shift logical right.
SAL/SHL Shift arithmetic left/Shift logical left.
SHRD Shift right double.
SHLD Shift left double.
ROR Rotate right.
ROL Rotate left.
RCR Rotate through carry right.
RCL Rotate through carry left.

5.1.6 Bit and Byte Instructions


Bit instructions test and modify individual bits in word and doubleword operands. Byte instructions set the value of
a byte operand to indicate the status of flags in the EFLAGS register.
BT Bit test.
BTS Bit test and set.
BTR Bit test and reset.
BTC Bit test and complement.
BSF Bit scan forward.
BSR Bit scan reverse.
SETE/SETZ Set byte if equal/Set byte if zero.
SETNE/SETNZ Set byte if not equal/Set byte if not zero.
SETA/SETNBE Set byte if above/Set byte if not below or equal.
SETAE/SETNB/SETNC Set byte if above or equal/Set byte if not below/Set byte if not carry.
SETB/SETNAE/SETC Set byte if below/Set byte if not above or equal/Set byte if carry.
SETBE/SETNA Set byte if below or equal/Set byte if not above.
SETG/SETNLE Set byte if greater/Set byte if not less or equal.
SETGE/SETNL Set byte if greater or equal/Set byte if not less.
SETL/SETNGE Set byte if less/Set byte if not greater or equal.
SETLE/SETNG Set byte if less or equal/Set byte if not greater.
SETS Set byte if sign (negative).
SETNS Set byte if not sign (non-negative).
SETO Set byte if overflow.
SETNO Set byte if not overflow.
SETPE/SETP Set byte if parity even/Set byte if parity.
SETPO/SETNP Set byte if parity odd/Set byte if not parity.
TEST Logical compare.
CRC32 (Note 1) Provides hardware acceleration to calculate cyclic redundancy checks for fast and efficient implementation of data integrity protocols.
POPCNT (Note 2) Calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register).
NOTES:
1. Processor support of CRC32 is enumerated by CPUID.01:ECX[SSE4.2] = 1.
2. Processor support of POPCNT is enumerated by CPUID.01:ECX[POPCNT] = 1.
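A brief sketch of both instructions through the SSE4.2 intrinsics in <nmmintrin.h> (assuming a GCC or Clang toolchain with -msse4.2 and a prior CPUID check; the helper names are illustrative):

    #include <nmmintrin.h>   /* SSE4.2: _mm_crc32_*, _mm_popcnt_* */
    #include <stddef.h>
    #include <stdint.h>

    /* CRC-32C of a buffer, one byte per CRC32 instruction. */
    static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t n)
    {
        while (n--)
            crc = _mm_crc32_u8(crc, *p++);
        return crc;
    }

    /* Number of bits set to 1 (the POPCNT instruction). */
    static int bits_set(uint32_t v)
    {
        return _mm_popcnt_u32(v);
    }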

5.1.7 Control Transfer Instructions


The control transfer instructions provide jump, conditional jump, loop, and call and return operations to control
program flow.
JMP Jump.
JE/JZ Jump if equal/Jump if zero.
JNE/JNZ Jump if not equal/Jump if not zero.
JA/JNBE Jump if above/Jump if not below or equal.
JAE/JNB Jump if above or equal/Jump if not below.
JB/JNAE Jump if below/Jump if not above or equal.
JBE/JNA Jump if below or equal/Jump if not above.
JG/JNLE Jump if greater/Jump if not less or equal.
JGE/JNL Jump if greater or equal/Jump if not less.
JL/JNGE Jump if less/Jump if not greater or equal.
JLE/JNG Jump if less or equal/Jump if not greater.
JC Jump if carry.
JNC Jump if not carry.
JO Jump if overflow.
JNO Jump if not overflow.
JS Jump if sign (negative).
JNS Jump if not sign (non-negative).
JPO/JNP Jump if parity odd/Jump if not parity.
JPE/JP Jump if parity even/Jump if parity.
JCXZ/JECXZ Jump register CX zero/Jump register ECX zero.
LOOP Loop with ECX counter.
LOOPZ/LOOPE Loop with ECX and zero/Loop with ECX and equal.
LOOPNZ/LOOPNE Loop with ECX and not zero/Loop with ECX and not equal.
CALL Call procedure.
RET Return.
IRET Return from interrupt.
INT Software interrupt.
INTO Interrupt on overflow.
BOUND Detect value out of range.
ENTER High-level procedure entry.
LEAVE High-level procedure exit.

5.1.8 String Instructions


The string instructions operate on strings of bytes, allowing them to be moved to and from memory.
MOVS/MOVSB Move string/Move byte string.
MOVS/MOVSW Move string/Move word string.
MOVS/MOVSD Move string/Move doubleword string.
CMPS/CMPSB Compare string/Compare byte string.
CMPS/CMPSW Compare string/Compare word string.
CMPS/CMPSD Compare string/Compare doubleword string.
SCAS/SCASB Scan string/Scan byte string.
SCAS/SCASW Scan string/Scan word string.
SCAS/SCASD Scan string/Scan doubleword string.
LODS/LODSB Load string/Load byte string.
LODS/LODSW Load string/Load word string.
LODS/LODSD Load string/Load doubleword string.
STOS/STOSB Store string/Store byte string.
STOS/STOSW Store string/Store word string.
STOS/STOSD Store string/Store doubleword string.
REP Repeat while ECX not zero.
REPE/REPZ Repeat while equal/Repeat while zero.
REPNE/REPNZ Repeat while not equal/Repeat while not zero.
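The REP prefix combines with MOVSB to form a compact memory copy: RCX holds the count and RSI/RDI the source and destination. A minimal GNU inline-asm sketch for x86-64 (the helper name copy_bytes is illustrative):

    #include <stddef.h>

    /* Byte copy via REP MOVSB; MOVSB repeats until RCX reaches
       zero, advancing RSI and RDI per the direction flag. */
    static void copy_bytes(void *dst, const void *src, size_t n)
    {
        __asm__ volatile ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    }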

5.1.9 I/O Instructions


These instructions move data between the processor’s I/O ports and a register or memory.
IN Read from a port.
OUT Write to a port.
INS/INSB Input string from port/Input byte string from port.
INS/INSW Input string from port/Input word string from port.
INS/INSD Input string from port/Input doubleword string from port.
OUTS/OUTSB Output string to port/Output byte string to port.
OUTS/OUTSW Output string to port/Output word string to port.
OUTS/OUTSD Output string to port/Output doubleword string to port.
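A hedged sketch of IN and OUT through GNU inline asm; these instructions fault unless executed at ring 0 or with sufficient I/O privilege, so this pattern applies only to kernel-level code:

    #include <stdint.h>

    /* Read one byte from an I/O port (IN AL, DX form). */
    static inline uint8_t inb(uint16_t port)
    {
        uint8_t v;
        __asm__ volatile ("inb %1, %0" : "=a" (v) : "Nd" (port));
        return v;
    }

    /* Write one byte to an I/O port (OUT DX, AL form). */
    static inline void outb(uint16_t port, uint8_t v)
    {
        __asm__ volatile ("outb %0, %1" : : "a" (v), "Nd" (port));
    }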

5.1.10 Enter and Leave Instructions


These instructions provide machine-language support for procedure calls in block-structured languages.
ENTER High-level procedure entry.
LEAVE High-level procedure exit.

5.1.11 Flag Control (EFLAG) Instructions


The flag control instructions operate on the flags in the EFLAGS register.
STC Set carry flag.
CLC Clear the carry flag.
CMC Complement the carry flag.
CLD Clear the direction flag.
STD Set direction flag.
LAHF Load flags into AH register.
SAHF Store AH register into flags.
PUSHF/PUSHFD Push EFLAGS onto stack.
POPF/POPFD Pop EFLAGS from stack.
STI Set interrupt flag.
CLI Clear the interrupt flag.

5.1.12 Segment Register Instructions


The segment register instructions allow far pointers (segment addresses) to be loaded into the segment registers.
LDS Load far pointer using DS.
LES Load far pointer using ES.
LFS Load far pointer using FS.
LGS Load far pointer using GS.
LSS Load far pointer using SS.

5.1.13 Miscellaneous Instructions


The miscellaneous instructions provide such functions as loading an effective address, executing a “no-operation,”
and retrieving processor identification information.
LEA Load effective address.
NOP No operation.
UD Undefined instruction.
XLAT/XLATB Table lookup translation.
CPUID Processor identification.
MOVBE (Note 1) Move data after swapping data bytes.
PREFETCHW Prefetch data into cache in anticipation of write.
PREFETCHWT1 Prefetch hint T1 with intent to write.
CLFLUSH Flushes and invalidates a memory operand and its associated cache line from all levels of
the processor’s cache hierarchy.
CLFLUSHOPT Flushes and invalidates a memory operand and its associated cache line from all levels of
the processor’s cache hierarchy with optimized memory system throughput.

1. Processor support of MOVBE is enumerated by CPUID.01:ECX.MOVBE[bit 22] = 1.

5.1.14 User Mode Extended State Save/Restore Instructions


XSAVE Save processor extended states to memory.
XSAVEC Save processor extended states with compaction to memory.
XSAVEOPT Save processor extended states to memory, optimized.
XRSTOR Restore processor extended states from memory.
XGETBV Reads the state of an extended control register.
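A minimal sketch reading XCR0 with the _xgetbv intrinsic (GCC/Clang with -mxsave; CPUID.1:ECX.OSXSAVE should be verified before executing XGETBV; the function name is illustrative):

    #include <immintrin.h>

    /* XCR0 bit 1 (XMM state) and bit 2 (YMM state) must both be
       set by the OS before AVX instructions can be used safely. */
    static int os_enabled_avx(void)
    {
        unsigned long long xcr0 = _xgetbv(0);   /* 0 selects XCR0 */
        return (xcr0 & 0x6) == 0x6;
    }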

5.1.15 Random Number Generator Instructions


RDRAND Retrieves a random number generated from hardware.
RDSEED Retrieves a random number generated from hardware.
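RDRAND can transiently fail (it clears the carry flag when no entropy is available), so the usual pattern is a bounded retry loop. A sketch with the _rdrand64_step intrinsic (GCC/Clang with -mrdrnd; the helper name is illustrative):

    #include <immintrin.h>
    #include <stdint.h>

    /* Returns 1 and fills *out on success; 0 if the hardware
       random number generator was temporarily unavailable. */
    static int random_u64(uint64_t *out)
    {
        for (int i = 0; i < 10; i++)
            if (_rdrand64_step((unsigned long long *)out))
                return 1;
        return 0;
    }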

5.1.16 BMI1 and BMI2 Instructions


ANDN Bitwise AND of first source with inverted second source operands.
BEXTR Contiguous bitwise extract.
BLSI Extract lowest set bit.
BLSMSK Set all lower bits below first set bit to 1.
BLSR Reset lowest set bit.
BZHI Zero high bits starting from specified bit position.
LZCNT Count the number of leading zero bits.
MULX Unsigned multiply without affecting arithmetic flags.
PDEP Parallel deposit of bits using a mask.
PEXT Parallel extraction of bits using a mask.
RORX Rotate right without affecting arithmetic flags.
SARX Shift arithmetic right.
SHLX Shift logic left.
SHRX Shift logic right.
TZCNT Count the number of trailing zero bits.
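A brief sketch of PDEP, PEXT, and TZCNT through the BMI intrinsics in <immintrin.h> (assuming -mbmi -mbmi2 and a prior CPUID check; the helper names are illustrative):

    #include <immintrin.h>
    #include <stdint.h>

    /* PEXT gathers the bits selected by 'mask' into the low bits
       of the result; PDEP scatters low bits back out to the mask
       positions. */
    static uint64_t gather_bits(uint64_t value, uint64_t mask)
    {
        return _pext_u64(value, mask);
    }

    static uint64_t scatter_bits(uint64_t bits, uint64_t mask)
    {
        return _pdep_u64(bits, mask);
    }

    /* TZCNT: index of the lowest set bit (64 when the input is 0). */
    static unsigned lowest_set_bit(uint64_t v)
    {
        return (unsigned)_tzcnt_u64(v);
    }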

5.1.16.1 Detection of VEX-Encoded GPR Instructions, LZCNT, TZCNT, and PREFETCHW


VEX-encoded general-purpose instructions do not operate on any vector registers.
There are separate feature flags for the following subsets of instructions that operate on general-purpose registers; the detection requirements for hardware support are:
• CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]: if 1, indicates the processor supports the first group of advanced bit manipulation extensions (ANDN, BEXTR, BLSI, BLSMSK, BLSR, TZCNT).
• CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]: if 1, indicates the processor supports the second group of advanced bit manipulation extensions (BZHI, MULX, PDEP, PEXT, RORX, SARX, SHLX, SHRX).
• CPUID.EAX=80000001H:ECX.LZCNT[bit 5]: if 1, indicates the processor supports the LZCNT instruction.
• CPUID.EAX=80000001H:ECX.PREFETCHW[bit 8]: if 1, indicates the processor supports the PREFETCHW instruction.
• CPUID.(EAX=07H, ECX=0H):ECX.PREFETCHWT1[bit 0]: if 1, indicates the processor supports the PREFETCHWT1 instruction.
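These checks map directly onto the __get_cpuid helpers that GCC and Clang ship in <cpuid.h>; a minimal sketch (the function names are illustrative):

    #include <cpuid.h>   /* GCC/Clang CPUID helpers */

    /* Leaf 7, subleaf 0: EBX[3] = BMI1, EBX[8] = BMI2. */
    static int detect_bmi(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return ((ebx >> 3) & 1) && ((ebx >> 8) & 1);
    }

    /* Leaf 80000001H: ECX[5] = LZCNT. */
    static int detect_lzcnt(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ecx >> 5) & 1;
    }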

5.2 X87 FPU INSTRUCTIONS


The x87 FPU instructions are executed by the processor’s x87 FPU. These instructions operate on floating-point,
integer, and binary-coded decimal (BCD) operands. For more detail on x87 FPU instructions, see Chapter 8,
“Programming with the x87 FPU.”
These instructions are divided into the following subgroups: data transfer, load constants, and FPU control instruc-
tions. The sections that follow introduce each subgroup.

5.2.1 X87 FPU Data Transfer Instructions


The data transfer instructions move floating-point, integer, and BCD values between memory and the x87 FPU
registers. They also perform conditional move operations on floating-point operands.
FLD Load floating-point value.
FST Store floating-point value.
FSTP Store floating-point value and pop.
FILD Load integer.
FIST Store integer.
FISTP Store integer and pop. (Note: SSE3 provides the FISTTP instruction for integer conversion.)
FBLD Load BCD.
FBSTP Store BCD and pop.
FXCH Exchange registers.
FCMOVE Floating-point conditional move if equal.
FCMOVNE Floating-point conditional move if not equal.
FCMOVB Floating-point conditional move if below.
FCMOVBE Floating-point conditional move if below or equal.
FCMOVNB Floating-point conditional move if not below.
FCMOVNBE Floating-point conditional move if not below or equal.
FCMOVU Floating-point conditional move if unordered.
FCMOVNU Floating-point conditional move if not unordered.

5.2.2 X87 FPU Basic Arithmetic Instructions


The basic arithmetic instructions perform basic arithmetic operations on floating-point and integer operands.
FADD Add floating-point.
FADDP Add floating-point and pop.
FIADD Add integer.
FSUB Subtract floating-point.
FSUBP Subtract floating-point and pop.
FISUB Subtract integer.
FSUBR Subtract floating-point reverse.
FSUBRP Subtract floating-point reverse and pop.
FISUBR Subtract integer reverse.
FMUL Multiply floating-point.
FMULP Multiply floating-point and pop.
FIMUL Multiply integer.
FDIV Divide floating-point.
FDIVP Divide floating-point and pop.
FIDIV Divide integer.
FDIVR Divide floating-point reverse.
FDIVRP Divide floating-point reverse and pop.
FIDIVR Divide integer reverse.
FPREM Partial remainder.
FPREM1 IEEE partial remainder.
FABS Absolute value.
FCHS Change sign.
FRNDINT Round to integer.
FSCALE Scale by power of two.
FSQRT Square root.
FXTRACT Extract exponent and significand.

5.2.3 X87 FPU Comparison Instructions


The compare instructions examine or compare floating-point or integer operands.
FCOM Compare floating-point.
FCOMP Compare floating-point and pop.
FCOMPP Compare floating-point and pop twice.
FUCOM Unordered compare floating-point.
FUCOMP Unordered compare floating-point and pop.
FUCOMPP Unordered compare floating-point and pop twice.
FICOM Compare integer.
FICOMP Compare integer and pop.
FCOMI Compare floating-point and set EFLAGS.
FUCOMI Unordered compare floating-point and set EFLAGS.
FCOMIP Compare floating-point, set EFLAGS, and pop.
FUCOMIP Unordered compare floating-point, set EFLAGS, and pop.
FTST Test floating-point (compare with 0.0).
FXAM Examine floating-point.

5.2.4 X87 FPU Transcendental Instructions


The transcendental instructions perform basic trigonometric and logarithmic operations on floating-point operands.
FSIN Sine.
FCOS Cosine.
FSINCOS Sine and cosine.
FPTAN Partial tangent.
FPATAN Partial arctangent.
F2XM1 2^x − 1.
FYL2X y * log2(x).
FYL2XP1 y * log2(x + 1).
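FYL2X computes ST(1) * log2(ST(0)), so loading 1.0 as y yields a plain base-2 logarithm. A minimal GNU inline-asm sketch for x87-capable targets (the wrapper name is illustrative; "t" and "u" pin operands to ST(0) and ST(1)):

    /* FYL2X pops the stack after computing ST(1) * log2(ST(0)),
       leaving the result in ST(0). */
    static double log2_x87(double x)
    {
        double r;
        __asm__ ("fyl2x" : "=t" (r) : "0" (x), "u" (1.0) : "st(1)");
        return r;
    }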

5.2.5 X87 FPU Load Constants Instructions


The load constants instructions load common constants, such as π, into the x87 floating-point registers.
FLD1 Load +1.0.
FLDZ Load +0.0.
FLDPI Load π.
FLDL2E Load log2(e).
FLDLN2 Load loge(2).
FLDL2T Load log2(10).
FLDLG2 Load log10(2).

5.2.6 X87 FPU Control Instructions


The x87 FPU control instructions operate on the x87 FPU register stack and save and restore the x87 FPU state.
FINCSTP Increment FPU register stack pointer.
FDECSTP Decrement FPU register stack pointer.
FFREE Free floating-point register.
FINIT Initialize FPU after checking error conditions.
FNINIT Initialize FPU without checking error conditions.
FCLEX Clear floating-point exception flags after checking for error conditions.
FNCLEX Clear floating-point exception flags without checking for error conditions.
FSTCW Store FPU control word after checking error conditions.
FNSTCW Store FPU control word without checking error conditions.
FLDCW Load FPU control word.
FSTENV Store FPU environment after checking error conditions.
FNSTENV Store FPU environment without checking error conditions.
FLDENV Load FPU environment.
FSAVE Save FPU state after checking error conditions.
FNSAVE Save FPU state without checking error conditions.
FRSTOR Restore FPU state.
FSTSW Store FPU status word after checking error conditions.
FNSTSW Store FPU status word without checking error conditions.
WAIT/FWAIT Wait for FPU.
FNOP FPU no operation.

5.3 X87 FPU AND SIMD STATE MANAGEMENT INSTRUCTIONS


Two state management instructions were introduced into the IA-32 architecture with the Pentium II processor
family:
FXSAVE Save x87 FPU and SIMD state.
FXRSTOR Restore x87 FPU and SIMD state.
Initially, these instructions operated only on the x87 FPU (and MMX) registers to perform a fast save and restore,
respectively, of the x87 FPU and MMX state. With the introduction of SSE extensions in the Pentium III processor
family, these instructions were expanded to also save and restore the state of the XMM and MXCSR registers. Intel
64 architecture also supports these instructions.
See Section 10.5, “FXSAVE and FXRSTOR Instructions,” for more detail.
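A minimal user-level sketch of the pair through the _fxsave/_fxrstor intrinsics (GCC/Clang with -mfxsr; the save area must be 512 bytes and 16-byte aligned, and the aligned-attribute syntax shown is GCC/Clang-specific):

    #include <immintrin.h>
    #include <stdint.h>

    static void snapshot_fp_state(void)
    {
        static uint8_t area[512] __attribute__((aligned(16)));

        _fxsave(area);    /* save x87 FPU, MMX, XMM, and MXCSR state */
        /* ... code that may perturb the FP/SIMD state ... */
        _fxrstor(area);   /* restore the saved state */
    }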

5.4 MMX INSTRUCTIONS


Four extensions have been introduced into the IA-32 architecture to permit IA-32 processors to perform single-
instruction multiple-data (SIMD) operations. These extensions include the MMX technology, SSE extensions, SSE2
extensions, and SSE3 extensions. For a discussion that puts SIMD instructions in their historical context, see
Section 2.2.7, “SIMD Instructions.”
MMX instructions operate on packed byte, word, doubleword, or quadword integer operands contained in memory,
in MMX registers, and/or in general-purpose registers. For more detail on these instructions, see Chapter 9,
“Programming with Intel® MMX™ Technology.”
MMX instructions can only be executed on Intel 64 and IA-32 processors that support the MMX technology. Support
for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in
Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 2A.
MMX instructions are divided into the following subgroups: data transfer, conversion, packed arithmetic, compar-
ison, logical, shift and rotate, and state management instructions. The sections that follow introduce each
subgroup.

5.4.1 MMX Data Transfer Instructions


The data transfer instructions move doubleword and quadword operands between MMX registers and between MMX
registers and memory.
MOVD Move doubleword.
MOVQ Move quadword.

5.4.2 MMX Conversion Instructions


The conversion instructions pack and unpack bytes, words, and doublewords.
PACKSSWB Pack words into bytes with signed saturation.
PACKSSDW Pack doublewords into words with signed saturation.
PACKUSWB Pack words into bytes with unsigned saturation.
PUNPCKHBW Unpack high-order bytes.
PUNPCKHWD Unpack high-order words.
PUNPCKHDQ Unpack high-order doublewords.
PUNPCKLBW Unpack low-order bytes.
PUNPCKLWD Unpack low-order words.
PUNPCKLDQ Unpack low-order doublewords.

5.4.3 MMX Packed Arithmetic Instructions


The packed arithmetic instructions perform packed integer arithmetic on packed byte, word, and doubleword inte-
gers.
PADDB Add packed byte integers.
PADDW Add packed word integers.
PADDD Add packed doubleword integers.
PADDSB Add packed signed byte integers with signed saturation.
PADDSW Add packed signed word integers with signed saturation.
PADDUSB Add packed unsigned byte integers with unsigned saturation.
PADDUSW Add packed unsigned word integers with unsigned saturation.
PSUBB Subtract packed byte integers.
PSUBW Subtract packed word integers.
PSUBD Subtract packed doubleword integers.
PSUBSB Subtract packed signed byte integers with signed saturation.
PSUBSW Subtract packed signed word integers with signed saturation.
PSUBUSB Subtract packed unsigned byte integers with unsigned saturation.
PSUBUSW Subtract packed unsigned word integers with unsigned saturation.
PMULHW Multiply packed signed word integers and store high result.
PMULLW Multiply packed signed word integers and store low result.
PMADDWD Multiply and add packed word integers.

5.4.4 MMX Comparison Instructions


The compare instructions compare packed bytes, words, or doublewords.
PCMPEQB Compare packed bytes for equal.
PCMPEQW Compare packed words for equal.
PCMPEQD Compare packed doublewords for equal.
PCMPGTB Compare packed signed byte integers for greater than.
PCMPGTW Compare packed signed word integers for greater than.
PCMPGTD Compare packed signed doubleword integers for greater than.

5.4.5 MMX Logical Instructions


The logical instructions perform AND, AND NOT, OR, and XOR operations on quadword operands.
PAND Bitwise logical AND.
PANDN Bitwise logical AND NOT.
POR Bitwise logical OR.
PXOR Bitwise logical exclusive OR.

5.4.6 MMX Shift and Rotate Instructions


The shift and rotate instructions shift and rotate packed words, doublewords, or quadwords in 64-bit operands.
PSLLW Shift packed words left logical.
PSLLD Shift packed doublewords left logical.
PSLLQ Shift packed quadword left logical.
PSRLW Shift packed words right logical.
PSRLD Shift packed doublewords right logical.
PSRLQ Shift packed quadword right logical.
PSRAW Shift packed words right arithmetic.
PSRAD Shift packed doublewords right arithmetic.

5.4.7 MMX State Management Instructions


The EMMS instruction clears the MMX state from the MMX registers.
EMMS Empty MMX state.
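Because the MMX registers alias the x87 register stack, EMMS must be executed before subsequent x87 code. A brief sketch using the MMX intrinsics in <mmintrin.h> (assuming -mmmx; on modern processors the SSE2 forms are generally preferred; the helper name is illustrative):

    #include <mmintrin.h>
    #include <string.h>

    /* Packed signed-saturating add of four 16-bit lanes (PADDSW),
       followed by EMMS so later x87 code sees a clean stack. */
    static void add4_sat(const short a[4], const short b[4], short out[4])
    {
        __m64 va, vb, vr;
        memcpy(&va, a, 8);            /* MOVQ load */
        memcpy(&vb, b, 8);
        vr = _mm_adds_pi16(va, vb);   /* PADDSW: signed saturation */
        memcpy(out, &vr, 8);          /* MOVQ store */
        _mm_empty();                  /* EMMS: clear MMX state */
    }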

5.5 INTEL® SSE INSTRUCTIONS


Intel SSE instructions represent an extension of the SIMD execution model introduced with the MMX technology.
For more detail on these instructions, see Chapter 10, “Programming with Intel® Streaming SIMD Extensions
(Intel® SSE).”

Intel SSE instructions can only be executed on Intel 64 and IA-32 processors that support Intel SSE extensions.
Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruc-
tion in Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2A.
Intel SSE instructions are divided into four subgroups (note that the first subgroup has subordinate subgroups of
its own):
• SIMD single precision floating-point instructions that operate on the XMM registers.
• MXCSR state management instructions.
• 64-bit SIMD integer instructions that operate on the MMX registers.
• Cacheability control, prefetch, and instruction ordering instructions.
The following sections provide an overview of these groups.

5.5.1 Intel® SSE SIMD Single Precision Floating-Point Instructions


These instructions operate on packed and scalar single precision floating-point values located in XMM registers
and/or memory. This subgroup is further divided into the following subordinate subgroups: data transfer, packed
arithmetic, comparison, logical, shuffle and unpack, and conversion instructions.

5.5.1.1 Intel® SSE Data Transfer Instructions


Intel SSE data transfer instructions move packed and scalar single precision floating-point operands between XMM
registers and between XMM registers and memory.
MOVAPS Move four aligned packed single precision floating-point values between XMM registers or
between an XMM register and memory.
MOVUPS Move four unaligned packed single precision floating-point values between XMM registers
or between an XMM register and memory.
MOVHPS Move two packed single precision floating-point values to and from the high quadword of
an XMM register and memory.
MOVHLPS Move two packed single precision floating-point values from the high quadword of an XMM
register to the low quadword of another XMM register.
MOVLPS Move two packed single precision floating-point values to and from the low quadword of an
XMM register and memory.
MOVLHPS Move two packed single precision floating-point values from the low quadword of an XMM
register to the high quadword of another XMM register.
MOVMSKPS Extract sign mask from four packed single precision floating-point values.
MOVSS Move scalar single precision floating-point value between XMM registers or between an
XMM register and memory.

5.5.1.2 Intel® SSE Packed Arithmetic Instructions


Intel SSE packed arithmetic instructions perform packed and scalar arithmetic operations on packed and scalar
single precision floating-point operands.
ADDPS Add packed single precision floating-point values.
ADDSS Add scalar single precision floating-point values.
SUBPS Subtract packed single precision floating-point values.
SUBSS Subtract scalar single precision floating-point values.
MULPS Multiply packed single precision floating-point values.
MULSS Multiply scalar single precision floating-point values.
DIVPS Divide packed single precision floating-point values.
DIVSS Divide scalar single precision floating-point values.
RCPPS Compute reciprocals of packed single precision floating-point values.
RCPSS Compute reciprocal of scalar single precision floating-point values.
SQRTPS Compute square roots of packed single precision floating-point values.
SQRTSS Compute square root of scalar single precision floating-point values.
RSQRTPS Compute reciprocals of square roots of packed single precision floating-point values.
RSQRTSS Compute reciprocal of square root of scalar single precision floating-point values.
MAXPS Return maximum packed single precision floating-point values.
MAXSS Return maximum scalar single precision floating-point values.
MINPS Return minimum packed single precision floating-point values.
MINSS Return minimum scalar single precision floating-point values.
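A minimal sketch of the packed forms through the intrinsics in <xmmintrin.h>, processing four single precision lanes per instruction (assuming SSE-enabled compilation, the default on x86-64; the helper name is illustrative):

    #include <xmmintrin.h>

    /* out[i] = a[i] * b[i] + a[i] for four lanes at once. */
    static void scale_add(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);                       /* MOVUPS load  */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vb), va);    /* MULPS, ADDPS */
        _mm_storeu_ps(out, vr);                            /* MOVUPS store */
    }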

5.5.1.3 Intel® SSE Comparison Instructions


Intel SSE compare instructions compare packed and scalar single precision floating-point operands.
CMPPS Compare packed single precision floating-point values.
CMPSS Compare scalar single precision floating-point values.
COMISS Perform ordered comparison of scalar single precision floating-point values and set flags in
EFLAGS register.
UCOMISS Perform unordered comparison of scalar single precision floating-point values and set flags
in EFLAGS register.

5.5.1.4 Intel® SSE Logical Instructions


Intel SSE logical instructions perform bitwise AND, AND NOT, OR, and XOR operations on packed single precision
floating-point operands.
ANDPS Perform bitwise logical AND of packed single precision floating-point values.
ANDNPS Perform bitwise logical AND NOT of packed single precision floating-point values.
ORPS Perform bitwise logical OR of packed single precision floating-point values.
XORPS Perform bitwise logical XOR of packed single precision floating-point values.

5.5.1.5 Intel® SSE Shuffle and Unpack Instructions


Intel SSE shuffle and unpack instructions shuffle or interleave single precision floating-point values in packed single
precision floating-point operands.
SHUFPS Shuffles values in packed single precision floating-point operands.
UNPCKHPS Unpacks and interleaves the two high-order values from two single precision floating-point
operands.
UNPCKLPS Unpacks and interleaves the two low-order values from two single precision floating-point
operands.

5.5.1.6 Intel® SSE Conversion Instructions


Intel SSE conversion instructions convert packed and individual doubleword integers into packed and scalar single
precision floating-point values and vice versa.
CVTPI2PS Convert packed doubleword integers to packed single precision floating-point values.
CVTSI2SS Convert doubleword integer to scalar single precision floating-point value.
CVTPS2PI Convert packed single precision floating-point values to packed doubleword integers.
CVTTPS2PI Convert with truncation packed single precision floating-point values to packed double-
word integers.
CVTSS2SI Convert a scalar single precision floating-point value to a doubleword integer.
CVTTSS2SI Convert with truncation a scalar single precision floating-point value to a scalar double-
word integer.

5.5.2 Intel® SSE MXCSR State Management Instructions


MXCSR state management instructions allow saving and restoring the state of the MXCSR control and status
register.
LDMXCSR Load MXCSR register.
STMXCSR Save MXCSR register state.

5.5.3 Intel® SSE 64-Bit SIMD Integer Instructions


These Intel SSE 64-bit SIMD integer instructions perform additional operations on packed bytes, words, or double-
words contained in MMX registers. They represent enhancements to the MMX instruction set described in Section
5.4, “MMX Instructions.”
PAVGB Compute average of packed unsigned byte integers.
PAVGW Compute average of packed unsigned word integers.
PEXTRW Extract word.
PINSRW Insert word.
PMAXUB Maximum of packed unsigned byte integers.
PMAXSW Maximum of packed signed word integers.
PMINUB Minimum of packed unsigned byte integers.
PMINSW Minimum of packed signed word integers.
PMOVMSKB Move byte mask.
PMULHUW Multiply packed unsigned integers and store high result.
PSADBW Compute sum of absolute differences.
PSHUFW Shuffle packed integer word in MMX register.

5.5.4 Intel® SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
The cacheability control instructions provide control over the caching of non-temporal data when storing data from the MMX and XMM registers to memory. The PREFETCHh instruction allows data to be prefetched to a selected cache level. The SFENCE instruction controls instruction ordering on store operations.
MASKMOVQ Non-temporal store of selected bytes from an MMX register into memory.
MOVNTQ Non-temporal store of quadword from an MMX register into memory.
MOVNTPS Non-temporal store of four packed single precision floating-point values from an XMM
register into memory.
PREFETCHh Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy.
SFENCE Serializes store operations.
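A brief sketch pairing MOVNTPS with SFENCE through intrinsics; the destination must be 16-byte aligned for MOVNTPS, and SFENCE orders the streaming stores ahead of later stores (the helper name is illustrative):

    #include <xmmintrin.h>
    #include <stddef.h>

    /* Fill 'quads' groups of four floats with non-temporal stores
       that bypass the cache hierarchy, then fence. */
    static void stream_fill(float *dst, __m128 value, size_t quads)
    {
        for (size_t i = 0; i < quads; i++)
            _mm_stream_ps(dst + 4 * i, value);   /* MOVNTPS */
        _mm_sfence();                            /* SFENCE  */
    }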

5.6 INTEL® SSE2 INSTRUCTIONS


Intel SSE2 extensions represent an extension of the SIMD execution model introduced with MMX technology and
the Intel SSE extensions. Intel SSE2 instructions operate on packed double precision floating-point operands and
on packed byte, word, doubleword, and quadword operands located in the XMM registers. For more detail on these
instructions, see Chapter 11, “Programming with Intel® Streaming SIMD Extensions 2 (Intel® SSE2).”
Intel SSE2 instructions can only be executed on Intel 64 and IA-32 processors that support the Intel SSE2 exten-
sions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID
instruction in Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 2A.
These instructions are divided into four subgroups (note that the first subgroup is further divided into subordinate
subgroups):
• Packed and scalar double precision floating-point instructions.
• Packed single precision floating-point conversion instructions.
• 128-bit SIMD integer instructions.
• Cacheability-control and instruction ordering instructions.
The following sections give an overview of each subgroup.

5.6.1 Intel® SSE2 Packed and Scalar Double Precision Floating-Point Instructions
Intel SSE2 packed and scalar double precision floating-point instructions are divided into the following subordinate
subgroups: data movement, arithmetic, comparison, conversion, logical, and shuffle operations on double preci-
sion floating-point operands. These are introduced in the sections that follow.

5.6.1.1 Intel® SSE2 Data Movement Instructions


Intel SSE2 data movement instructions move double precision floating-point data between XMM registers and
between XMM registers and memory.
MOVAPD Move two aligned packed double precision floating-point values between XMM registers or
between an XMM register and memory.
MOVUPD Move two unaligned packed double precision floating-point values between XMM registers
or between an XMM register and memory.
MOVHPD Move high packed double precision floating-point value to and from the high quadword of
an XMM register and memory.
MOVLPD Move low packed double precision floating-point value to and from the low quadword of an XMM register and memory.
MOVMSKPD Extract sign mask from two packed double precision floating-point values.
MOVSD Move scalar double precision floating-point value between XMM registers or between an
XMM register and memory.

5.6.1.2 Intel® SSE2 Packed Arithmetic Instructions


The arithmetic instructions perform addition, subtraction, multiply, divide, square root, and maximum/minimum
operations on packed and scalar double precision floating-point operands.
ADDPD Add packed double precision floating-point values.
ADDSD Add scalar double precision floating-point values.
SUBPD Subtract packed double precision floating-point values.
SUBSD Subtract scalar double precision floating-point values.
MULPD Multiply packed double precision floating-point values.
MULSD Multiply scalar double precision floating-point values.
DIVPD Divide packed double precision floating-point values.
DIVSD Divide scalar double precision floating-point values.
SQRTPD Compute packed square roots of packed double precision floating-point values.
SQRTSD Compute scalar square root of scalar double precision floating-point values.
MAXPD Return maximum packed double precision floating-point values.
MAXSD Return maximum scalar double precision floating-point values.
MINPD Return minimum packed double precision floating-point values.
MINSD Return minimum scalar double precision floating-point values.

5.6.1.3 Intel® SSE2 Logical Instructions


Intel SSE2 logical instructions perform AND, AND NOT, OR, and XOR operations on packed double precision
floating-point values.
ANDPD Perform bitwise logical AND of packed double precision floating-point values.
ANDNPD Perform bitwise logical AND NOT of packed double precision floating-point values.
ORPD Perform bitwise logical OR of packed double precision floating-point values.
XORPD Perform bitwise logical XOR of packed double precision floating-point values.

5.6.1.4 Intel® SSE2 Compare Instructions


Intel SSE2 compare instructions compare packed and scalar double precision floating-point values and return the
results of the comparison either to the destination operand or to the EFLAGS register.
CMPPD Compare packed double precision floating-point values.
CMPSD Compare scalar double precision floating-point values.
COMISD Perform ordered comparison of scalar double precision floating-point values and set flags
in EFLAGS register.
UCOMISD Perform unordered comparison of scalar double precision floating-point values and set
flags in EFLAGS register.
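
As a minimal sketch (not part of the manual's text, assuming the SSE2 intrinsics in <immintrin.h>), CMPPD produces a per-element mask that MOVMSKPD can collapse, while COMISD returns its comparison through EFLAGS:

#include <immintrin.h>

/* True if any element of a is less than the corresponding element of b. */
int any_less_than(__m128d a, __m128d b)
{
    __m128d m = _mm_cmplt_pd(a, b);   /* CMPPD, less-than predicate */
    return _mm_movemask_pd(m) != 0;   /* MOVMSKPD of the compare mask */
}

/* True if the low element of a is less than the low element of b;
   COMISD sets ZF/PF/CF and the intrinsic returns the result as 0/1. */
int scalar_less_than(__m128d a, __m128d b)
{
    return _mm_comilt_sd(a, b);       /* COMISD */
}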

5.6.1.5 Intel® SSE2 Shuffle and Unpack Instructions


Intel SSE2 shuffle and unpack instructions shuffle or interleave double precision floating-point values in packed
double precision floating-point operands.
SHUFPD Shuffles values in packed double precision floating-point operands.
UNPCKHPD Unpacks and interleaves the high values from two packed double precision floating-point
operands.
UNPCKLPD Unpacks and interleaves the low values from two packed double precision floating-point
operands.

5.6.1.6 Intel® SSE2 Conversion Instructions


Intel SSE2 conversion instructions convert packed and individual doubleword integers into packed and scalar
double precision floating-point values and vice versa. They also convert between packed and scalar single precision
and double precision floating-point values.
CVTPD2PI Convert packed double precision floating-point values to packed doubleword integers.
CVTTPD2PI Convert with truncation packed double precision floating-point values to packed double-
word integers.
CVTPI2PD Convert packed doubleword integers to packed double precision floating-point values.
CVTPD2DQ Convert packed double precision floating-point values to packed doubleword integers.
CVTTPD2DQ Convert with truncation packed double precision floating-point values to packed double-
word integers.
CVTDQ2PD Convert packed doubleword integers to packed double precision floating-point values.
CVTPS2PD Convert packed single precision floating-point values to packed double precision floating-
point values.
CVTPD2PS Convert packed double precision floating-point values to packed single precision floating-
point values.
CVTSS2SD Convert scalar single precision floating-point values to scalar double precision floating-
point values.

CVTSD2SS Convert scalar double precision floating-point values to scalar single precision floating-
point values.
CVTSD2SI Convert scalar double precision floating-point values to a doubleword integer.
CVTTSD2SI Convert with truncation scalar double precision floating-point values to scalar doubleword
integers.
CVTSI2SD Convert doubleword integer to scalar double precision floating-point value.
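
The difference between the rounding and truncating scalar conversions can be seen in a short sketch (not part of the manual's text; SSE2 intrinsics assumed):

#include <immintrin.h>

/* CVTSD2SI rounds according to MXCSR (round-to-nearest-even by default);
   CVTTSD2SI always truncates toward zero. */
void conversion_demo(int *rounded, int *truncated)
{
    __m128d v = _mm_set_sd(2.7);
    *rounded   = _mm_cvtsd_si32(v);   /* CVTSD2SI:  2.7 -> 3 */
    *truncated = _mm_cvttsd_si32(v);  /* CVTTSD2SI: 2.7 -> 2 */
}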

5.6.2 Intel® SSE2 Packed Single Precision Floating-Point Instructions


Intel SSE2 packed single precision floating-point instructions perform conversion operations on single precision
floating-point and integer operands. These instructions represent enhancements to the Intel SSE single precision
floating-point instructions.
CVTDQ2PS Convert packed doubleword integers to packed single precision floating-point values.
CVTPS2DQ Convert packed single precision floating-point values to packed doubleword integers.
CVTTPS2DQ Convert with truncation packed single precision floating-point values to packed double-
word integers.

5.6.3 Intel® SSE2 128-Bit SIMD Integer Instructions


Intel SSE2 SIMD integer instructions perform additional operations on packed words, doublewords, and quadwords
contained in XMM and MMX registers.
MOVDQA Move aligned double quadword.
MOVDQU Move unaligned double quadword.
MOVQ2DQ Move quadword integer from MMX to XMM registers.
MOVDQ2Q Move quadword integer from XMM to MMX registers.
PMULUDQ Multiply packed unsigned doubleword integers.
PADDQ Add packed quadword integers.
PSUBQ Subtract packed quadword integers.
PSHUFLW Shuffle packed low words.
PSHUFHW Shuffle packed high words.
PSHUFD Shuffle packed doublewords.
PSLLDQ Shift double quadword left logical.
PSRLDQ Shift double quadword right logical.
PUNPCKHQDQ Unpack high quadwords.
PUNPCKLQDQ Unpack low quadwords.

5.6.4 Intel® SSE2 Cacheability Control and Ordering Instructions


Intel SSE2 cacheability control instructions provide additional operations for caching of non-temporal data when
storing data from XMM registers to memory. LFENCE and MFENCE provide additional control over the ordering of
load and store operations.
CLFLUSH See Section 5.1.13.
LFENCE Serializes load operations.
MFENCE Serializes load and store operations.
PAUSE Improves the performance of “spin-wait loops”.
MASKMOVDQU Non-temporal store of selected bytes from an XMM register into memory.
MOVNTPD Non-temporal store of two packed double precision floating-point values from an XMM
register into memory.
MOVNTDQ Non-temporal store of double quadword from an XMM register into memory.

MOVNTI Non-temporal store of a doubleword from a general-purpose register into memory.
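
A hedged sketch of how these instructions are typically combined (not part of the manual's text; intrinsics from <immintrin.h>, helper names invented): a producer streams data past the caches with MOVNTPD and fences with MFENCE, while a consumer spin-waits with PAUSE:

#include <immintrin.h>

/* dst must be 16-byte aligned for MOVNTPD. */
void publish(double *dst, __m128d v, volatile int *ready)
{
    _mm_stream_pd(dst, v);  /* MOVNTPD: non-temporal store */
    _mm_mfence();           /* MFENCE: order the store before the flag */
    *ready = 1;
}

void consume_wait(volatile int *ready)
{
    while (!*ready)
        _mm_pause();        /* PAUSE inside the spin-wait loop */
}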

5.7 INTEL® SSE3 INSTRUCTIONS


The Intel SSE3 extensions offer 13 instructions that accelerate performance of Streaming SIMD Extensions tech-
nology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities. These instructions can be grouped
into the following categories:
• One x87 FPU instruction used in integer conversion.
• One SIMD integer instruction that addresses unaligned data loads.
• Two SIMD floating-point packed ADD/SUB instructions.
• Four SIMD floating-point horizontal ADD/SUB instructions.
• Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions.
• Two thread synchronization instructions.
Intel SSE3 instructions can only be executed on Intel 64 and IA-32 processors that support Intel SSE3 extensions.
Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruc-
tion in Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2A.
The sections that follow describe each subgroup.

5.7.1 Intel® SSE3 x87-FP Integer Conversion Instruction


FISTTP Behaves like the FISTP instruction but uses truncation, irrespective of the rounding mode
specified in the floating-point control word (FCW).

5.7.2 Intel® SSE3 Specialized 128-Bit Unaligned Data Load Instruction


LDDQU Special 128-bit unaligned load designed to avoid cache line splits.

5.7.3 Intel® SSE3 SIMD Floating-Point Packed ADD/SUB Instructions


ADDSUBPS Performs single precision addition on the second and fourth pairs of 32-bit data elements
within the operands; single precision subtraction on the first and third pairs.
ADDSUBPD Performs double precision addition on the second pair of quadwords, and double precision
subtraction on the first pair.

5.7.4 Intel® SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions


HADDPS Performs a single precision addition on contiguous data elements. The first data element of
the result is obtained by adding the first and second elements of the first operand; the
second element by adding the third and fourth elements of the first operand; the third by
adding the first and second elements of the second operand; and the fourth by adding the
third and fourth elements of the second operand.
HSUBPS Performs a single precision subtraction on contiguous data elements. The first data
element of the result is obtained by subtracting the second element of the first operand
from the first element of the first operand; the second element by subtracting the fourth
element of the first operand from the third element of the first operand; the third by
subtracting the second element of the second operand from the first element of the second
operand; and the fourth by subtracting the fourth element of the second operand from the
third element of the second operand.

HADDPD Performs a double precision addition on contiguous data elements. The first data element
of the result is obtained by adding the first and second elements of the first operand; the
second element by adding the first and second elements of the second operand.
HSUBPD Performs a double precision subtraction on contiguous data elements. The first data
element of the result is obtained by subtracting the second element of the first operand
from the first element of the first operand; the second element by subtracting the second
element of the second operand from the first element of the second operand.
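
As an example (not part of the manual's text; SSE3 intrinsics from <immintrin.h> assumed), two HADDPS operations reduce four packed floats to a single sum:

#include <immintrin.h>

float sum4(__m128 v)
{
    __m128 t = _mm_hadd_ps(v, v);  /* [v0+v1, v2+v3, v0+v1, v2+v3] */
    t = _mm_hadd_ps(t, t);         /* every lane now holds the total */
    return _mm_cvtss_f32(t);
}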

5.7.5 Intel® SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions


MOVSHDUP Loads/moves 128 bits; duplicating the second and fourth 32-bit data elements.
MOVSLDUP Loads/moves 128 bits; duplicating the first and third 32-bit data elements.
MOVDDUP Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in
both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from
the source.
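
ADDSUBPS and the duplicate-move instructions were designed to work together on interleaved complex data. A minimal sketch (not part of the manual's text; intrinsics assumed, function name invented) multiplies two pairs of complex numbers stored as [re0, im0, re1, im1]:

#include <immintrin.h>

__m128 complex_mul(__m128 x, __m128 y)
{
    __m128 re = _mm_moveldup_ps(x);  /* MOVSLDUP: [re0, re0, re1, re1] */
    __m128 im = _mm_movehdup_ps(x);  /* MOVSHDUP: [im0, im0, im1, im1] */
    __m128 yswap = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1));
    /* ADDSUBPS supplies the (re*c - im*d, re*d + im*c) sign pattern. */
    return _mm_addsub_ps(_mm_mul_ps(re, y), _mm_mul_ps(im, yswap));
}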

5.7.6 Intel® SSE3 Agent Synchronization Instructions


MONITOR Sets up an address range used to monitor write-back stores.
MWAIT Enables a logical processor to enter into an optimized state while waiting for a write-back
store to the address range set up by the MONITOR instruction.

5.8 SUPPLEMENTAL STREAMING SIMD EXTENSIONS 3 (SSSE3) INSTRUCTIONS


SSSE3 provides 32 instructions (represented by 14 mnemonics) to accelerate computations on packed integers.
These include:
• Twelve instructions that perform horizontal addition or subtraction operations.
• Six instructions that evaluate absolute values.
• Two instructions that perform multiply and add operations and speed up the evaluation of dot products.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• Six instructions that negate packed integers in the destination operand if the sign of the corresponding
element in the source operand is less than zero.
• Two instructions that align data from the composite of two operands.
SSSE3 instructions can only be executed on Intel 64 and IA-32 processors that support SSSE3 extensions. Support
for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in
Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 2A.
The sections that follow describe each subgroup.

5.8.1 Horizontal Addition/Subtraction


PHADDW Adds two adjacent, signed 16-bit integers horizontally from the source and destination
operands and packs the signed 16-bit results to the destination operand.
PHADDSW Adds two adjacent, signed 16-bit integers horizontally from the source and destination
operands and packs the signed, saturated 16-bit results to the destination operand.
PHADDD Adds two adjacent, signed 32-bit integers horizontally from the source and destination
operands and packs the signed 32-bit results to the destination operand.
PHSUBW Performs horizontal subtraction on each adjacent pair of 16-bit signed integers by
subtracting the most significant word from the least significant word of each pair in the
source and destination operands. The signed 16-bit results are packed and written to the
destination operand.
PHSUBSW Performs horizontal subtraction on each adjacent pair of 16-bit signed integers by
subtracting the most significant word from the least significant word of each pair in the
source and destination operands. The signed, saturated 16-bit results are packed and
written to the destination operand.
PHSUBD Performs horizontal subtraction on each adjacent pair of 32-bit signed integers by
subtracting the most significant doubleword from the least significant double word of each
pair in the source and destination operands. The signed 32-bit results are packed and
written to the destination operand.

5.8.2 Packed Absolute Values


PABSB Computes the absolute value of each signed byte data element.
PABSW Computes the absolute value of each signed 16-bit data element.
PABSD Computes the absolute value of each signed 32-bit data element.

5.8.3 Multiply and Add Packed Signed and Unsigned Bytes


PMADDUBSW Multiplies each unsigned byte value with the corresponding signed byte value to produce
an intermediate, 16-bit signed integer. Each adjacent pair of 16-bit signed values is
added horizontally. The signed, saturated 16-bit results are packed to the destination
operand.

5.8.4 Packed Multiply High with Round and Scale


PMULHRSW Multiplies vertically each signed 16-bit integer from the destination operand with the corre-
sponding signed 16-bit integer of the source operand, producing intermediate, signed 32-
bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits.
Rounding is always performed by adding 1 to the least significant bit of the 18-bit interme-
diate result. The final result is obtained by selecting the 16 bits immediately to the right of
the most significant bit of each 18-bit intermediate result and packed to the destination
operand.
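
The rounding behavior is easier to see as scalar arithmetic. Below is a one-lane reference next to the intrinsic (a sketch, not part of the manual's text; SSSE3 intrinsics assumed):

#include <immintrin.h>

/* One PMULHRSW lane: shift the 32-bit product right by 14, add 1 to
   round, then drop the last bit to keep 16 result bits. */
short mulhrs_reference(short a, short b)
{
    int t = ((int)a * (int)b) >> 14;
    return (short)((t + 1) >> 1);
}

__m128i mulhrs(__m128i a, __m128i b)
{
    return _mm_mulhrs_epi16(a, b);   /* PMULHRSW */
}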

5.8.5 Packed Shuffle Bytes


PSHUFB Permutes each byte in place, according to a shuffle control mask. The least significant
three or four bits of each shuffle control byte of the control mask form the shuffle index.
The shuffle mask is unaffected. If the most significant bit (bit 7) of a shuffle control byte is
set, the constant zero is written in the result byte.
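
For example (a sketch, not part of the manual's text; SSSE3 intrinsics assumed), a descending control mask reverses the sixteen bytes of a register:

#include <immintrin.h>

__m128i reverse_bytes(__m128i x)
{
    /* Each control byte selects a source byte; a control byte with
       bit 7 set would write zero instead. */
    const __m128i ctl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    return _mm_shuffle_epi8(x, ctl);  /* PSHUFB */
}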

5.8.6 Packed Sign


PSIGNB/W/D Negates each signed integer element of the destination operand if the sign of the corre-
sponding data element in the source operand is less than zero.

5.8.7 Packed Align Right


PALIGNR Source operand is appended after the destination operand forming an intermediate value
of twice the width of an operand. The result is extracted from the intermediate value into
the destination operand by selecting the 128-bit or 64-bit value that is right-aligned to
the byte offset specified by the immediate value.
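
A common use is forming a sliding window over two adjacent loads, sketched below (not part of the manual's text; SSSE3 intrinsics assumed):

#include <immintrin.h>

/* Extracts 16 bytes starting 4 bytes into the 32-byte concatenation
   hi:lo (PALIGNR with an immediate byte offset of 4). */
__m128i window_at_4(__m128i hi, __m128i lo)
{
    return _mm_alignr_epi8(hi, lo, 4);
}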

5.9 INTEL® SSE4 INSTRUCTIONS


Intel Streaming SIMD Extensions 4 (Intel SSE4) introduces 54 new instructions. Forty-seven of these instructions
are referred to as Intel SSE4.1 in this document, and the remaining seven are referred to as Intel SSE4.2.
Intel SSE4.1 is targeted to improve the performance of media, imaging, and 3D workloads. Intel SSE4.1 adds
instructions that improve compiler vectorization and significantly increase support for packed dword computation.
The technology also provides a hint that can improve memory throughput when reading from uncacheable WC
memory type.
The 47 Intel SSE4.1 instructions include:
• Two instructions perform packed dword multiplies.
• Two instructions perform floating-point dot products with input/output selects.
• One instruction performs a load with a streaming hint.
• Six instructions simplify packed blending.
• Eight instructions expand support for packed integer MIN/MAX.
• Four instructions support floating-point round with selectable rounding mode and precision exception override.
• Seven instructions improve data insertion into and extraction from XMM registers.
• Twelve instructions improve packed integer format conversions (sign and zero extensions).
• One instruction improves SAD (sum absolute difference) generation for small block sizes.
• One instruction aids horizontal searching operations.
• One instruction improves masked comparisons.
• One instruction adds qword packed equality comparisons.
• One instruction adds dword packing with unsigned saturation.
The Intel SSE4.2 instructions operating on XMM registers include:
• String and text processing that can take advantage of single-instruction multiple-data programming
techniques.
• A SIMD integer instruction that enhances the capability of the 128-bit integer SIMD capability in SSE4.1.

5.10 INTEL® SSE4.1 INSTRUCTIONS


Intel SSE4.1 instructions can use an XMM register as a source or destination. Programming Intel SSE4.1 is similar
to programming 128-bit Integer SIMD and floating-point SIMD instructions in Intel SSE/SSE2/SSE3/SSSE3. Intel
SSE4.1 does not provide any 64-bit integer SIMD instructions operating on MMX registers. The sections that follow
describe each subgroup.

5.10.1 Dword Multiply Instructions


PMULLD Returns four lower 32-bits of the 64-bit results of signed 32-bit integer multiplies.
PMULDQ Returns two 64-bit signed results of signed 32-bit integer multiplies.

5.10.2 Floating-Point Dot Product Instructions


DPPD Perform double precision dot product for up to 2 elements and broadcast.
DPPS Perform single precision dot products for up to 4 elements and broadcast.

5.10.3 Streaming Load Hint Instruction


MOVNTDQA Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-
byte region (a streaming line) to be fetched and held in a small set of temporary buffers
(“streaming load buffers”). Subsequent streaming loads to other aligned 16-byte items in
the same streaming line may be supplied from the streaming load buffer and can improve
throughput.
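
A sketch of the intended access pattern (not part of the manual's text; SSE4.1 intrinsics assumed, with src pointing to 64-byte-aligned, ideally WC, memory):

#include <immintrin.h>

/* Four MOVNTDQA loads cover one 64-byte streaming line. */
void stream_copy64(__m128i *dst, __m128i *src)
{
    for (int i = 0; i < 4; ++i)
        dst[i] = _mm_stream_load_si128(src + i);  /* MOVNTDQA */
}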

5.10.4 Packed Blending Instructions


BLENDPD Conditionally copies specified double precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an immediate byte
control.
BLENDPS Conditionally copies specified single precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an immediate byte
control.
BLENDVPD Conditionally copies specified double precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an implied mask.
BLENDVPS Conditionally copies specified single precision floating-point data elements in the source
operand to the corresponding data elements in the destination, using an implied mask.
PBLENDVB Conditionally copies specified byte elements in the source operand to the corresponding
elements in the destination, using an implied mask.
PBLENDW Conditionally copies specified word elements in the source operand to the corresponding
elements in the destination, using an immediate byte control.
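
For example (a sketch, not part of the manual's text; SSE4.1 intrinsics assumed), a CMPPD result can drive BLENDVPD to form a branchless per-element maximum:

#include <immintrin.h>

__m128d max_pd_via_blend(__m128d a, __m128d b)
{
    __m128d lt = _mm_cmplt_pd(a, b);   /* all-ones where a < b */
    return _mm_blendv_pd(a, b, lt);    /* BLENDVPD: take b where mask set */
}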

5.10.5 Packed Integer MIN/MAX Instructions


PMINUW Return packed minimum of unsigned word integers.
PMINUD Return packed minimum of unsigned dword integers.
PMINSB Return packed minimum of signed byte integers.
PMINSD Return packed minimum of signed dword integers.
PMAXUW Return packed maximum of unsigned word integers.
PMAXUD Return packed maximum of unsigned dword integers.
PMAXSB Return packed maximum of signed byte integers.
PMAXSD Return packed maximum of signed dword integers.

5.10.6 Floating-Point Round Instructions with Selectable Rounding Mode


ROUNDPS Round packed single precision floating-point values into integer values and return rounded
floating-point values.
ROUNDPD Round packed double precision floating-point values into integer values and return
rounded floating-point values.
ROUNDSS Round the low packed single precision floating-point value into an integer value and return
a rounded floating-point value.
ROUNDSD Round the low packed double precision floating-point value into an integer value and return
a rounded floating-point value.
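
A brief sketch of the selectable rounding controls (not part of the manual's text; SSE4.1 intrinsics assumed):

#include <immintrin.h>

__m128d round_nearest(__m128d v)
{
    /* ROUNDPD; _MM_FROUND_NO_EXC suppresses the precision exception. */
    return _mm_round_pd(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}

__m128d round_down(__m128d v)
{
    return _mm_floor_pd(v);   /* ROUNDPD toward negative infinity */
}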

5.10.7 Insertion and Extractions from XMM Registers


EXTRACTPS Extracts a single precision floating-point value from a specified offset in an XMM register
and stores the result to memory or a general-purpose register.
INSERTPS Inserts a single precision floating-point value from either a 32-bit memory location or
selected from a specified offset in an XMM register to a specified offset in the destination
XMM register. In addition, INSERTPS allows zeroing out selected data elements in the desti-
nation, using a mask.

PINSRB Insert a byte value from a register or memory into an XMM register.
PINSRD Insert a dword value from 32-bit register or memory into an XMM register.
PINSRQ Insert a qword value from 64-bit register or memory into an XMM register.
PEXTRB Extract a byte from an XMM register and insert the value into a general-purpose register or
memory.
PEXTRW Extract a word from an XMM register and insert the value into a general-purpose register
or memory.
PEXTRD Extract a dword from an XMM register and insert the value into a general-purpose register
or memory.
PEXTRQ Extract a qword from an XMM register and insert the value into a general-purpose register
or memory.

5.10.8 Packed Integer Format Conversions


PMOVSXBW Sign extend the lower 8-bit integer of each packed word element into packed signed word
integers.
PMOVZXBW Zero extend the lower 8-bit integer of each packed word element into packed signed word
integers.
PMOVSXBD Sign extend the lower 8-bit integer of each packed dword element into packed signed
dword integers.
PMOVZXBD Zero extend the lower 8-bit integer of each packed dword element into packed signed
dword integers.
PMOVSXWD Sign extend the lower 16-bit integer of each packed dword element into packed signed
dword integers.
PMOVZXWD Zero extend the lower 16-bit integer of each packed dword element into packed signed
dword integers.
PMOVSXBQ Sign extend the lower 8-bit integer of each packed qword element into packed signed
qword integers.
PMOVZXBQ Zero extend the lower 8-bit integer of each packed qword element into packed signed
qword integers.
PMOVSXWQ Sign extend the lower 16-bit integer of each packed qword element into packed signed
qword integers.
PMOVZXWQ Zero extend the lower 16-bit integer of each packed qword element into packed signed
qword integers.
PMOVSXDQ Sign extend the lower 32-bit integer of each packed qword element into packed signed
qword integers.
PMOVZXDQ Zero extend the lower 32-bit integer of each packed qword element into packed signed
qword integers.

5.10.9 Improved Sums of Absolute Differences (SAD) for 4-Byte Blocks


MPSADBW Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word
integers.

5.10.10 Horizontal Search


PHMINPOSUW Finds the value and location of the minimum unsigned word from one of 8 horizontally
packed unsigned words. The resulting value and location (offset within the source) are
packed into the low dword of the destination XMM register.
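
For example (not part of the manual's text; SSE4.1 intrinsics assumed), both the minimum and its position come back in one result register:

#include <immintrin.h>

void min_u16(__m128i v, unsigned *value, unsigned *index)
{
    __m128i r = _mm_minpos_epu16(v);             /* PHMINPOSUW */
    *value = (unsigned)_mm_extract_epi16(r, 0);  /* word 0: minimum */
    *index = (unsigned)_mm_extract_epi16(r, 1);  /* word 1: its offset */
}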

5.10.11 Packed Test


PTEST Performs a bitwise AND of the destination operand with the source mask and sets the
ZF flag if the result is all zeros. The CF flag (always zero for TEST) is set if the source
mask ANDed with the inverted destination is all zeros.
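
The two flags are exposed directly by the PTEST intrinsics, as this sketch shows (not part of the manual's text):

#include <immintrin.h>

/* ZF: 1 if (dest AND mask) is all zeros. */
int mask_bits_all_clear(__m128i dest, __m128i mask)
{
    return _mm_testz_si128(dest, mask);
}

/* CF: 1 if (mask AND NOT dest) is all zeros, i.e., every mask bit is
   also set in dest. */
int mask_bits_all_set(__m128i dest, __m128i mask)
{
    return _mm_testc_si128(dest, mask);
}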

5.10.12 Packed Qword Equality Comparisons


PCMPEQQ 128-bit packed qword equality test.

5.10.13 Dword Packing With Unsigned Saturation


PACKUSDW Packs dword to word with unsigned saturation.

5.11 INTEL® SSE4.2 INSTRUCTION SET


Five of the Intel SSE4.2 instructions operate on XMM registers as a source or destination. These include four
text/string processing instructions and one packed quadword compare SIMD instruction. Programming these five
Intel SSE4.2 instructions is similar to programming 128-bit Integer SIMD in Intel SSE2/SSSE3. Intel SSE4.2 does
not provide any 64-bit integer SIMD instructions.
CRC32 operates on general-purpose registers and is summarized in Section 5.1.6. The sections that follow summa-
rize each subgroup.

5.11.1 String and Text Processing Instructions


PCMPESTRI Packed compare explicit-length strings, return index in ECX/RCX.
PCMPESTRM Packed compare explicit-length strings, return mask in XMM0.
PCMPISTRI Packed compare implicit-length strings, return index in ECX/RCX.
PCMPISTRM Packed compare implicit-length strings, return mask in XMM0.
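
A minimal sketch of the implicit-length form (not part of the manual's text; SSE4.2 intrinsics assumed, both operands NUL-terminated within their 16 bytes):

#include <immintrin.h>

/* PCMPISTRI: returns the index of the first byte in chunk matching
   any byte in set, or 16 if there is no match. */
int find_any_of(__m128i set, __m128i chunk)
{
    return _mm_cmpistri(set, chunk,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
}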

5.11.2 Packed Comparison SIMD Integer Instruction


PCMPGTQ Performs logical compare of greater-than on packed integer quadwords.

5.12 INTEL® AES-NI AND PCLMULQDQ


Six Intel® AES-NI instructions operate on XMM registers to provide accelerated primitives for block encryp-
tion/decryption using Advanced Encryption Standard (FIPS-197). The PCLMULQDQ instruction performs carry-less
multiplication for two binary numbers up to 64-bit wide.
AESDEC Perform an AES decryption round using a 128-bit state and a round key.
AESDECLAST Perform the last AES decryption round using a 128-bit state and a round key.
AESENC Perform an AES encryption round using a 128-bit state and a round key.
AESENCLAST Perform the last AES encryption round using a 128-bit state and a round key.
AESIMC Perform an inverse mix column transformation primitive.
AESKEYGENASSIST Assist the creation of round keys with a key expansion schedule.
PCLMULQDQ Perform carry-less multiplication of two 64-bit numbers.
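
A sketch of one AES-128 block encryption built from these primitives (not part of the manual's text; AES-NI intrinsics assumed, and rk is an already-expanded 11-entry key schedule, with the AESKEYGENASSIST-based expansion omitted):

#include <immintrin.h>

__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);         /* initial whitening */
    for (int i = 1; i < 10; ++i)
        block = _mm_aesenc_si128(block, rk[i]);  /* AESENC, rounds 1-9 */
    return _mm_aesenclast_si128(block, rk[10]);  /* AESENCLAST */
}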

5.13 INTEL® ADVANCED VECTOR EXTENSIONS (INTEL® AVX)


Intel® Advanced Vector Extensions (Intel® AVX) promotes legacy 128-bit SIMD instruction sets that operate on the
XMM register set to use a “vector extension” (VEX) prefix and to operate on 256-bit vector registers (YMM). Almost
all prior generations of 128-bit SIMD instructions that operate on XMM registers (but not on MMX registers) are
promoted to support three-operand syntax with VEX-128 encoding.
VEX-prefix encoded Intel AVX instructions support 256-bit and 128-bit floating-point operations by extending the
legacy 128-bit SIMD floating-point instructions to support three-operand syntax.
Additional functional enhancements are also provided with VEX-encoded Intel AVX instructions.
The list of Intel AVX instructions is included in the following tables:
• Table 14-2 lists 256-bit and 128-bit floating-point arithmetic instructions promoted from legacy 128-bit SIMD
instruction sets.
• Table 14-3 lists 256-bit and 128-bit data movement and processing instructions promoted from legacy 128-bit
SIMD instruction sets.
• Table 14-4 lists functional enhancements of 256-bit Intel AVX instructions not available from legacy 128-bit
SIMD instruction sets.
• Table 14-5 lists 128-bit integer and floating-point instructions promoted from legacy 128-bit SIMD instruction
sets.
• Table 14-6 lists functional enhancements of 128-bit Intel AVX instructions not available from legacy 128-bit
SIMD instruction sets.
• Table 14-7 lists 128-bit data movement and processing instructions promoted from legacy instruction sets.

5.14 16-BIT FLOATING-POINT CONVERSION


Conversions between single precision floating-point (32-bit) and half precision floating-point (16-bit) data are
provided by the VCVTPS2PH and VCVTPH2PS instructions, introduced beginning with the third generation of Intel
Core processors based on Ivy Bridge microarchitecture:
VCVTPH2PS Convert eight/four data elements containing 16-bit floating-point data into eight/four
single precision floating-point data.
VCVTPS2PH Convert eight/four data elements containing single precision floating-point data into
eight/four 16-bit floating-point data.
Starting with the 4th generation Intel Xeon Scalable Processor Family based on Sapphire Rapids microarchitecture,
Intel® AVX-512 instruction set architecture for FP16 was added, supporting a wide range of general-purpose
numeric operations for 16-bit half precision floating-point values (binary16 in IEEE Standard 754-2019 for
Floating-Point Arithmetic, aka half precision or FP16). Section 5.19 includes a list of these instructions.
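
A round-trip sketch using the corresponding intrinsics (not part of the manual's text; F16C support assumed, e.g., -mf16c on GCC/Clang):

#include <immintrin.h>

/* Four floats convert to four FP16 values held in the low 64 bits of
   an XMM register (VCVTPS2PH), then back (VCVTPH2PS). */
__m128 roundtrip_fp16(__m128 v)
{
    __m128i h = _mm_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT);
    return _mm_cvtph_ps(h);
}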

5.15 FUSED-MULTIPLY-ADD (FMA)


FMA extensions enhance Intel AVX with high-throughput arithmetic capabilities covering fused multiply-add,
fused multiply-subtract, fused multiply-add/subtract interleave, and sign-reversed multiply on fused multiply-add
and multiply-subtract. FMA extensions provide 36 256-bit floating-point instructions to perform computation on
256-bit vectors and additional 128-bit and scalar FMA instructions.
• Table 14-15 lists FMA instruction sets.
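
A dot-product loop is a typical FMA use case; the sketch below (not part of the manual's text; FMA intrinsics assumed, -mfma, and n assumed to be a multiple of four) folds each multiply and add into one rounded operation:

#include <immintrin.h>

double dot(const double *a, const double *b, int n)
{
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),   /* fused a*b+acc */
                              _mm256_loadu_pd(b + i), acc);
    double t[4];
    _mm256_storeu_pd(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}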

5.16 INTEL® ADVANCED VECTOR EXTENSIONS 2 (INTEL® AVX2)


Intel® AVX2 extends Intel AVX by promoting most of the 128-bit SIMD integer instructions with 256-bit numeric
processing capabilities. Intel AVX2 instructions follow the same programming model as AVX instructions.

In addition, AVX2 provides enhanced functionality for broadcast/permute operations on data elements, vector
shift instructions with variable-shift count per data element, and instructions to fetch non-contiguous data
elements from memory.
• Table 14-18 lists promoted vector integer instructions in AVX2.
• Table 14-19 lists new instructions in AVX2 that complements AVX.
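
As a sketch of the non-contiguous fetch capability (not part of the manual's text; AVX2 intrinsics assumed, -mavx2), one gather replaces eight scalar table lookups:

#include <immintrin.h>

/* VPGATHERDD: idx holds eight 32-bit element indices into table;
   the scale of 4 converts indices to byte offsets. */
__m256i gather8(const int *table, __m256i idx)
{
    return _mm256_i32gather_epi32(table, idx, 4);
}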

5.17 INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS (INTEL® TSX)


XABORT Abort an RTM transaction execution.
XACQUIRE Prefix hint to the beginning of an HLE transaction region.
XRELEASE Prefix hint to the end of an HLE transaction region.
XBEGIN Transaction begin of an RTM transaction region.
XEND Transaction end of an RTM transaction region.
XTEST Test if executing in a transactional region.
XRESLDTRK Resume tracking load addresses.
XSUSLDTRK Suspend tracking load addresses.
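
A minimal RTM sketch (not part of the manual's text; compile with -mrtm and verify CPUID.(EAX=07H,ECX=0):EBX.RTM[bit 11] at run time before use):

#include <immintrin.h>

void transactional_increment(long *counter)
{
    unsigned status = _xbegin();          /* XBEGIN */
    if (status == _XBEGIN_STARTED) {
        ++*counter;                       /* transactional region */
        _xend();                          /* XEND: attempt to commit */
    } else {
        /* Aborted: fall back to a conventional lock (not shown). */
    }
}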

5.18 INTEL® SHA EXTENSIONS


Intel® SHA extensions provide a set of instructions that target the acceleration of the Secure Hash Algorithm
(SHA), specifically the SHA-1 and SHA-256 variants.
SHA1MSG1 Perform an intermediate calculation for the next four SHA1 message dwords from the
previous message dwords.
SHA1MSG2 Perform the final calculation for the next four SHA1 message dwords from the intermediate
message dwords.
SHA1NEXTE Calculate SHA1 state E after four rounds.
SHA1RNDS4 Perform four rounds of SHA1 operations.
SHA256MSG1 Perform an intermediate calculation for the next four SHA256 message dwords.
SHA256MSG2 Perform the final calculation for the next four SHA256 message dwords.
SHA256RNDS2 Perform two rounds of SHA256 operations.

5.19 INTEL® ADVANCED VECTOR EXTENSIONS 512 (INTEL® AVX-512)


The Intel® AVX-512 family comprises a collection of 512-bit SIMD instruction sets to accelerate a diverse range of
applications. Intel AVX-512 instructions provide a wide range of functionality that supports programming with
512-bit, 256-bit, and 128-bit vector registers, plus support for opmask registers and instructions operating on
opmask registers.
The collection of 512-bit SIMD instruction sets in Intel AVX-512 include new functionality not available in Intel AVX
and Intel AVX2, and promoted instructions similar to equivalent ones in Intel AVX/Intel AVX2 but with enhance-
ment provided by opmask registers not available to VEX-encoded Intel AVX/Intel AVX2. Some instruction
mnemonics in Intel AVX/Intel AVX2 that are promoted into Intel AVX-512 can be replaced by new instruction
mnemonics that are available only with EVEX encoding, e.g., VBROADCASTF128 into VBROADCASTF32X4. Details
of EVEX instruction encoding are discussed in Section 2.7, “Intel® AVX-512 Encoding,” of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 2A. Starting with the 4th generation Intel Xeon Scalable
Processor Family, an Intel AVX-512 instruction set architecture for FP16 was added, supporting a wide range of
general-purpose numeric operations for 16-bit half precision floating-point values, which complements the existing
32-bit and 64-bit floating-point instructions already available in the Intel Xeon processor-based products.
512-bit instruction mnemonics in AVX-512F instructions that are not Intel AVX or AVX2 promotions include:
VALIGND/Q Perform dword/qword alignment of two concatenated source vectors.
VBLENDMPD/PS Replace the VBLENDVPD/PS instructions (using opmask as select control).

VCOMPRESSPD/PS Compress packed DP or SP elements of a vector.
VCVT(T)PD2UDQ Convert packed DP FP elements of a vector to packed unsigned 32-bit integers.
VCVT(T)PS2UDQ Convert packed SP FP elements of a vector to packed unsigned 32-bit integers.
VCVTQQ2PD/PS Convert packed signed 64-bit integers to packed DP/SP FP elements.
VCVT(T)SD2USI Convert the low DP FP element of a vector to an unsigned integer.
VCVT(T)SS2USI Convert the low SP FP element of a vector to an unsigned integer.
VCVTUDQ2PD/PS Convert packed unsigned 32-bit integers to packed DP/SP FP elements.
VCVTUSI2SD/SS Convert an unsigned integer to the low DP/SP FP element and merge to a vector.
VEXPANDPD/PS Expand packed DP or SP elements of a vector.
VEXTRACTF32X4/64X4 Extract a vector from a full-length vector with 32/64-bit granular update.
VEXTRACTI32X4/64X4 Extract a vector from a full-length vector with 32/64-bit granular update.
VFIXUPIMMPD/PS Perform fix-up to special values in DP/SP FP vectors.
VFIXUPIMMSD/SS Perform fix-up to special values of the low DP/SP FP element.
VGETEXPPD/PS Convert the exponent of DP/SP FP elements of a vector into FP values.
VGETEXPSD/SS Convert the exponent of the low DP/SP FP element in a vector into FP value.
VGETMANTPD/PS Convert the mantissa of DP/SP FP elements of a vector into FP values.
VGETMANTSD/SS Convert the mantissa of the low DP/SP FP element of a vector into FP value.
VINSERTF32X4/64X4 Insert a 128/256-bit vector into a full-length vector with 32/64-bit granular update.
VMOVDQA32/64 VMOVDQA with 32/64-bit granular conditional update.
VMOVDQU32/64 VMOVDQU with 32/64-bit granular conditional update.
VPBLENDMD/Q Blend dword/qword elements using opmask as select control.
VPBROADCASTD/Q Broadcast from general-purpose register to vector register.
VPCMPD/UD Compare packed signed/unsigned dwords using specified primitive.
VPCMPQ/UQ Compare packed signed/unsigned quadwords using specified primitive.
VPCOMPRESSQ/D Compress packed 64/32-bit elements of a vector.
VPERMI2D/Q Full permute of two tables of dword/qword elements overwriting the index vector.
VPERMI2PD/PS Full permute of two tables of DP/SP elements overwriting the index vector.
VPERMT2D/Q Full permute of two tables of dword/qword elements overwriting one source table.
VPERMT2PD/PS Full permute of two tables of DP/SP elements overwriting one source table.
VPEXPANDD/Q Expand packed dword/qword elements of a vector.
VPMAXSQ Compute maximum of packed signed 64-bit integer elements.
VPMAXUD/UQ Compute maximum of packed unsigned 32/64-bit integer elements.
VPMINSQ Compute minimum of packed signed 64-bit integer elements.
VPMINUD/UQ Compute minimum of packed unsigned 32/64-bit integer elements.
VPMOV(S|US)QB Down convert qword elements in a vector to byte elements using truncation (saturation |
unsigned saturation).
VPMOV(S|US)QW Down convert qword elements in a vector to word elements using truncation (saturation |
unsigned saturation).
VPMOV(S|US)QD Down convert qword elements in a vector to dword elements using truncation (saturation
| unsigned saturation).
VPMOV(S|US)DB Down convert dword elements in a vector to byte elements using truncation (saturation |
unsigned saturation).
VPMOV(S|US)DW Down convert dword elements in a vector to word elements using truncation (saturation |
unsigned saturation).
VPROLD/Q Rotate dword/qword element left by a constant shift count with conditional update.
VPROLVD/Q Rotate dword/qword element left by shift counts specified in a vector with conditional
update.
VPRORD/Q Rotate dword/qword element right by a constant shift count with conditional update.

VPRORVD/Q Rotate dword/qword element right by shift counts specified in a vector with conditional
update.
VPSCATTERDD/DQ Scatter dword/qword elements in a vector to memory using dword indices.
VPSCATTERQD/QQ Scatter dword/qword elements in a vector to memory using qword indices.
VPSRAQ Shift qwords right by a constant shift count and shifting in sign bits.
VPSRAVQ Shift qwords right by shift counts in a vector and shifting in sign bits.
VPTESTNMD/Q Perform bitwise NAND of dword/qword elements of two vectors and write results to
opmask.
VPTERNLOGD/Q Perform bitwise ternary logic operation of three vectors with 32/64 bit granular conditional
update.
VPTESTMD/Q Perform bitwise AND of dword/qword elements of two vectors and write results to opmask.
VRCP14PD/PS Compute approximate reciprocals of packed DP/SP FP elements of a vector.
VRCP14SD/SS Compute the approximate reciprocal of the low DP/SP FP element of a vector.
VRNDSCALEPD/PS Round packed DP/SP FP elements of a vector to specified number of fraction bits.
VRNDSCALESD/SS Round the low DP/SP FP element of a vector to specified number of fraction bits.
VRSQRT14PD/PS Compute approximate reciprocals of square roots of packed DP/SP FP elements of a vector.
VRSQRT14SD/SS Compute the approximate reciprocal of square root of the low DP/SP FP element of a
vector.
VSCALEFPD/PS Multiply packed DP/SP FP elements of a vector by powers of two with exponents specified
in a second vector.
VSCALEFSD/SS Multiply the low DP/SP FP element of a vector by powers of two with exponent specified in
the corresponding element of a second vector.
VSCATTERDPS/DPD Scatter SP/DP FP elements in a vector to memory using dword indices.
VSCATTERQPS/QPD Scatter SP/DP FP elements in a vector to memory using qword indices.
VSHUFF32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
VSHUFI32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
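
A short sketch of the opmask-driven merge semantics that distinguish these instructions (not part of the manual's text; AVX-512F intrinsics assumed, -mavx512f):

#include <immintrin.h>

/* Lanes where a > 0 receive a + b; all other lanes keep src. */
__m512d add_where_positive(__m512d src, __m512d a, __m512d b)
{
    __mmask8 k = _mm512_cmp_pd_mask(a, _mm512_setzero_pd(), _CMP_GT_OQ);
    return _mm512_mask_add_pd(src, k, a, b);   /* merge-masked VADDPD */
}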

512-bit instruction mnemonics in AVX-512DQ that are not Intel AVX or AVX2 promotions include:
VCVT(T)PD2QQ Convert packed DP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PD2UQQ Convert packed DP FP elements of a vector to packed unsigned 64-bit integers.
VCVT(T)PS2QQ Convert packed SP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PS2UQQ Convert packed SP FP elements of a vector to packed unsigned 64-bit integers.
VCVTUQQ2PD/PS Convert packed unsigned 64-bit integers to packed DP/SP FP elements.
VEXTRACTF64X2 Extract a vector from a full-length vector with 64-bit granular update.
VEXTRACTI64X2 Extract a vector from a full-length vector with 64-bit granular update.
VFPCLASSPD/PS Test packed DP/SP FP elements in a vector by numeric/special-value category.
VFPCLASSSD/SS Test the low DP/SP FP element by numeric/special-value category.
VINSERTF64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VINSERTI64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VPMOVM2D/Q Convert opmask register to vector register in 32/64-bit granularity.
VPMOVD2M/Q2M Convert a vector register in 32/64-bit granularity to an opmask register.
VPMULLQ Multiply packed signed 64-bit integer elements of two vectors and store low 64-bit signed
result.
VRANGEPD/PS Perform RANGE operation on each pair of DP/SP FP elements of two vectors using specified
range primitive in imm8.
VRANGESD/SS Perform RANGE operation on the pair of low DP/SP FP element of two vectors using speci-
fied range primitive in imm8.

VREDUCEPD/PS Perform Reduction operation on packed DP/SP FP elements of a vector using specified
reduction primitive in imm8.
VREDUCESD/SS Perform Reduction operation on the low DP/SP FP element of a vector using specified
reduction primitive in imm8.

512-bit instruction mnemonics in AVX-512BW that are not Intel AVX or AVX2 promotions include:
VDBPSADBW Double block packed Sum-Absolute-Differences on unsigned bytes.
VMOVDQU8/16 VMOVDQU with 8/16-bit granular conditional update.
VPBLENDMB Replaces the VPBLENDVB instruction (using opmask as select control).
VPBLENDMW Blend word elements using opmask as select control.
VPBROADCASTB/W Broadcast from general-purpose register to vector register.
VPCMPB/UB Compare packed signed/unsigned bytes using specified primitive.
VPCMPW/UW Compare packed signed/unsigned words using specified primitive.
VPERMW Permute packed word elements.
VPERMI2B/W Full permute from two tables of byte/word elements overwriting the index vector.
VPMOVM2B/W Convert opmask register to vector register in 8/16-bit granularity.
VPMOVB2M/W2M Convert a vector register in 8/16-bit granularity to an opmask register.
VPMOV(S|US)WB Down convert word elements in a vector to byte elements using truncation (saturation |
unsigned saturation).
VPSLLVW Shift word elements in a vector left by shift counts in a vector.
VPSRAVW Shift words right by shift counts in a vector and shifting in sign bits.
VPSRLVW Shift word elements in a vector right by shift counts in a vector.
VPTESTNMB/W Perform bitwise NAND of byte/word elements of two vectors and write results to opmask.
VPTESTMB/W Perform bitwise AND of byte/word elements of two vectors and write results to opmask.

512-bit instruction mnemonics in AVX-512CD that are not Intel AVX or AVX2 promotions include:
VPBROADCASTM Broadcast from opmask register to vector register.
VPCONFLICTD/Q Detect conflicts within a vector of packed 32/64-bit integers.
VPLZCNTD/Q Count the number of leading zero bits of packed dword/qword elements.

Opmask instructions include:


KADDB/W/D/Q Add two 8/16/32/64-bit opmasks.
KANDB/W/D/Q Logical AND two 8/16/32/64-bit opmasks.
KANDNB/W/D/Q Logical AND NOT two 8/16/32/64-bit opmasks.
KMOVB/W/D/Q Move from or move to opmask register of 8/16/32/64-bit data.
KNOTB/W/D/Q Bitwise NOT of an 8/16/32/64-bit opmask.
KORB/W/D/Q Logical OR two 8/16/32/64-bit opmasks.
KORTESTB/W/D/Q Update EFLAGS according to the result of bitwise OR of two 8/16/32/64-bit opmasks.
KSHIFTLB/W/D/Q Shift left 8/16/32/64-bit opmask by specified count.
KSHIFTRB/W/D/Q Shift right 8/16/32/64-bit opmask by specified count.
KTESTB/W/D/Q Update EFLAGS according to the result of bitwise TEST of two 8/16/32/64-bit opmasks.
KUNPCKBW/WD/DQ Unpack and interleave two 8/16/32-bit opmasks into 16/32/64-bit mask.
KXNORB/W/D/Q Bitwise logical XNOR of two 8/16/32/64-bit opmasks.
KXORB/W/D/Q Logical XOR of two 8/16/32/64-bit opmasks.
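
A sketch of how compare results compose through the opmask instructions (not part of the manual's text; AVX-512F intrinsics assumed):

#include <immintrin.h>

/* Intersects two compare masks with KANDW, then uses the combined
   mask to select elements of a into src. */
__m512i select_eq_and_gt(__m512i a, __m512i b, __m512i c, __m512i d,
                         __m512i src)
{
    __mmask16 k1 = _mm512_cmpeq_epi32_mask(a, b);
    __mmask16 k2 = _mm512_cmpgt_epi32_mask(c, d);
    __mmask16 k  = _mm512_kand(k1, k2);        /* KANDW */
    return _mm512_mask_mov_epi32(src, k, a);   /* masked VMOVDQA32 */
}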

512-bit instruction mnemonics in AVX-512ER include:

VEXP2PD/PS Compute approximate base-2 exponential of packed DP/SP FP elements of a vector.
VEXP2SD/SS Compute approximate base-2 exponential of the low DP/SP FP element of a vector.
VRCP28PD/PS Compute approximate reciprocals to 28 bits of packed DP/SP FP elements of a vector.
VRCP28SD/SS Compute the approximate reciprocal to 28 bits of the low DP/SP FP element of a vector.
VRSQRT28PD/PS Compute approximate reciprocals of square roots to 28 bits of packed DP/SP FP elements
of a vector.
VRSQRT28SD/SS Compute the approximate reciprocal of square root to 28 bits of the low DP/SP FP element
of a vector.

512-bit instruction mnemonics in AVX-512PF include:


VGATHERPF0DPD/PS Sparse prefetch of packed DP/SP FP vector with T0 hint using dword indices.
VGATHERPF0QPD/PS Sparse prefetch of packed DP/SP FP vector with T0 hint using qword indices.
VGATHERPF1DPD/PS Sparse prefetch of packed DP/SP FP vector with T1 hint using dword indices.
VGATHERPF1QPD/PS Sparse prefetch of packed DP/SP FP vector with T1 hint using qword indices.
VSCATTERPF0DPD/PS Sparse prefetch of packed DP/SP FP vector with T0 hint to write using dword indices.
VSCATTERPF0QPD/PS Sparse prefetch of packed DP/SP FP vector with T0 hint to write using qword indices.
VSCATTERPF1DPD/PS Sparse prefetch of packed DP/SP FP vector with T1 hint to write using dword indices.
VSCATTERPF1QPD/PS Sparse prefetch of packed DP/SP FP vector with T1 hint to write using qword indices.

512-bit instruction mnemonics in AVX512-FP16 include:


VADDPH/SH Add packed/scalar FP16 values.
VCMPPH/SH Compare packed/scalar FP16 values.
VCOMISH Compare scalar ordered FP16 values and set EFLAGS.
VCVTDQ2PH Convert packed signed doubleword integers to packed FP16 values.
VCVTPD2PH Convert packed double precision FP values to packed FP16 values.
VCVTPH2DQ/QQ Convert packed FP16 values to signed doubleword/quadword integers.
VCVTPH2PD Convert packed FP16 values to FP64 values.
VCVTPH2PS[X] Convert packed FP16 values to single precision floating-point values.
VCVTPH2QQ Convert packed FP16 values to signed quadword integer values.
VCVTPH2UDQ/QQ Convert packed FP16 values to unsigned doubleword/quadword integers.
VCVTPH2UW/W Convert packed FP16 values to unsigned/signed word integers.
VCVTPS2PH[X] Convert packed single precision floating-point values to packed FP16 values.
VCVTQQ2PH Convert packed signed quadword integers to packed FP16 values.
VCVTSD2SH Convert low FP64 value to an FP16 value.
VCVTSH2SD/SS Convert low FP16 value to an FP64/FP32 value.
VCVTSH2SI/USI Convert low FP16 value to signed/unsigned integer.
VCVTSI2SH Convert a signed doubleword/quadword integer to an FP16 value.
VCVTSS2SH Convert low FP32 value to an FP16 value.
VCVTTPH2DQ/QQ Convert with truncation packed FP16 values to signed doubleword/quadword integers.
VCVTTPH2UDQ/QQ Convert with truncation packed FP16 values to unsigned doubleword/quadword integers.
VCVTTPH2UW/W Convert packed FP16 values to unsigned/signed word integers.
VCVTTSH2SI/USI Convert with truncation low FP16 value to a signed/unsigned integer.
VCVTUDQ2PH Convert packed unsigned doubleword integers to packed FP16 values.
VCVTUQQ2PH Convert packed unsigned quadword integers to packed FP16 values.
VCVTUSI2SH Convert unsigned doubleword integer to an FP16 value.
VCVTUW2PH Convert packed unsigned word integers to FP16 values.
VCVTW2PH Convert packed signed word integers to FP16 values.

VDIVPH/SH Divide packed/scalar FP16 values.
VF[C]MADDCPH Complex multiply and accumulate FP16 values.
VF[C]MADDCSH Complex multiply and accumulate scalar FP16 values.
VF[C]MULCPH Complex multiply FP16 values.
VF[C]MULCSH Complex multiply scalar FP16 values.
VF[,N]MADD[132,213,231]PH Fused multiply-add of packed FP16 values.
VF[,N]MADD[132,213,231]SH Fused multiply-add of scalar FP16 values.
VFMADDSUB[132,213,231]PH Fused multiply-alternating add/subtract of packed FP16 values.
VFMSUBADD[132,213,231]PH Fused multiply-alternating subtract/add of packed FP16 values.
VF[,N]MSUB[132,213,231]PH Fused multiply-subtract of packed FP16 values.
VF[,N]MSUB[132,213,231]SH Fused multiply-subtract of scalar FP16 values.
VFPCLASSPH/SH Test types of packed/scalar FP16 values.
VGETEXPPH/SH Convert exponents of packed/scalar FP16 values to FP16 values.
VGETMANTPH/SH Extract FP16 vector of normalized mantissas from FP16 vector/scalar.
VMAXPH/SH Return maximum of packed/scalar FP16 values.
VMINPH/SH Return minimum of packed/scalar FP16 values.
VMOVSH Move scalar FP16 value.
VMOVW Move word.
VMULPH/SH Multiply packed/scalar FP16 values.
VRCPPH/SH Compute reciprocals of packed/scalar FP16 values.
VREDUCEPH/SH Perform reduction transformation on packed/scalar FP16 values.
VRNDSCALEPH/SH Round packed/scalar FP16 values to include a given number of fraction bits.
VRSQRTPH/SH Compute reciprocals of square roots of packed/scalar FP16 values.
VSCALEPH/SH Scale packed/scalar FP16 values with FP16 values.
VSQRTPH/SH Compute square root of packed/scalar FP16 values.
VSUBPH/SH Subtract packed/scalar FP16 values.
VUCOMISH Unordered compare scalar FP16 values and set EFLAGS.

5.20 SYSTEM INSTRUCTIONS


The following system instructions are used to control those functions of the processor that are provided to
support operating systems and executives.
CLAC Clear AC Flag in EFLAGS register.
STAC Set AC Flag in EFLAGS register.
LGDT Load global descriptor table (GDT) register.
SGDT Store global descriptor table (GDT) register.
LLDT Load local descriptor table (LDT) register.
SLDT Store local descriptor table (LDT) register.
LTR Load task register.
STR Store task register.
LIDT Load interrupt descriptor table (IDT) register.
SIDT Store interrupt descriptor table (IDT) register.
MOV Load and store control registers.
LMSW Load machine status word.
SMSW Store machine status word.
CLTS Clear the task-switched flag.

ARPL Adjust requested privilege level.
LAR Load access rights.
LSL Load segment limit.
VERR Verify segment for reading.
VERW Verify segment for writing.
MOV Load and store debug registers.
INVD Invalidate cache, no writeback.
WBINVD Invalidate cache, with writeback.
INVLPG Invalidate TLB Entry.
INVPCID Invalidate Process-Context Identifier.
LOCK (prefix) Perform atomic access to memory (can be applied to a number of general purpose instruc-
tions that provide memory source/destination access).
HLT Halt processor.
RSM Return from system management mode (SMM).
RDMSR Read model-specific register.
WRMSR Write model-specific register.
RDPMC Read performance monitoring counters.
RDTSC Read time stamp counter.
RDTSCP Read time stamp counter and processor ID.
SYSENTER Fast System Call, transfers to a flat protected mode kernel at CPL = 0.
SYSEXIT Fast return from a fast system call, transfers back to flat protected mode user code at CPL = 3.
XSAVE Save processor extended states to memory.
XSAVEC Save processor extended states with compaction to memory.
XSAVEOPT Save processor extended states to memory, optimized.
XSAVES Save processor supervisor-mode extended states to memory.
XRSTOR Restore processor extended states from memory.
XRSTORS Restore processor supervisor-mode extended states from memory.
XGETBV Reads the state of an extended control register.
XSETBV Writes the state of an extended control register.
RDFSBASE Reads from FS base address at any privilege level.
RDGSBASE Reads from GS base address at any privilege level.
WRFSBASE Writes to FS base address at any privilege level.
WRGSBASE Writes to GS base address at any privilege level.
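
As an example of XGETBV in user code (a sketch, not part of the manual's text; compile with -mxsave and check CPUID.01H:ECX.OSXSAVE[bit 27] first):

#include <immintrin.h>

/* Reads XCR0 (XGETBV with ECX = 0). Bits 1 (SSE state) and 2 (AVX
   state) must both be set before relying on the OS to save and
   restore YMM registers. */
int os_supports_avx_state(void)
{
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
}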

5.21 64-BIT MODE INSTRUCTIONS


The following instructions are introduced in 64-bit mode. This mode is a sub-mode of IA-32e mode.
CDQE Convert doubleword to quadword.
CMPSQ Compare string operands.
CMPXCHG16B Compare RDX:RAX with m128.
LODSQ Load qword at address (R)SI into RAX.
MOVSQ Move qword from address (R)SI to (R)DI.
MOVZX (64-bits) Move bytes/words to doublewords/quadwords, zero-extension.
STOSQ Store RAX at address RDI.
SWAPGS Exchanges current GS base register value with value in MSR address C0000102H.
SYSCALL Fast call to privilege level 0 system procedures.

SYSRET Return from fast system call.

5.22 VIRTUAL-MACHINE EXTENSIONS


The behavior of the VMCS-maintenance instructions is summarized below:
VMPTRLD Takes a single 64-bit source operand in memory. It makes the referenced VMCS active and
current.
VMPTRST Takes a single 64-bit destination operand that is in memory. Current-VMCS pointer is
stored into the destination operand.
VMCLEAR Takes a single 64-bit operand in memory. The instruction sets the launch state of the VMCS
referenced by the operand to “clear”, renders that VMCS inactive, and ensures that data
for the VMCS have been written to the VMCS-data area in the referenced VMCS region.
VMREAD Reads a component from the VMCS (the encoding of that field is given in a register
operand) and stores it into a destination operand.
VMWRITE Writes a component to the VMCS (the encoding of that field is given in a register operand)
from a source operand.
The behavior of the VMX management instructions is summarized below:
VMLAUNCH Launches a virtual machine managed by the VMCS. A VM entry occurs, transferring control
to the VM.
VMRESUME Resumes a virtual machine managed by the VMCS. A VM entry occurs, transferring control
to the VM.
VMXOFF Causes the processor to leave VMX operation.
VMXON Takes a single 64-bit source operand in memory. It causes a logical processor to enter VMX
root operation and to use the memory referenced by the operand to support VMX opera-
tion.
The behavior of the VMX-specific TLB-management instructions is summarized below:
INVEPT Invalidate cached Extended Page Table (EPT) mappings in the processor to synchronize
address translation in virtual machines with memory-resident EPT pages.
INVVPID Invalidate cached mappings of address translation based on the Virtual Processor ID
(VPID).
None of the instructions above can be executed in compatibility mode; attempting to do so generates an
invalid-opcode exception.
The behavior of the guest-available instructions is summarized below:
VMCALL Allows a guest in VMX non-root operation to call the VMM for service. A VM exit occurs,
transferring control to the VMM.
VMFUNC Allows software in VMX non-root operation to invoke a VM function, which is processor
functionality enabled and configured by software in VMX root operation. No VM exit occurs.

5.23 SAFER MODE EXTENSIONS


The behavior of the GETSEC instruction leaf functions of the Safer Mode Extensions (SMX) is summarized below:
GETSEC[CAPABILITIES] Returns the available leaf functions of the GETSEC instruction.
GETSEC[ENTERACCS] Loads an authenticated code chipset module and enters authenticated code execution
mode.
GETSEC[EXITAC] Exits authenticated code execution mode.
GETSEC[SENTER] Establishes a Measured Launched Environment (MLE) which has its dynamic root of trust
anchored to a chipset supporting Intel Trusted Execution Technology.
GETSEC[SEXIT] Exits the MLE.
GETSEC[PARAMETERS] Returns SMX related parameter information.

GETSEC[SMCTRL] SMX mode control.
GETSEC[WAKEUP] Wakes up sleeping logical processors inside an MLE.

5.24 INTEL® MEMORY PROTECTION EXTENSIONS


Intel Memory Protection Extensions (Intel MPX) provides a set of instructions to enable software to add robust
bounds checking capability to memory references. Details of Intel MPX are described in Appendix E, “Intel®
Memory Protection Extensions.”
BNDMK Create a LowerBound and an UpperBound in a register.
BNDCL Check the address of a memory reference against a LowerBound.
BNDCU Check the address of a memory reference against an UpperBound in 1’s complement form.
BNDCN Check the address of a memory reference against an UpperBound not in 1’s complement
form.
BNDMOV Copy or load from memory of the LowerBound and UpperBound to a register.
BNDMOV Store to memory of the LowerBound and UpperBound from a register.
BNDLDX Load bounds using address translation.
BNDSTX Store bounds using address translation.

5.25 INTEL® SOFTWARE GUARD EXTENSIONS


Intel Software Guard Extensions (Intel SGX) provide two sets of instruction leaf functions to enable application
software to instantiate a protected container, referred to as an enclave. The enclave instructions are organized as
leaf functions under two instruction mnemonics: ENCLS (ring 0) and ENCLU (ring 3). Details of Intel SGX are
described in Chapter 35 through Chapter 40 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3D.
The first implementation of Intel SGX, also referred to as SGX1, was introduced with the 6th generation Intel
Core processors. The leaf functions supported in SGX1 are shown in Table 5-3.

Table 5-3. Supervisor and User Mode Enclave Instruction Leaf Functions in Long-Form of SGX1
Supervisor Instruction | Description                       | User Instruction | Description
ENCLS[EADD]            | Add a page                        | ENCLU[EENTER]    | Enter an Enclave
ENCLS[EBLOCK]          | Block an EPC page                 | ENCLU[EEXIT]     | Exit an Enclave
ENCLS[ECREATE]         | Create an enclave                 | ENCLU[EGETKEY]   | Create a cryptographic key
ENCLS[EDBGRD]          | Read data by debugger             | ENCLU[EREPORT]   | Create a cryptographic report
ENCLS[EDBGWR]          | Write data by debugger            | ENCLU[ERESUME]   | Re-enter an Enclave
ENCLS[EEXTEND]         | Extend EPC page measurement       |                  |
ENCLS[EINIT]           | Initialize an enclave             |                  |
ENCLS[ELDB]            | Load an EPC page as blocked       |                  |
ENCLS[ELDU]            | Load an EPC page as unblocked     |                  |
ENCLS[EPA]             | Add version array                 |                  |
ENCLS[EREMOVE]         | Remove a page from EPC            |                  |
ENCLS[ETRACK]          | Activate EBLOCK checks            |                  |
ENCLS[EWB]             | Write back/invalidate an EPC page |                  |

5.26 SHADOW STACK MANAGEMENT INSTRUCTIONS


Shadow stack management instructions allow the program and run-time to perform operations like recovering
from control protection faults, shadow stack switching, etc. The following instructions are provided.
CLRSSBSY Clear busy bit in a supervisor shadow stack token.
INCSSP Increment the shadow stack pointer (SSP).
RDSSP Read shadow stack pointer (SSP).
RSTORSSP Restore a shadow stack pointer (SSP).
SAVEPREVSSP Save previous shadow stack pointer (SSP).
SETSSBSY Set busy bit in a supervisor shadow stack token.
WRSS Write to a shadow stack.
WRUSS Write to a user mode shadow stack.

5.27 CONTROL TRANSFER TERMINATING INSTRUCTIONS


ENDBR32 Terminate an Indirect Branch in 32-bit and Compatibility Mode.
ENDBR64 Terminate an Indirect Branch in 64-bit Mode.

5.28 INTEL® AMX INSTRUCTIONS


LDTILECFG Load tile configuration.
STTILECFG Store tile configuration.
TDPBF16PS Dot product of BF16 tiles accumulated into packed single precision tile.
TDPBSSD Dot product of signed bytes with dword accumulation.
TDPBSUD Dot product of signed/unsigned bytes with dword accumulation.
TDPBUSD Dot product of unsigned/signed bytes with dword accumulation.
TDPBUUD Dot product of unsigned bytes with dword accumulation.
TILELOADD Load data into tile.
TILELOADDT1 Load data into tile with hint to optimize data caching.
TILERELEASE Release tile.
TILESTORED Store tile.
TILEZERO Zero tile.
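
A minimal sketch of the programming flow (not part of the manual's text; AMX intrinsics assumed, -mamx-tile and -mamx-int8, OS permission for AMX state granted, and the tile shapes below chosen for the example following the architectural tile-configuration layout):

#include <immintrin.h>
#include <string.h>

void amx_dot_s8(const void *a, const void *b, void *c)
{
    _Alignas(64) char cfg[64];
    memset(cfg, 0, sizeof cfg);
    cfg[0] = 1;                    /* palette 1 */
    for (int t = 0; t < 3; ++t) {  /* tiles 0..2: 16 rows x 64 bytes */
        cfg[16 + 2 * t] = 64;      /* bytes per row (colsb) */
        cfg[48 + t] = 16;          /* rows */
    }
    _tile_loadconfig(cfg);         /* LDTILECFG */
    _tile_zero(0);                 /* TILEZERO the accumulator */
    _tile_loadd(1, a, 64);         /* TILELOADD */
    _tile_loadd(2, b, 64);
    _tile_dpbssd(0, 1, 2);         /* TDPBSSD: signed-byte dot product */
    _tile_stored(0, c, 64);        /* TILESTORED */
    _tile_release();               /* TILERELEASE */
}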

5.29 USER INTERRUPT INSTRUCTIONS


CLUI Clear user interrupt flag.
SENDUIPI Send user interprocessor interrupt.
STUI Set user interrupt flag.
TESTUI Determine user interrupt flag.
UIRET User-interrupt return.

5.30 ENQUEUE STORE INSTRUCTIONS


ENQCMD Enqueue command.
ENQCMDS Enqueue command supervisor.


5.31 INTEL® ADVANCED VECTOR EXTENSIONS 10 VERSION 1 INSTRUCTIONS


Intel® Advanced Vector Extensions 10 Version 1 (Intel® AVX10.1) is based on the Intel AVX-512 ISA feature set
and includes all Intel AVX-512 instructions introduced with the Intel® Xeon® 6 P-core processor based on Granite
Rapids microarchitecture. Intel AVX10.1 supports all instruction vector lengths (128, 256, and 512), as well as
scalar and opmask instructions.
For a list of Intel AVX-512 instructions, see Section 5.19, “Intel® Advanced Vector Extensions 512 (Intel® AVX-
512).” Additionally, note that some Intel AVX and Intel AVX2 instructions were promoted to Intel AVX-512 and are
also supported. See Section 5.13, “Intel® Advanced Vector Extensions (Intel® AVX),” Section 5.16, “Intel®
Advanced Vector Extensions 2 (Intel® AVX2),” and Chapter 16, “Programming with Intel® AVX10,” for further
details.

NOTE
For instructions with a CPUID feature flag specifying AVX10, the programmer must check the
available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10
Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width, which in turn
determines the set of instructions available to the programmer, as listed in each instruction’s opcode
table.

4. Updates to Chapter 16, Volume 1
Change bars and violet text show changes to Chapter 16 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new Chapter 16, “Programming with Intel® AVX10.”



CHAPTER 16
PROGRAMMING WITH INTEL® AVX10

16.1 INTRODUCTION
Intel® Advanced Vector Extensions 10 (Intel® AVX10) represents the first major new vector ISA since the introduc-
tion of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) in 2013. This ISA establishes a common,
converged vector instruction set across all Intel architectures, incorporating the modern vectorization aspects of
Intel AVX-512. This ISA will be supported on all future processors, including Performance cores (P-cores) and Effi-
cient cores (E-cores).
The Intel AVX10 ISA represents the latest in ISA innovations, instructions, and features moving forward. Based on
the Intel AVX-512 ISA feature set and including all Intel AVX-512 instructions introduced with Intel® Xeon® 6 P-
core processors based on Granite Rapids microarchitecture, it supports all instruction vector lengths (128, 256, and
512), as well as scalar and opmask instructions. Implementations of Intel AVX10 with vector lengths of at least 256
bits will be supported across all Intel® processors.

16.2 FEATURES AND CAPABILITIES


The Intel AVX10 architecture introduces several features and capabilities beyond the Intel® AVX2 ISA:
• Version-based instruction set enumeration.
• Intel AVX10/256 − Converged implementation support on all Intel processors to include all the existing Intel
AVX-512 capabilities such as EVEX encoding, 32 vector registers, and eight mask registers at a maximum
vector length of 256 bits.
• Embedded rounding and Suppress All Exceptions (SAE) control for YMM (256-bit) versions of the instructions.
• VMX capability to create Intel AVX10/256 virtual machines that provide a hardware enforced Intel AVX10/256
execution environment on an Intel AVX10/512 capable processor.

16.3 FEATURE ENUMERATION


Intel AVX10 introduces a versioned approach for enumeration that is monotonically increasing, inclusive, and
supporting all vector lengths. This is introduced to simplify application development by ensuring that all Intel
processors support the same features and instructions at a given Intel AVX10 version number, as well as reduce the
number of CPUID feature flags required to be checked by an application to determine feature support. In this
enumeration paradigm, the application developer will only need to check three fields:
1. A CPUID feature flag indicating that the Intel AVX10 ISA is supported.
2. A version number to ensure that the supported version is greater than or equal to the desired version.
3. A vector length bit indicating the maximum supported vector length.
The “AVX10 Converged Vector ISA” feature flag indicates processor support for the ISA and the presence of an
“AVX10 Converged Vector ISA” leaf containing fields for the version number and the supported vector bit lengths.
See Table 16-1 for details.


Table 16-1. CPUID Enumeration of Intel® AVX10


CPUID Bit Description Type
CPUID.(EAX=07H, ECX=01H):EDX[bit 19] If 1, the Intel® AVX10 Converged Vector ISA is supported. Bit (0/1)
CPUID.(EAX=24H, ECX=00H):EAX[bits 31:0] Reports the maximum supported sub-leaf. Integer
CPUID.(EAX=24H, ECX=00H):EBX[bits 7:0] Reports the Intel AVX10 Converged Vector ISA version. Integer (≥ 1)
CPUID.(EAX=24H, ECX=00H):EBX[bits 15:8] Reserved. N/A
CPUID.(EAX=24H, ECX=00H):EBX[bit 16] Reserved. Always 1
CPUID.(EAX=24H, ECX=00H):EBX[bit 17] If 1, indicates that 256-bit vector support is present. Bit (0/1)
CPUID.(EAX=24H, ECX=00H):EBX[bit 18] If 1, indicates that 512-bit vector support is present. Bit (0/1)
CPUID.(EAX=24H, ECX=00H):EBX[bits 31:19] Reserved. N/A
CPUID.(EAX=24H, ECX=00H):ECX[bits 31:0] Reserved. N/A
CPUID.(EAX=24H, ECX=00H):EDX[bits 31:0] Reserved. N/A
CPUID.(EAX=24H, ECX=01H):EAX[bits 31:0] Reserved for discrete feature bits. N/A
CPUID.(EAX=24H, ECX=01H):EBX[bits 31:0] Reserved for discrete feature bits. N/A
CPUID.(EAX=24H, ECX=01H):ECX[bits 31:0] Reserved for discrete feature bits. N/A
CPUID.(EAX=24H, ECX=01H):EDX[bits 31:0] Reserved for discrete feature bits. N/A
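As an illustration only (not part of the architectural definition), the following C sketch reads the fields defined in
Table 16-1 using the GCC/Clang <cpuid.h> helpers; a production-quality check would also verify OS state-save
support via XGETBV/XCR0:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* CPUID.(EAX=07H, ECX=01H):EDX[bit 19]: AVX10 Converged Vector ISA. */
        if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx) ||
            !(edx & (1u << 19))) {
            puts("Intel AVX10 not supported");
            return 0;
        }

        /* CPUID.(EAX=24H, ECX=00H): version in EBX[7:0], vector lengths in
           EBX[17] (256-bit) and EBX[18] (512-bit), per Table 16-1. */
        __get_cpuid_count(0x24, 0x00, &eax, &ebx, &ecx, &edx);
        printf("Intel AVX10.%u, 256-bit: %u, 512-bit: %u\n",
               ebx & 0xff, (ebx >> 17) & 1, (ebx >> 18) & 1);
        return 0;
    }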

Several other important tenets regarding Intel AVX10 enumeration are as follows:
• Versions are expected to be inclusive such that version N+1 is a superset of version N. Once an instruction is
introduced in Intel AVX10.x, it is expected to be carried forward in all subsequent Intel AVX10 versions,
allowing a developer to check only for a version greater than or equal to the desired version.
• Any processor that enumerates support for Intel AVX10 will also enumerate support for Intel AVX and Intel
AVX2.
• Developers can assume that the highest supported vector length for a processor implies that all lesser vector
lengths are also supported. Scalar Intel AVX-512 instructions will be supported independent of the maximum
vector width.
The first version of Intel AVX10 (Version 1, or Intel® AVX10.1) will support only the Intel AVX-512 instruction set
at 128, 256, and 512 bits. Applications written to Intel AVX10.1 will run on any future Intel processor that enumer-
ates Intel AVX10.1 or higher at the matching desired vector lengths. Intel AVX-512 instruction families included in
Intel AVX10.1 are shown in Table 16-2.

Table 16-2. Intel® AVX-512 CPUID Feature Flags Included in Intel® AVX10
Feature Introduction: Included Intel® AVX-512 CPUID Feature Flags
Intel® Xeon® Scalable Processor Family based on Skylake microarchitecture: AVX512F, AVX512CD, AVX512BW, AVX512DQ
Intel® Core™ processors based on Cannon Lake microarchitecture: AVX512-VBMI, AVX512-IFMA
2nd generation Intel® Xeon® Scalable Processor Family based on Cascade Lake product: AVX512-VNNI
3rd generation Intel® Xeon® Scalable Processor Family based on Cooper Lake product: AVX512-BF16
3rd generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture: AVX512-VPOPCNTDQ, AVX512-VBMI2, VAES, GFNI, VPCLMULQDQ, AVX512-BITALG
4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture: AVX512-FP16

NOTE
VAES, VPCLMULQDQ, and GFNI EVEX instructions will be supported on Intel AVX10.1 machines but
will continue to be enumerated by their existing discrete CPUID feature flags. This requires the
developer to check for both the feature and Intel AVX10, e.g., {AVX10.1 AND VAES}.
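A minimal sketch of this combined check (assuming VAES is enumerated by CPUID.(EAX=07H, ECX=0):ECX[bit 9]
and using the GCC/Clang <cpuid.h> helpers):

    #include <cpuid.h>

    /* Returns nonzero if both Intel AVX10.1 (or higher) and VAES are present. */
    int has_avx10_1_and_vaes(void)
    {
        unsigned a, b, c, d;
        if (!__get_cpuid_count(0x07, 0, &a, &b, &c, &d) || !(c & (1u << 9)))
            return 0;                              /* VAES feature flag */
        if (!__get_cpuid_count(0x07, 1, &a, &b, &c, &d) || !(d & (1u << 19)))
            return 0;                              /* AVX10 supported */
        __get_cpuid_count(0x24, 0, &a, &b, &c, &d);
        return (b & 0xff) >= 1;                    /* AVX10 version >= 1 */
    }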


New vector ISA features will only be added to the Intel AVX10 ISA moving forward. While Intel AVX10/512 includes
all Intel AVX-512 instructions, it is important to note that applications compiled to Intel AVX-512 with vector length
limited to 256 bits are not guaranteed to be compatible with an Intel AVX10/256 processor.

Table 16-3. Feature Differences Between Intel® AVX-512 and Intel® AVX10
Feature Intel® AVX-512 Intel® AVX10.1/256 Intel® AVX10.1/512
128-bit vector (XMM) register support Yes Yes Yes
256-bit vector (YMM) register support Yes Yes Yes
512-bit vector (ZMM) register support Yes No Yes
YMM embedded rounding No No No
ZMM embedded rounding Yes No Yes

5. Updates to Chapter 2, Volume 2A
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A: Instruction Set Reference, A-L.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new exception Type 14 to existing tables.



CHAPTER 2
INSTRUCTION FORMAT

This chapter describes the instruction format for all Intel 64 and IA-32 processors. The instruction format for
protected mode, real-address mode and virtual-8086 mode is described in Section 2.1. Increments provided for IA-
32e mode and its sub-modes are described in Section 2.2.

2.1 INSTRUCTION FORMAT FOR PROTECTED MODE, REAL-ADDRESS MODE,


AND VIRTUAL-8086 MODE
The Intel 64 and IA-32 architectures instruction encodings are subsets of the format shown in Figure 2-1. Instruc-
tions consist of optional instruction prefixes (in any order), primary opcode bytes (up to three bytes), an
addressing-form specifier (if required) consisting of the ModR/M byte and sometimes the SIB (Scale-Index-Base)
byte, a displacement (if required), and an immediate data field (if required).

Instruction Prefixes (optional, 1 byte each)1, 2 | Opcode (1-, 2-, or 3-byte) | ModR/M (1 byte, if required) | SIB (1 byte, if required) | Address Displacement (1, 2, or 4 bytes, or none3) | Immediate Data (1, 2, or 4 bytes, or none3)

ModR/M byte: Mod (bits 7:6), Reg/Opcode (bits 5:3), R/M (bits 2:0)
SIB byte: Scale (bits 7:6), Index (bits 5:3), Base (bits 2:0)

1. The REX prefix is optional, but if used must be immediately before the opcode; see Section
2.2.1, “REX Prefixes” for additional information.
2. For VEX encoding information, see Section 2.3, “Intel® Advanced Vector Extensions (Intel®
AVX)”.
3. Some rare instructions can take an 8B immediate or 8B displacement.

Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format

2.1.1 Instruction Prefixes


Instruction prefixes are divided into four groups, each with a set of allowable prefix codes. For each instruction, it
is only useful to include up to one prefix code from each of the four groups (Groups 1, 2, 3, 4). Groups 1 through 4
may be placed in any order relative to each other.
• Group 1
— Lock and repeat prefixes:
• LOCK prefix is encoded using F0H.
• REPNE/REPNZ prefix is encoded using F2H. Repeat-Not-Zero prefix applies only to string and
input/output instructions. (F2H is also used as a mandatory prefix for some instructions.)
• REP or REPE/REPZ is encoded using F3H. The repeat prefix applies only to string and input/output
instructions. (F3H is also used as a mandatory prefix for some instructions.)


— BND prefix is encoded using F2H if the following conditions are true:
• CPUID.(EAX=07H, ECX=0):EBX.MPX[bit 14] is set.
• BNDCFGU.EN and/or IA32_BNDCFGS.EN is set.
• When the F2 prefix precedes a near CALL, a near RET, a near JMP, a short Jcc, or a near Jcc instruction
(see Appendix E, “Intel® Memory Protection Extensions,” of the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1).
• Group 2
— Segment override prefixes:
• 2EH—CS segment override (use with any branch instruction is reserved).
• 36H—SS segment override prefix (use with any branch instruction is reserved).
• 3EH—DS segment override prefix (use with any branch instruction is reserved).
• 26H—ES segment override prefix (use with any branch instruction is reserved).
• 64H—FS segment override prefix (use with any branch instruction is reserved).
• 65H—GS segment override prefix (use with any branch instruction is reserved).
— Branch hints1:
• 2EH—Branch not taken (used only with Jcc instructions).
• 3EH—Branch taken (used only with Jcc instructions).
• Group 3
• Operand-size override prefix is encoded using 66H (66H is also used as a mandatory prefix for some
instructions).
• Group 4
• 67H—Address-size override prefix.
The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor envi-
ronment. See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description
of this prefix.
Repeat prefixes (F2H, F3H) cause an instruction to be repeated for each element of a string. Use these prefixes
only with string and I/O instructions (MOVS, CMPS, SCAS, LODS, STOS, INS, and OUTS). Use of repeat prefixes
and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior.
Some instructions may use F2H or F3H as a mandatory prefix to express distinct functionality.
Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for
a branch when used on conditional branch instructions (Jcc).
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can
be the default; use of the prefix selects the non-default size.
Some SSE2/SSE3/SSSE3/SSE4 instructions and instructions using a three-byte sequence of primary opcode bytes
may use 66H as a mandatory prefix to express distinct functionality.
Other use of the 66H prefix is reserved; such use may cause unpredictable behavior.
The address-size override prefix (67H) allows programs to switch between 16- and 32-bit addressing. Either size
can be the default; the prefix selects the non-default size. Using this prefix and/or other undefined opcodes when
operands for the instruction do not reside in memory is reserved; such use may cause unpredictable behavior.

1. Microarchitectural behavior varies; refer to the Intel® 64 and IA-32 Architectures Optimization Reference Manual.


2.1.2 Opcodes
A primary opcode can be 1, 2, or 3 bytes in length. An additional 3-bit opcode field is sometimes encoded in the
ModR/M byte. Smaller fields can be defined within the primary opcode. Such fields define the direction of opera-
tion, size of displacements, register encoding, condition codes, or sign extension. Encoding fields used by an
opcode vary depending on the class of operation.
Two-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode and a second opcode byte.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, and a second opcode byte (same as previous
bullet).
For example, CVTDQ2PD consists of the following sequence: F3 0F E6. The first byte is a mandatory prefix (it is not
considered as a repeat prefix).
Three-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode, plus two additional opcode bytes.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, plus two additional opcode bytes (same as
previous bullet).
For example, PHADDW for XMM registers consists of the following sequence: 66 0F 38 01. The first byte is the
mandatory prefix.
Valid opcode expressions are defined in Appendix A and Appendix B.

2.1.3 ModR/M and SIB Bytes


Many instructions that refer to an operand in memory have an addressing-form specifier byte (called the ModR/M
byte) following the primary opcode. The ModR/M byte contains three fields of information:
• The mod field combines with the r/m field to form 32 possible values: eight registers and 24 addressing modes.
• The reg/opcode field specifies either a register number or three more bits of opcode information. The purpose
of the reg/opcode field is specified in the primary opcode.
• The r/m field can specify a register as an operand or it can be combined with the mod field to encode an
addressing mode. Sometimes, certain combinations of the mod field and the r/m field are used to express
opcode information for some instructions.
Certain encodings of the ModR/M byte require a second addressing byte (the SIB byte). The base-plus-index and
scale-plus-index forms of 32-bit addressing require the SIB byte. The SIB byte includes the following fields:
• The scale field specifies the scale factor.
• The index field specifies the register number of the index register.
• The base field specifies the register number of the base register.
See Section 2.1.5 for the encodings of the ModR/M and SIB bytes.
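For illustration, the SIB field boundaries can be expressed as a small C sketch (the helper names are not defined
by this manual):

    #include <stdint.h>

    /* Sketch: extracting the three SIB fields. The scale factor applied
       to the index register is 1 << scale. */
    static inline unsigned sib_scale(uint8_t sib) { return (sib >> 6) & 0x3; } /* bits 7:6 */
    static inline unsigned sib_index(uint8_t sib) { return (sib >> 3) & 0x7; } /* bits 5:3 */
    static inline unsigned sib_base(uint8_t sib)  { return sib & 0x7; }        /* bits 2:0 */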

2.1.4 Displacement and Immediate Bytes


Some addressing forms include a displacement immediately following the ModR/M byte (or the SIB byte if one is
present). If a displacement is required, it can be 1, 2, or 4 bytes.
If an instruction specifies an immediate operand, the operand always follows any displacement bytes. An imme-
diate operand can be 1, 2 or 4 bytes.


2.1.5 Addressing-Mode Encoding of ModR/M and SIB Bytes


The values and corresponding addressing forms of the ModR/M and SIB bytes are shown in Table 2-1 through Table
2-3: 16-bit addressing forms specified by the ModR/M byte are in Table 2-1 and 32-bit addressing forms are in
Table 2-2. Table 2-3 shows 32-bit addressing forms specified by the SIB byte. In cases where the reg/opcode field
in the ModR/M byte represents an extended opcode, valid encodings are shown in Appendix B.
In Table 2-1 and Table 2-2, the Effective Address column lists 32 effective addresses that can be assigned to the
first operand of an instruction by using the Mod and R/M fields of the ModR/M byte. The first 24 options provide
ways of specifying a memory location; the last eight (Mod = 11B) provide ways of specifying general-purpose, MMX
technology and XMM registers.
The Mod and R/M columns in Table 2-1 and Table 2-2 give the binary encodings of the Mod and R/M fields required
to obtain the effective address listed in the first column. For example: see the row indicated by Mod = 11B, R/M =
000B. The row identifies the general-purpose registers EAX, AX or AL; MMX technology register MM0; or XMM
register XMM0. The register used is determined by the opcode byte and the operand-size attribute.
Now look at the seventh row in either table (labeled “REG =”). This row specifies the use of the 3-bit Reg/Opcode
field when the field is used to give the location of a second operand. The second operand must be a general-
purpose, MMX technology, or XMM register. Rows one through five list the registers that may correspond to the
value in the table. Again, the register used is determined by the opcode byte along with the operand-size attribute.
If the instruction does not require a second operand, then the Reg/Opcode field may be used as an opcode exten-
sion. This use is represented by the sixth row in the tables (labeled “/digit (Opcode)”). Note that values in row six
are represented in decimal form.
The body of Table 2-1 and Table 2-2 (under the label “Value of ModR/M Byte (in Hexadecimal)”) contains a
32-by-8 array that presents all 256 values of the ModR/M byte (in hexadecimal). Bits 3, 4, and 5 are specified by the
column of the table in which a byte resides. The row specifies bits 0, 1, and 2; and bits 6 and 7. The figure below
demonstrates interpretation of one table value.

C8H = 11001000B: Mod = 11, Reg/Opcode = 001 (/digit (Opcode); REG = 001), R/M = 000

Figure 2-2. Table Interpretation of ModR/M Byte (C8H)
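The same decomposition can be written as a small C sketch (illustrative only):

    #include <stdio.h>

    int main(void)
    {
        unsigned char modrm = 0xC8;          /* 11 001 000B, as in Figure 2-2 */
        printf("mod=%u reg=%u rm=%u\n",
               (modrm >> 6) & 3,             /* 11B:  mod = 3 */
               (modrm >> 3) & 7,             /* 001B: reg/opcode = 1 */
               modrm & 7);                   /* 000B: r/m = 0 */
        return 0;
    }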


Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte


r8(/r) AL CL DL BL AH CH DH BH
r16(/r) AX CX DX BX SP BP1 SI DI
r32(/r) EAX ECX EDX EBX ESP EBP ESI EDI
mm(/r) MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7
xmm(/r) XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7
(In decimal) /digit (Opcode) 0 1 2 3 4 5 6 7
(In binary) REG = 000 001 010 011 100 101 110 111
Effective Address Mod R/M Value of ModR/M Byte (in Hexadecimal)
[BX+SI] 00 000 00 08 10 18 20 28 30 38
[BX+DI] 001 01 09 11 19 21 29 31 39
[BP+SI] 010 02 0A 12 1A 22 2A 32 3A
[BP+DI] 011 03 0B 13 1B 23 2B 33 3B
[SI] 100 04 0C 14 1C 24 2C 34 3C
[DI] 101 05 0D 15 1D 25 2D 35 3D
disp162 110 06 0E 16 1E 26 2E 36 3E
[BX] 111 07 0F 17 1F 27 2F 37 3F
[BX+SI]+disp83 01 000 40 48 50 58 60 68 70 78
[BX+DI]+disp8 001 41 49 51 59 61 69 71 79
[BP+SI]+disp8 010 42 4A 52 5A 62 6A 72 7A
[BP+DI]+disp8 011 43 4B 53 5B 63 6B 73 7B
[SI]+disp8 100 44 4C 54 5C 64 6C 74 7C
[DI]+disp8 101 45 4D 55 5D 65 6D 75 7D
[BP]+disp8 110 46 4E 56 5E 66 6E 76 7E
[BX]+disp8 111 47 4F 57 5F 67 6F 77 7F
[BX+SI]+disp16 10 000 80 88 90 98 A0 A8 B0 B8
[BX+DI]+disp16 001 81 89 91 99 A1 A9 B1 B9
[BP+SI]+disp16 010 82 8A 92 9A A2 AA B2 BA
[BP+DI]+disp16 011 83 8B 93 9B A3 AB B3 BB
[SI]+disp16 100 84 8C 94 9C A4 AC B4 BC
[DI]+disp16 101 85 8D 95 9D A5 AD B5 BD
[BP]+disp16 110 86 8E 96 9E A6 AE B6 BE
[BX]+disp16 111 87 8F 97 9F A7 AF B7 BF
EAX/AX/AL/MM0/XMM0 11 000 C0 C8 D0 D8 E0 E8 F0 F8
ECX/CX/CL/MM1/XMM1 001 C1 C9 D1 D9 E1 E9 F1 F9
EDX/DX/DL/MM2/XMM2 010 C2 CA D2 DA E2 EA F2 FA
EBX/BX/BL/MM3/XMM3 011 C3 CB D3 DB E3 EB F3 FB
ESP/SP/AH/MM4/XMM4 100 C4 CC D4 DC E4 EC F4 FC
EBP/BP/CH/MM5/XMM5 101 C5 CD D5 DD E5 ED F5 FD
ESI/SI/DH/MM6/XMM6 110 C6 CE D6 DE E6 EE F6 FE
EDI/DI/BH/MM7/XMM7 111 C7 CF D7 DF E7 EF F7 FF

NOTES:
1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
2. The disp16 nomenclature denotes a 16-bit displacement that follows the ModR/M byte and that is added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte and that is sign-extended and added to the
index.


Table 2-2. 32-Bit Addressing Forms with the ModR/M Byte


r8(/r) AL CL DL BL AH CH DH BH
r16(/r) AX CX DX BX SP BP SI DI
r32(/r) EAX ECX EDX EBX ESP EBP ESI EDI
mm(/r) MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7
xmm(/r) XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7
(In decimal) /digit (Opcode) 0 1 2 3 4 5 6 7
(In binary) REG = 000 001 010 011 100 101 110 111
Effective Address Mod R/M Value of ModR/M Byte (in Hexadecimal)
[EAX] 00 000 00 08 10 18 20 28 30 38
[ECX] 001 01 09 11 19 21 29 31 39
[EDX] 010 02 0A 12 1A 22 2A 32 3A
[EBX] 011 03 0B 13 1B 23 2B 33 3B
[--][--]1 100 04 0C 14 1C 24 2C 34 3C
disp322 101 05 0D 15 1D 25 2D 35 3D
[ESI] 110 06 0E 16 1E 26 2E 36 3E
[EDI] 111 07 0F 17 1F 27 2F 37 3F
[EAX]+disp83 01 000 40 48 50 58 60 68 70 78
[ECX]+disp8 001 41 49 51 59 61 69 71 79
[EDX]+disp8 010 42 4A 52 5A 62 6A 72 7A
[EBX]+disp8 011 43 4B 53 5B 63 6B 73 7B
[--][--]+disp8 100 44 4C 54 5C 64 6C 74 7C
[EBP]+disp8 101 45 4D 55 5D 65 6D 75 7D
[ESI]+disp8 110 46 4E 56 5E 66 6E 76 7E
[EDI]+disp8 111 47 4F 57 5F 67 6F 77 7F
[EAX]+disp32 10 000 80 88 90 98 A0 A8 B0 B8
[ECX]+disp32 001 81 89 91 99 A1 A9 B1 B9
[EDX]+disp32 010 82 8A 92 9A A2 AA B2 BA
[EBX]+disp32 011 83 8B 93 9B A3 AB B3 BB
[--][--]+disp32 100 84 8C 94 9C A4 AC B4 BC
[EBP]+disp32 101 85 8D 95 9D A5 AD B5 BD
[ESI]+disp32 110 86 8E 96 9E A6 AE B6 BE
[EDI]+disp32 111 87 8F 97 9F A7 AF B7 BF
EAX/AX/AL/MM0/XMM0 11 000 C0 C8 D0 D8 E0 E8 F0 F8
ECX/CX/CL/MM1/XMM1 001 C1 C9 D1 D9 E1 E9 F1 F9
EDX/DX/DL/MM2/XMM2 010 C2 CA D2 DA E2 EA F2 FA
EBX/BX/BL/MM3/XMM3 011 C3 CB D3 DB E3 EB F3 FB
ESP/SP/AH/MM4/XMM4 100 C4 CC D4 DC E4 EC F4 FC
EBP/BP/CH/MM5/XMM5 101 C5 CD D5 DD E5 ED F5 FD
ESI/SI/DH/MM6/XMM6 110 C6 CE D6 DE E6 EE F6 FE
EDI/DI/BH/MM7/XMM7 111 C7 CF D7 DF E7 EF F7 FF

NOTES:
1. The [--][--] nomenclature means a SIB follows the ModR/M byte.
2. The disp32 nomenclature denotes a 32-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
sign-extended and added to the index.

Table 2-3 is organized to give 256 possible values of the SIB byte (in hexadecimal). General purpose registers used
as a base are indicated across the top of the table, along with corresponding values for the SIB byte’s base field.
Table rows in the body of the table indicate the register used as the index (SIB byte bits 3, 4, and 5) and the scaling
factor (determined by SIB byte bits 6 and 7).


Table 2-3. 32-Bit Addressing Forms with the SIB Byte


r32 EAX ECX EDX EBX ESP [*] ESI EDI
(In decimal) Base = 0 1 2 3 4 5 6 7
(In binary) Base = 000 001 010 011 100 101 110 111
Scaled Index SS Index Value of SIB Byte (in Hexadecimal)
[EAX] 00 000 00 01 02 03 04 05 06 07
[ECX] 001 08 09 0A 0B 0C 0D 0E 0F
[EDX] 010 10 11 12 13 14 15 16 17
[EBX] 011 18 19 1A 1B 1C 1D 1E 1F
none 100 20 21 22 23 24 25 26 27
[EBP] 101 28 29 2A 2B 2C 2D 2E 2F
[ESI] 110 30 31 32 33 34 35 36 37
[EDI] 111 38 39 3A 3B 3C 3D 3E 3F
[EAX*2] 01 000 40 41 42 43 44 45 46 47
[ECX*2] 001 48 49 4A 4B 4C 4D 4E 4F
[EDX*2] 010 50 51 52 53 54 55 56 57
[EBX*2] 011 58 59 5A 5B 5C 5D 5E 5F
none 100 60 61 62 63 64 65 66 67
[EBP*2] 101 68 69 6A 6B 6C 6D 6E 6F
[ESI*2] 110 70 71 72 73 74 75 76 77
[EDI*2] 111 78 79 7A 7B 7C 7D 7E 7F
[EAX*4] 10 000 80 81 82 83 84 85 86 87
[ECX*4] 001 88 89 8A 8B 8C 8D 8E 8F
[EDX*4] 010 90 91 92 93 94 95 96 97
[EBX*4] 011 98 99 9A 9B 9C 9D 9E 9F
none 100 A0 A1 A2 A3 A4 A5 A6 A7
[EBP*4] 101 A8 A9 AA AB AC AD AE AF
[ESI*4] 110 B0 B1 B2 B3 B4 B5 B6 B7
[EDI*4] 111 B8 B9 BA BB BC BD BE BF
[EAX*8] 11 000 C0 C1 C2 C3 C4 C5 C6 C7
[ECX*8] 001 C8 C9 CA CB CC CD CE CF
[EDX*8] 010 D0 D1 D2 D3 D4 D5 D6 D7
[EBX*8] 011 D8 D9 DA DB DC DD DE DF
none 100 E0 E1 E2 E3 E4 E5 E6 E7
[EBP*8] 101 E8 E9 EA EB EC ED EE EF
[ESI*8] 110 F0 F1 F2 F3 F4 F5 F6 F7
[EDI*8] 111 F8 F9 FA FB FC FD FE FF

NOTES:
1. The [*] nomenclature means a disp32 with no base if the MOD is 00B. Otherwise, [*] means disp8 or disp32 + [EBP]. This provides the
following address modes:
MOD bits Effective Address
00 [scaled index] + disp32
01 [scaled index] + disp8 + [EBP]
10 [scaled index] + disp32 + [EBP]

2.2 IA-32E MODE


IA-32e mode has two sub-modes. These are:
• Compatibility Mode. Enables a 64-bit operating system to run most legacy protected mode software
unmodified.
• 64-Bit Mode. Enables a 64-bit operating system to run applications written to access 64-bit address space.

2.2.1 REX Prefixes


REX prefixes are instruction-prefix bytes used in 64-bit mode. They do the following:
• Specify GPRs and SSE registers.


• Specify 64-bit operand size.


• Specify extended control registers.
Not all instructions require a REX prefix in 64-bit mode. A REX prefix is necessary only if an instruction references
one of the extended registers or one of the byte registers SPL, BPL, SIL, DIL; or uses a 64-bit operand. A REX prefix
is ignored, as are its individual bits, when it is not needed for an instruction or when it does not immediately
precede the opcode byte or the escape opcode byte (0FH) of an instruction for which it is needed. This has the
implication that only one REX prefix, properly located, can affect an instruction.
When a REX prefix is used in conjunction with an instruction containing a mandatory prefix, the mandatory prefix
must come before the REX so the REX prefix can immediately precede the opcode or the escape byte. For example,
CVTDQ2PD with a REX prefix should have REX placed between F3 and 0F E6. Other placements are ignored. The
instruction-size limit of 15 bytes still applies to instructions with a REX prefix. See Figure 2-3.

Legacy Prefixes (Grp 1, Grp 2, Grp 3, Grp 4; optional) | REX Prefix (optional) | Opcode (1-, 2-, or 3-byte) | ModR/M (1 byte, if required) | SIB (1 byte, if required) | Address Displacement (1, 2, or 4 bytes, or none) | Immediate Data (1, 2, or 4 bytes, or none)

Figure 2-3. Prefix Ordering in 64-bit Mode

2.2.1.1 Encoding
Intel 64 and IA-32 instruction formats specify up to three registers by using 3-bit fields in the encoding, depending
on the format:
• ModR/M: the reg and r/m fields of the ModR/M byte.
• ModR/M with SIB: the reg field of the ModR/M byte, the base and index fields of the SIB (scale, index, base)
byte.
• Instructions without ModR/M: the reg field of the opcode.
In 64-bit mode, these formats do not change. Bits needed to define fields in the 64-bit context are provided by the
addition of REX prefixes.

2.2.1.2 More on REX Prefix Fields


REX prefixes are a set of 16 opcodes that span one row of the opcode map and occupy entries 40H to 4FH. These
opcodes represent valid instructions (INC or DEC) in IA-32 operating modes and in compatibility mode. In 64-bit
mode, the same opcodes represent the instruction prefix REX and are not treated as individual instructions.
The single-byte-opcode forms of the INC/DEC instructions are not available in 64-bit mode. INC/DEC functionality
is still available using ModR/M forms of the same instructions (opcodes FF/0 and FF/1).
See Table 2-4 for a summary of the REX prefix format. Figure 2-4 through Figure 2-7 show examples of REX prefix
fields in use. Some combinations of REX prefix fields are invalid. In such cases, the prefix is ignored. Some addi-
tional information follows:
• Setting REX.W can be used to determine the operand size but does not solely determine operand width. Like
the 66H size prefix, 64-bit operand size override has no effect on byte-specific operations.
• For non-byte operations: if a 66H prefix is used with REX.W = 1, the 66H prefix is ignored.
• If a 66H override is used with REX and REX.W = 0, the operand size is 16 bits.
• REX.R modifies the ModR/M reg field when that field encodes a GPR, SSE, control or debug register. REX.R is
ignored when ModR/M specifies other registers or defines an extended opcode.
• REX.X bit modifies the SIB index field.


• REX.B either modifies the base in the ModR/M r/m field or SIB base field; or it modifies the opcode reg field
used for accessing GPRs.

Table 2-4. REX Prefix Fields [BITS: 0100WRXB]


Field Name Bit Position Definition
- 7:4 0100
W 3 0 = Operand size determined by CS.D
1 = 64 Bit Operand Size
R 2 Extension of the ModR/M reg field
X 1 Extension of the SIB index field
B 0 Extension of the ModR/M r/m field, SIB base field, or Opcode reg field
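As a sketch, a REX byte can be composed from the fields of Table 2-4 in C (the helper name is illustrative, not
defined by this manual):

    /* Sketch: building a REX prefix from its W, R, X, and B bits. */
    static unsigned char rex_prefix(int w, int r, int x, int b)
    {
        return (unsigned char)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
    }
    /* Example: rex_prefix(1, 0, 0, 0) yields 48H, the REX.W prefix seen in
       the MOV RAX example in Section 2.2.1.5. */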

REX prefix 0100WR0B, followed by the opcode and a ModR/M byte with mod ≠ 11, reg = rrr, r/m = bbb. REX.R extends reg to Rrrr; REX.B extends r/m to Bbbb.

Figure 2-4. Memory Addressing Without an SIB Byte; REX.X Not Used

REX prefix 0100WR0B, followed by the opcode and a ModR/M byte with mod = 11, reg = rrr, r/m = bbb. REX.R extends reg to Rrrr; REX.B extends r/m to Bbbb.

Figure 2-5. Register-Register Addressing (No Memory Operand); REX.X Not Used


REX prefix 0100WRXB, followed by the opcode, a ModR/M byte with mod ≠ 11, reg = rrr, r/m = 100, and a SIB byte with scale = ss, index = xxx, base = bbb. REX.R extends reg to Rrrr; REX.X extends index to Xxxx; REX.B extends base to Bbbb.

Figure 2-6. Memory Addressing With a SIB Byte

REX prefix 0100W00B, followed by an opcode with reg = bbb coded in the opcode byte. REX.B extends reg to Bbbb.

Figure 2-7. Register Operand Coded in Opcode Byte; REX.X & REX.R Not Used

In the IA-32 architecture, byte registers (AH, AL, BH, BL, CH, CL, DH, and DL) are encoded in the ModR/M byte’s
reg field, the r/m field or the opcode reg field as registers 0 through 7. REX prefixes provide an additional
addressing capability for byte-registers that makes the least-significant byte of GPRs available for byte operations.
Certain combinations of the fields of the ModR/M byte and the SIB byte have special meaning for register encod-
ings. For some combinations, fields expanded by the REX prefix are not decoded. Table 2-5 describes how each
case behaves.

Table 2-5. Special Cases of REX Encodings

ModR/M Byte: mod ≠ 11, r/m = b*100 (ESP)
Compatibility mode operation: SIB byte present.
Compatibility mode implications: SIB byte required for ESP-based addressing.
Additional implications: The REX prefix adds a fourth bit (b), which is not decoded (don’t care). The SIB byte is also required for R12-based addressing.

ModR/M Byte: mod = 0, r/m = b*101 (EBP)
Compatibility mode operation: Base register not used.
Compatibility mode implications: EBP without a displacement must be encoded using mod = 01 with a displacement of 0.
Additional implications: The REX prefix adds a fourth bit (b), which is not decoded (don’t care). Using RBP or R13 without a displacement must be done using mod = 01 with a displacement of 0.

SIB Byte: index = 0100 (ESP)
Compatibility mode operation: Index register not used.
Compatibility mode implications: ESP cannot be used as an index register.
Additional implications: The REX prefix adds a fourth bit (b), which is decoded; there are no additional implications. The expanded index field allows distinguishing RSP from R12, so R12 can be used as an index.

SIB Byte: base = 0101 (EBP)
Compatibility mode operation: Base register is unused if mod = 0.
Compatibility mode implications: Base register depends on mod encoding.
Additional implications: The REX prefix adds a fourth bit (b), which is not decoded. This requires an explicit displacement to be used with EBP/RBP or R13.

NOTE: In the encodings b*100 and b*101, the value of REX.B (b) is a don’t care.

2.2.1.3 Displacement
Addressing in 64-bit mode uses existing 32-bit ModR/M and SIB encodings. The ModR/M and SIB displacement
sizes do not change. They remain 8 bits or 32 bits and are sign-extended to 64 bits.

2.2.1.4 Direct Memory-Offset MOVs


In 64-bit mode, direct memory-offset forms of the MOV instruction are extended to specify a 64-bit immediate
absolute address. This address is called a moffset. No prefix is needed to specify this 64-bit memory offset. For
these MOV instructions, the size of the memory offset follows the address-size default (64 bits in 64-bit mode). See
Table 2-6.

Table 2-6. Direct Memory Offset Form of MOV


Opcode Instruction
A0 MOV AL, moffset
A1 MOV EAX, moffset
A2 MOV moffset, AL
A3 MOV moffset, EAX

2.2.1.5 Immediates
In 64-bit mode, the typical size of immediate operands remains 32 bits. When the operand size is 64 bits, the
processor sign-extends all immediates to 64 bits prior to their use.
Support for 64-bit immediate operands is accomplished by expanding the semantics of the existing move (MOV
reg, imm16/32) instructions. These instructions (opcodes B8H – BFH) move 16-bits or 32-bits of immediate data
(depending on the effective operand size) into a GPR. When the effective operand size is 64 bits, these instructions
can be used to load an immediate into a GPR. A REX prefix is needed to override the 32-bit default operand size to
a 64-bit operand size.
For example:

48 B8 8877665544332211 MOV RAX,1122334455667788H

2.2.1.6 RIP-Relative Addressing


A new addressing form, RIP-relative (relative instruction-pointer) addressing, is implemented in 64-bit mode. An
effective address is formed by adding displacement to the 64-bit RIP of the next instruction.
In IA-32 architecture and compatibility mode, addressing relative to the instruction pointer is available only with
control-transfer instructions. In 64-bit mode, instructions that use ModR/M addressing can use RIP-relative
addressing. Without RIP-relative addressing, all ModR/M modes address memory relative to zero.
RIP-relative addressing allows specific ModR/M modes to address memory relative to the 64-bit RIP using a signed
32-bit displacement. This provides an offset range of ±2GB from the RIP. Table 2-7 shows the ModR/M and SIB
encodings for RIP-relative addressing. Redundant forms of 32-bit displacement-addressing exist in the current
ModR/M and SIB encodings. There is one ModR/M encoding and there are several SIB encodings. RIP-relative
addressing is encoded using a redundant form.


In 64-bit mode, the ModR/M Disp32 (32-bit displacement) encoding is re-defined to be RIP+Disp32 rather than
displacement-only. See Table 2-7.

Table 2-7. RIP-Relative Addressing

ModR/M Byte: mod = 00, r/m = 101 (none)
Compatibility mode operation: Disp32.
64-bit mode operation: RIP + Disp32.
Additional implications in 64-bit mode: To use a Disp32 without specifying a base register, use a SIB byte encoding (indicated by ModR/M.r/m = 100) as described in the next row.

SIB Byte: base = 101 (none), index = 100 (none), scale = 0, 1, 2, 4
Compatibility mode operation: If mod = 00, Disp32.
64-bit mode operation: Same as legacy.
Additional implications in 64-bit mode: None.

The ModR/M encoding for RIP-relative addressing does not depend on using a prefix. Specifically, the r/m bit field
encoding of 101B (used to select RIP-relative addressing) is not affected by the REX prefix. For example, selecting
R13 (REX.B = 1, r/m = 101B) with mod = 00B still results in RIP-relative addressing. The 4-bit r/m field of REX.B
combined with ModR/M is not fully decoded. In order to address R13 with no displacement, software must encode
R13 + 0 using a 1-byte displacement of zero.
RIP-relative addressing is enabled by 64-bit mode, not by a 64-bit address-size. The use of the address-size prefix
does not disable RIP-relative addressing. The effect of the address-size prefix is to truncate and zero-extend the
computed effective address to 32 bits.
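The computation described above can be summarized in a C sketch (illustrative only):

    #include <stdint.h>

    /* Sketch: RIP-relative effective address. rip_next is the 64-bit address
       of the next instruction; disp32 is sign-extended before the add. */
    uint64_t rip_relative_ea(uint64_t rip_next, int32_t disp32, int asize32)
    {
        uint64_t ea = rip_next + (int64_t)disp32;
        /* A 67H address-size prefix truncates and zero-extends the computed
           effective address to 32 bits. */
        return asize32 ? (uint32_t)ea : ea;
    }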

2.2.1.7 Default 64-Bit Operand Size


In 64-bit mode, two groups of instructions have a default operand size of 64 bits (do not need a REX prefix for this
operand size). These are:
• Near branches.
• All instructions, except far branches, that implicitly reference the RSP.

2.2.2 Additional Encodings for Control and Debug Registers


In 64-bit mode, more encodings for control and debug registers are available. The REX.R bit is used to modify the
ModR/M reg field when that field encodes a control or debug register (see Table 2-4). These encodings enable the
processor to address CR8-CR15 and DR8-DR15. An additional control register (CR8) is defined in 64-bit mode. CR8
becomes the Task Priority Register (TPR).
In the first implementation of IA-32e mode, CR9-CR15 and DR8-DR15 are not implemented. Any attempt to access
unimplemented registers results in an invalid-opcode exception (#UD).


2.3 INTEL® ADVANCED VECTOR EXTENSIONS (INTEL® AVX)


Intel AVX instructions are encoded using an encoding scheme that combines prefix bytes, opcode extension field,
operand encoding fields, and vector length encoding capability into a new prefix, referred to as VEX. In the VEX
encoding scheme, the VEX prefix may be two or three bytes long, depending on the instruction semantics. Despite
the two-byte or three-byte length of the VEX prefix, the VEX encoding format provides a more compact represen-
tation/packing of the components of encoding an instruction in Intel 64 architecture. The VEX encoding scheme
also allows more headroom for future growth of Intel 64 architecture.

2.3.1 Instruction Format


Instruction encoding using VEX prefix provides several advantages:
• Instruction syntax support for three operands and up to four operands when necessary. For example, the third
source register used by VBLENDVPD is encoded using bits 7:4 of the immediate byte.
• Encoding support for vector length of 128 bits (using XMM registers) and 256 bits (using YMM registers).
• Encoding support for instruction syntax of non-destructive source operands.
• Elimination of escape opcode byte (0FH), SIMD prefix byte (66H, F2H, F3H) via a compact bit field represen-
tation within the VEX prefix.
• Elimination of the need to use REX prefix to encode the extended half of general-purpose register sets (R8-
R15) for direct register access, memory addressing, or accessing XMM8-XMM15 (including YMM8-YMM15).
• Flexible and more compact bit fields are provided in the VEX prefix to retain the full functionality provided by
REX prefix. REX.W, REX.X, REX.B functionalities are provided in the three-byte VEX prefix only because only a
subset of SIMD instructions need them.
• Extensibility for future instruction extensions without significant instruction length increase.
Figure 2-8 shows the Intel 64 instruction encoding format with VEX prefix support. Legacy instructions without a
VEX prefix are fully supported and unchanged. The use of VEX prefix in an Intel 64 instruction is optional, but a VEX
prefix is required for Intel 64 instructions that operate on YMM registers or support three and four operand syntax.
VEX prefix is not a constant-valued, “single-purpose” byte like 0FH, 66H, F2H, F3H in legacy SSE instructions. VEX
prefix provides substantially richer capability than the REX prefix.

[Prefixes] | [VEX] (2 or 3 bytes) | OPCODE (1 byte) | ModR/M (1 byte) | [SIB] (0 or 1 byte) | [DISP] (0, 1, 2, or 4 bytes) | [IMM] (0 or 1 byte)

Figure 2-8. Instruction Encoding Format with VEX Prefix

2.3.2 VEX and the LOCK prefix


Any VEX-encoded instruction with a LOCK prefix preceding VEX will #UD.

2.3.3 VEX and the 66H, F2H, and F3H prefixes


Any VEX-encoded instruction with a 66H, F2H, or F3H prefix preceding VEX will #UD.

2.3.4 VEX and the REX prefix


Any VEX-encoded instruction with a REX prefix preceding VEX will #UD.


2.3.5 The VEX Prefix


The VEX prefix is encoded in either the two-byte form (the first byte must be C5H) or in the three-byte form (the
first byte must be C4H). The two-byte VEX is used mainly for 128-bit, scalar, and the most common 256-bit AVX
instructions; while the three-byte VEX provides a compact replacement of REX and 3-byte opcode instructions
(including AVX and FMA instructions). Beyond the first byte, the VEX prefix consists of a number of bit fields
providing specific capabilities; they are shown in Figure 2-9.
The bit fields of the VEX prefix can be summarized by its functional purposes:
• Non-destructive source register encoding (applicable to three and four operand syntax): This is the first source
operand in the instruction syntax. It is represented by the notation, VEX.vvvv. This field is encoded using 1’s
complement form (inverted form), i.e., XMM0/YMM0/R0 is encoded as 1111B, XMM15/YMM15/R15 is encoded
as 0000B.
• Vector length encoding: This 1-bit field represented by the notation VEX.L. L= 0 means vector length is 128 bits
wide, L=1 means 256 bit vector. The value of this field is written as VEX.128 or VEX.256 in this document to
distinguish encoded values of other VEX bit fields.
• REX prefix functionality: Full REX prefix functionality is provided in the three-byte form of VEX prefix. However
the VEX bit fields providing REX functionality are encoded using 1’s complement form, i.e., XMM0/YMM0/R0 is
encoded as 1111B, XMM15/YMM15/R15 is encoded as 0000B.
— Two-byte form of the VEX prefix only provides the equivalent functionality of REX.R, using 1’s complement
encoding. This is represented as VEX.R.
— Three-byte form of the VEX prefix provides REX.R, REX.X, REX.B functionality using 1’s complement
encoding and three dedicated bit fields represented as VEX.R, VEX.X, VEX.B.
— Three-byte form of the VEX prefix provides the functionality of REX.W only to specific instructions that need
to override default 32-bit operand size for a general purpose register to 64-bit size in 64-bit mode. For
those applicable instructions, VEX.W field provides the same functionality as REX.W. VEX.W field can
provide completely different functionality for other instructions.
Consequently, the use of REX prefix with VEX encoded instructions is not allowed. However, the intent of the
REX prefix for expanding register set is reserved for future instruction set extensions using VEX prefix
encoding format.
• Compaction of SIMD prefix: Legacy SSE instructions effectively use SIMD prefixes (66H, F2H, F3H) as an
opcode extension field. VEX prefix encoding allows the functional capability of such legacy SSE instructions
(operating on XMM registers, bits 255:128 of corresponding YMM unmodified) to be encoded using the VEX.pp
field without the presence of any SIMD prefix. The VEX-encoded 128-bit instruction will zero-out bits 255:128
of the destination register. VEX-encoded instructions may have 128-bit or 256-bit vector length.
• Compaction of two-byte and three-byte opcode: More recently introduced legacy SSE instructions employ two
and three-byte opcode. The one or two leading bytes are: 0FH, and 0FH 3AH/0FH 38H. The one-byte escape
(0FH) and two-byte escape (0FH 3AH, 0FH 38H) can also be interpreted as an opcode extension field. The
VEX.mmmmm field provides compaction to allow many legacy instructions to be encoded without the constant
byte sequence, 0FH, 0FH 3AH, 0FH 38H. These VEX-encoded instructions may have 128-bit or 256-bit vector
length.
The VEX prefix is required to be the last prefix; it immediately precedes the opcode bytes and must follow any other
prefixes. If a VEX prefix is present, a REX prefix is not supported.
The 3-byte VEX leaves room for future expansion with 3 reserved bits. REX and the 66h/F2h/F3h prefixes are
reclaimed for future use.
The VEX prefix has a two-byte form and a three-byte form. If an instruction syntax can be encoded using the two-byte
form, it can also be encoded using the three-byte form of VEX. The latter increases the length of the instruction by
one byte. This may be helpful in some situations for code alignment.
The VEX prefix supports 256-bit versions of floating-point SSE, SSE2, SSE3, and SSE4 instructions. Note, certain
new instruction functionality can only be encoded with the VEX prefix.
The VEX prefix will #UD on any instruction containing MMX register sources or destinations.


3-byte VEX:
Byte 0 (bits 7:0): 11000100 (C4H)
Byte 1: R (bit 7), X (bit 6), B (bit 5), m-mmmm (bits 4:0)
Byte 2: W (bit 7), vvvv (bits 6:3), L (bit 2), pp (bits 1:0)

2-byte VEX:
Byte 0 (bits 7:0): 11000101 (C5H)
Byte 1: R (bit 7), vvvv (bits 6:3), L (bit 2), pp (bits 1:0)

R: REX.R in 1’s complement (inverted) form


1: Same as REX.R=0 (must be 1 in 32-bit mode)
0: Same as REX.R=1 (64-bit mode only)
X: REX.X in 1’s complement (inverted) form
1: Same as REX.X=0 (must be 1 in 32-bit mode)
0: Same as REX.X=1 (64-bit mode only)
B: REX.B in 1’s complement (inverted) form
1: Same as REX.B=0 (Ignored in 32-bit mode).
0: Same as REX.B=1 (64-bit mode only)
W: opcode specific (use like REX.W, or used for opcode
extension, or ignored, depending on the opcode byte)
m-mmmm:
00000: Reserved for future use (will #UD)
00001: implied 0F leading opcode byte
00010: implied 0F 38 leading opcode bytes
00011: implied 0F 3A leading opcode bytes
00100-11111: Reserved for future use (will #UD)

vvvv: a register specifier (in 1’s complement form) or 1111 if unused.

L: Vector Length
0: scalar or 128-bit vector
1: 256-bit vector

pp: opcode extension providing equivalent functionality of a SIMD prefix


00: None
01: 66
10: F3
11: F2

Figure 2-9. VEX bit fields
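To make the field positions concrete, the following C sketch unpacks the two payload bytes of a three-byte VEX
prefix (an illustration of Figure 2-9, not a complete decoder):

    #include <stdint.h>

    struct vex3_fields {
        unsigned r, x, b;   /* REX.R/X/B equivalents, after un-inverting */
        unsigned mmmmm;     /* implied leading opcode bytes              */
        unsigned w;         /* opcode-specific                           */
        unsigned vvvv;      /* register specifier, after un-inverting    */
        unsigned l;         /* 0: scalar/128-bit, 1: 256-bit             */
        unsigned pp;        /* implied SIMD prefix                       */
    };

    /* byte1 and byte2 are the two bytes following the C4H byte. */
    struct vex3_fields decode_vex3(uint8_t byte1, uint8_t byte2)
    {
        struct vex3_fields f;
        f.r     = ((byte1 >> 7) & 1) ^ 1;   /* stored inverted */
        f.x     = ((byte1 >> 6) & 1) ^ 1;   /* stored inverted */
        f.b     = ((byte1 >> 5) & 1) ^ 1;   /* stored inverted */
        f.mmmmm =  byte1 & 0x1f;
        f.w     = (byte2 >> 7) & 1;
        f.vvvv  = ~(byte2 >> 3) & 0xf;      /* stored inverted */
        f.l     = (byte2 >> 2) & 1;
        f.pp    =  byte2 & 0x3;
        return f;
    }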

The following subsections describe the various fields in two or three-byte VEX prefix.

2.3.5.1 VEX Byte 0, bits[7:0]


VEX Byte 0, bits [7:0] must contain the value 11000101b (C5h) or 11000100b (C4h). The 3-byte VEX uses the C4h
first byte, while the 2-byte VEX uses the C5h first byte.

2.3.5.2 VEX Byte 1, bit [7] - ‘R’


VEX Byte 1, bit [7] contains a bit analogous to a bit-inverted REX.R. In protected and compatibility modes, the bit
must be set to ‘1’; otherwise, the instruction is LES or LDS.


This bit is present in both 2- and 3-byte VEX prefixes.


The usage of the WRXB bits for legacy instructions is explained in detail in Section 2.2.1.2 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 2A.
This bit is stored in bit inverted format.

2.3.5.3 3-byte VEX byte 1, bit[6] - ‘X’


Bit[6] of the 3-byte VEX byte 1 encodes a bit analogous to a bit inverted REX.X. It is an extension of the SIB Index
field in 64-bit modes. In 32-bit modes, this bit must be set to ‘1’; otherwise, the instruction is LES or LDS.
This bit is available only in the 3-byte VEX prefix.
This bit is stored in bit inverted format.

2.3.5.4 3-byte VEX byte 1, bit[5] - ‘B’


Bit[5] of the 3-byte VEX byte 1 encodes a bit analogous to a bit inverted REX.B. In 64-bit modes, it is an extension
of the ModR/M r/m field, or the SIB base field. In 32-bit modes, this bit is ignored.
This bit is available only in the 3-byte VEX prefix.
This bit is stored in bit inverted format.

2.3.5.5 3-byte VEX byte 2, bit[7] - ‘W’


Bit[7] of the 3-byte VEX byte 2 is represented by the notation VEX.W. It can provide the following functions,
depending on the specific opcode.
• For AVX instructions that have equivalent legacy SSE instructions (typically these SSE instructions have a
general-purpose register operand with its operand size attribute promotable by REX.W), if REX.W promotes
the operand size attribute of the general-purpose register operand in legacy SSE instruction, VEX.W has same
meaning in the corresponding AVX equivalent form. In 32-bit modes for these instructions, VEX.W is silently
ignored.
• For AVX instructions that have equivalent legacy SSE instructions (typically these SSE instructions have oper-
ands with their operand size attribute fixed and not promotable by REX.W), if REX.W is don’t care in legacy
SSE instruction, VEX.W is ignored in the corresponding AVX equivalent form irrespective of mode.
• For new AVX instructions where VEX.W has no defined function (typically the combination of the opcode byte
and VEX.mmmmm has no equivalent SSE function), VEX.W is reserved as zero; setting it to other than zero will
cause the instruction to #UD.

2.3.5.6 2-byte VEX Byte 1, bits[6:3] and 3-byte VEX Byte 2, bits [6:3]- ‘vvvv’ the Source or Dest
Register Specifier
In 32-bit mode, the VEX first bytes C4 and C5 alias onto the LES and LDS instructions. To maintain compatibility with
existing programs, the VEX 2nd byte, bits [7:6], must be 11b. To achieve this, the VEX payload bits are selected to
place only inverted, 64-bit valid fields (extended register selectors) in these upper bits.
The 2-byte VEX Byte 1, bits [6:3] and the 3-byte VEX, Byte 2, bits [6:3] encode a field (shorthand VEX.vvvv) that
for instructions with 2 or more source registers and an XMM or YMM or memory destination encodes the first source
register specifier stored in inverted (1’s complement) form.
VEX.vvvv is not used by instructions with one source (except certain shifts, see below) or by instructions with
no XMM, YMM, or memory destination. If an instruction does not use VEX.vvvv, then it should be set to 1111b;
otherwise the instruction will #UD.
In 64-bit mode all 4 bits may be used. See Table 2-8 for the encoding of the XMM or YMM registers. In 32-bit and 16-
bit modes, bit 6 must be 1 (if bit 6 is not 1, the 2-byte VEX version will generate the LDS instruction and the 3-byte VEX
version will ignore this bit).


Table 2-8. VEX.vvvv to Register Name Mapping


VEX.vvvv    Dest Register    General-Purpose Register (If Applicable)1    Valid in Legacy/Compatibility 32-bit Modes?2
1111B XMM0/YMM0 RAX/EAX Valid
1110B XMM1/YMM1 RCX/ECX Valid
1101B XMM2/YMM2 RDX/EDX Valid
1100B XMM3/YMM3 RBX/EBX Valid
1011B XMM4/YMM4 RSP/ESP Valid
1010B XMM5/YMM5 RBP/EBP Valid
1001B XMM6/YMM6 RSI/ESI Valid
1000B XMM7/YMM7 RDI/EDI Valid
0111B XMM8/YMM8 R8/R8D Invalid
0110B XMM9/YMM9 R9/R9D Invalid
0101B XMM10/YMM10 R10/R10D Invalid
0100B XMM11/YMM11 R11/R11D Invalid
0011B XMM12/YMM12 R12/R12D Invalid
0010B XMM13/YMM13 R13/R13D Invalid
0001B XMM14/YMM14 R14/R14D Invalid
0000B XMM15/YMM15 R15/R15D Invalid
NOTES:
1. See Section 2.6, “VEX Encoding Support for GPR Instructions” for additional details.
2. Only the first eight General-Purpose Registers are accessible/encodable in 16/32b modes.

The VEX.vvvv field is encoded in bit inverted format for accessing a register operand.
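Expressed as a sketch, the mapping of Table 2-8 is simply the 1’s complement of the 4-bit register number
(illustrative C helpers, not defined by this manual):

    /* Sketch: VEX.vvvv is the 1's complement of the register number.
       vvvv = 1111B selects register 0; vvvv = 0000B selects register 15. */
    unsigned vvvv_to_regnum(unsigned vvvv) { return ~vvvv & 0xf; }
    unsigned regnum_to_vvvv(unsigned reg)  { return ~reg & 0xf; }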

2.3.6 Instruction Operand Encoding and VEX.vvvv, ModR/M


VEX-encoded instructions support three-operand and four-operand instruction syntax. Some VEX-encoded
instructions have syntax with fewer than three operands (e.g., VEX-encoded packed shift instructions support one
source operand and one destination operand).
The roles of VEX.vvvv, reg field of ModR/M byte (ModR/M.reg), r/m field of ModR/M byte (ModR/M.r/m) with
respect to encoding destination and source operands vary with different type of instruction syntax.
The role of VEX.vvvv can be summarized to three situations:
• VEX.vvvv encodes the first source register operand, specified in inverted (1’s complement) form and is valid for
instructions with 2 or more source operands.
• VEX.vvvv encodes the destination register operand, specified in 1’s complement form for certain vector shifts.
The instructions where VEX.vvvv is used as a destination are listed in Table 2-9. The notation in the “Opcode”
column in Table 2-9 is described in detail in section 3.1.1.
• VEX.vvvv does not encode any operand, the field is reserved and should contain 1111b.

Table 2-9. Instructions with a VEX.vvvv Destination


Opcode Instruction mnemonic
VEX.128.66.0F 73 /7 ib VPSLLDQ xmm1, xmm2, imm8
VEX.128.66.0F 73 /3 ib VPSRLDQ xmm1, xmm2, imm8

VEX.128.66.0F 71 /2 ib VPSRLW xmm1, xmm2, imm8
VEX.128.66.0F 72 /2 ib VPSRLD xmm1, xmm2, imm8
VEX.128.66.0F 73 /2 ib VPSRLQ xmm1, xmm2, imm8
VEX.128.66.0F 71 /4 ib VPSRAW xmm1, xmm2, imm8
VEX.128.66.0F 72 /4 ib VPSRAD xmm1, xmm2, imm8
VEX.128.66.0F 71 /6 ib VPSLLW xmm1, xmm2, imm8
VEX.128.66.0F 72 /6 ib VPSLLD xmm1, xmm2, imm8
VEX.128.66.0F 73 /6 ib VPSLLQ xmm1, xmm2, imm8

The role of ModR/M.r/m field can be summarized to two situations:


• ModR/M.r/m encodes the instruction operand that references a memory address.
• For some instructions that do not support memory addressing semantics, ModR/M.r/m encodes either the
destination register operand or a source register operand.
The role of ModR/M.reg field can be summarized to two situations:
• ModR/M.reg encodes either the destination register operand or a source register operand.
• For some instructions, ModR/M.reg is treated as an opcode extension and not used to encode any instruction
operand.
For instruction syntax that supports four operands, VEX.vvvv, ModR/M.r/m, and ModR/M.reg encode three of the four
operands. The role of bits 7:4 of the immediate byte serves the following situation:
• Imm8[7:4] encodes the third source register operand.

2.3.6.1 3-byte VEX byte 1, bits[4:0] - “m-mmmm”


Bits[4:0] of the 3-byte VEX byte 1 encode an implied leading opcode byte (0F, 0F 38, or 0F 3A). Several bits are
reserved for future use and will #UD unless 0.

Table 2-10. VEX.m-mmmm Interpretation


VEX.m-mmmm Implied Leading Opcode Bytes
00000B Reserved
00001B 0F
00010B 0F 38
00011B 0F 3A
00100-11111B Reserved
(2-byte VEX) 0F

VEX.m-mmmm is only available on the 3-byte VEX. The 2-byte VEX implies a leading 0Fh opcode byte.

2.3.6.2 2-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- “L”
The vector length field, VEX.L, is encoded in bit[2] of either the second byte of 2-byte VEX, or the third byte of 3-
byte VEX. If “VEX.L = 1”, it indicates 256-bit vector operation. “VEX.L = 0” indicates scalar and 128-bit vector
operations.
The instruction VZEROUPPER is a special case that is encoded with VEX.L = 0, although its operation zeroes bits
255:128 of all YMM registers accessible in the current operating mode. See Table 2-11.


Table 2-11. VEX.L Interpretation


VEX.L Vector Length
0 128-bit (or 32/64-bit scalar)
1 256-bit

2.3.6.3 2-byte VEX byte 1, bits[1:0], and 3-byte VEX byte 2, bits [1:0]- “pp”
Up to one implied prefix is encoded by bits[1:0] of either the 2-byte VEX byte 1 or the 3-byte VEX byte 2. The prefix
behaves as if it was encoded prior to VEX, but after all other encoded prefixes. See Table 2-12.

Table 2-12. VEX.pp Interpretation


pp Implies this prefix after other prefixes but before VEX
00B None
01B 66
10B F3
11B F2

2.3.7 The Opcode Byte


One (and only one) opcode byte follows the 2- or 3-byte VEX. Legal opcodes are specified in Appendix B, in color.
Any instruction that uses an illegal opcode will #UD.

2.3.8 The ModR/M, SIB, and Displacement Bytes


The encodings are unchanged but the interpretation of reg_field or rm_field differs (see above).

2.3.9 The Third Source Operand (Immediate Byte)


VEX-encoded instructions can support a four-operand syntax. VBLENDVPD, VBLENDVPS, and
PBLENDVB use imm8[7:4] to encode one of the source registers.

2.3.10 Intel® AVX Instructions and the Upper 128-bits of YMM registers
If an instruction with a destination XMM register is encoded with a VEX prefix, the processor zeroes the upper bits
(above bit 128) of the equivalent YMM register. Legacy SSE instructions without VEX preserve the upper bits.

2.3.10.1 Vector Length Transition and Programming Considerations


An instruction encoded with a VEX.128 prefix that loads a YMM register operand operates as follows:
• Data is loaded into bits 127:0 of the register
• Bits above bit 127 in the register are cleared.
Thus, such an instruction clears bits 255:128 of a destination YMM register on processors with a maximum vector-
register width of 256 bits. In the event that future processors extend the vector registers to greater widths, an
instruction encoded with a VEX.128 or VEX.256 prefix will also clear any bits beyond bit 255. (This is in contrast
with legacy SSE instructions, which have no VEX prefix; these modify only bits 127:0 of any destination register
operand.)
Programmers should bear in mind that instructions encoded with VEX.128 and VEX.256 prefixes will clear any
future extensions to the vector registers. A calling function that uses such extensions should save their state before
calling legacy functions. This is not possible for involuntary calls (e.g., into an interrupt-service routine). It is
recommended that software handling involuntary calls accommodate this by not executing instructions encoded
with VEX.128 and VEX.256 prefixes. In the event that it is not possible or desirable to restrict these instructions,
then software must take special care to avoid actions that would, on future processors, zero the upper bits of vector
registers.
Processors that support further vector-register extensions (defining bits beyond bit 255) will also extend the
XSAVE and XRSTOR instructions to save and restore these extensions. To ensure forward compatibility, software
that handles involuntary calls and that uses instructions encoded with VEX.128 and VEX.256 prefixes should first
save and then restore the vector registers (with any extensions) using the XSAVE and XRSTOR instructions with
save/restore masks that set bits that correspond to all vector-register extensions. Ideally, software should rely on
a mechanism that is cognizant of which bits to set. (E.g., an OS mechanism that sets the save/restore mask bits
for all vector-register extensions that are enabled in XCR0.) Saving and restoring state with instructions other than
XSAVE and XRSTOR will, on future processors with wider vector registers, corrupt the extended state of the vector
registers - even if doing so functions correctly on processors supporting 256-bit vector registers. (The same is true
if XSAVE and XRSTOR are used with a save/restore mask that does not set bits corresponding to all supported
extensions to the vector registers.)
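A minimal sketch of this recommendation, assuming GCC/Clang with the XSAVE intrinsics enabled (-mxsave); the 4096-byte area is only a placeholder assumption (the real size must come from CPUID.(EAX=0DH, ECX=0):EBX), and the all-ones mask requests every XCR0-enabled component:

#include <immintrin.h>
#include <stdint.h>

/* Zero-initialized, 64-byte aligned save area; the size is an assumption. */
static unsigned char save_area[4096] __attribute__((aligned(64)));

void call_with_state_preserved(void (*involuntary_or_legacy_fn)(void))
{
    uint64_t mask = ~0ULL;         /* all enabled vector-register extensions */
    _xsave64(save_area, mask);     /* XSAVE: save enabled state components */
    involuntary_or_legacy_fn();
    _xrstor64(save_area, mask);    /* XRSTOR: restore them */
}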

2.3.11 Intel® AVX Instruction Length


The Intel AVX instructions described in this document (including VEX and ignoring other prefixes) do not exceed 11
bytes in length, but may increase in the future. The maximum length of an Intel 64 and IA-32 instruction remains
15 bytes.

2.3.12 Vector SIB (VSIB) Memory Addressing


In Intel® Advanced Vector Extensions 2 (Intel® AVX2), an SIB byte that follows the ModR/M byte can support VSIB
memory addressing to an array of linear addresses. VSIB addressing is only supported in a subset of Intel AVX2
instructions. VSIB memory addressing requires 32-bit or 64-bit effective address. In 32-bit mode, VSIB addressing
is not supported when address size attribute is overridden to 16 bits. In 16-bit protected mode, VSIB memory
addressing is permitted if address size attribute is overridden to 32 bits. Additionally, VSIB memory addressing is
supported only with VEX prefix.
In VSIB memory addressing, the SIB byte consists of:
• The scale field (bits 7:6), which specifies the scale factor.
• The index field (bits 5:3), which specifies the register number of the vector index register; each element in the
vector register specifies an index.
• The base field (bits 2:0), which specifies the register number of the base register.
Table 2-13 shows the 32-bit VSIB addressing form. It is organized to give 256 possible values of the SIB byte (in
hexadecimal). General-purpose registers used as a base are indicated across the top of the table, along with
corresponding values for the SIB byte’s base field. The register names also include R8D-R15D, which are applicable
only in 64-bit mode (when the address size override prefix is used; the value of VEX.B is not shown in Table 2-13).
In 32-bit mode, R8D-R15D do not apply.
Table rows in the body of the table indicate the vector index register used as the index field, with each supported
scaling factor shown separately. Vector registers used in the index field can be XMM or YMM registers. The left-
most column includes vector registers VR8-VR15 (i.e., XMM8/YMM8-XMM15/YMM15), which are only available in
64-bit mode and do not apply if encoding in 32-bit mode.
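For illustration (not SDM material): the AVX2 gather intrinsics compile to VSIB-addressed instructions; the base register comes from SIB.base, the per-element indices from the vector register named by SIB.index, scaled by SIB.scale:

#include <immintrin.h>

/* VPGATHERDD: loads eight dwords from base + 4 * idx[i]. */
__m256i gather_eight_ints(const int *base, __m256i idx)
{
    return _mm256_i32gather_epi32(base, idx, 4);   /* scale factor = 4 */
}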


Table 2-13. 32-Bit VSIB Addressing Forms of the SIB Byte


r32 (base register) | EAX/R8D | ECX/R9D | EDX/R10D | EBX/R11D | ESP/R12D | EBP/R13D(1) | ESI/R14D | EDI/R15D
Base (in decimal) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Base (in binary) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
Scaled Index SS Index Value of SIB Byte (in Hexadecimal)
VR0/VR8 *1 00 000 00 01 02 03 04 05 06 07
VR1/VR9 001 08 09 0A 0B 0C 0D 0E 0F
VR2/VR10 010 10 11 12 13 14 15 16 17
VR3/VR11 011 18 19 1A 1B 1C 1D 1E 1F
VR4/VR12 100 20 21 22 23 24 25 26 27
VR5/VR13 101 28 29 2A 2B 2C 2D 2E 2F
VR6/VR14 110 30 31 32 33 34 35 36 37
VR7/VR15 111 38 39 3A 3B 3C 3D 3E 3F
VR0/VR8 *2 01 000 40 41 42 43 44 45 46 47
VR1/VR9 001 48 49 4A 4B 4C 4D 4E 4F
VR2/VR10 010 50 51 52 53 54 55 56 57
VR3/VR11 011 58 59 5A 5B 5C 5D 5E 5F
VR4/VR12 100 60 61 62 63 64 65 66 67
VR5/VR13 101 68 69 6A 6B 6C 6D 6E 6F
VR6/VR14 110 70 71 72 73 74 75 76 77
VR7/VR15 111 78 79 7A 7B 7C 7D 7E 7F
VR0/VR8 *4 10 000 80 81 82 83 84 85 86 87
VR1/VR9 001 88 89 8A 8B 8C 8D 8E 8F
VR2/VR10 010 90 91 92 93 94 95 96 97
VR3/VR11 011 98 99 9A 9B 9C 9D 9E 9F
VR4/VR12 100 A0 A1 A2 A3 A4 A5 A6 A7
VR5/VR13 101 A8 A9 AA AB AC AD AE AF
VR6/VR14 110 B0 B1 B2 B3 B4 B5 B6 B7
VR7/VR15 111 B8 B9 BA BB BC BD BE BF
VR0/VR8 *8 11 000 C0 C1 C2 C3 C4 C5 C6 C7
VR1/VR9 001 C8 C9 CA CB CC CD CE CF
VR2/VR10 010 D0 D1 D2 D3 D4 D5 D6 D7
VR3/VR11 011 D8 D9 DA DB DC DD DE DF
VR4/VR12 100 E0 E1 E2 E3 E4 E5 E6 E7
VR5/VR13 101 E8 E9 EA EB EC ED EE EF
VR6/VR14 110 F0 F1 F2 F3 F4 F5 F6 F7
VR7/VR15 111 F8 F9 FA FB FC FD FE FF
NOTES:
1. If ModR/M.mod = 00b, the base address is zero and the effective address is computed as [scaled vector index] + disp32. Otherwise
the base address is computed as [EBP/R13] + disp, where the displacement is either 8-bit or 32-bit depending on the value of
ModR/M.mod:
MOD Effective Address
00b [Scaled Vector Register] + Disp32
01b [Scaled Vector Register] + Disp8 + [EBP/R13]
10b [Scaled Vector Register] + Disp32 + [EBP/R13]

2.3.12.1 64-bit Mode VSIB Memory Addressing


In 64-bit mode, VSIB memory addressing uses the VEX.B field and the base field of the SIB byte to encode one of
the 16 general-purpose registers as the base register. The VEX.X field and the index field of the SIB byte encode one
of the 16 vector registers as the vector index register.
In 64-bit mode, the base registers in the top row of Table 2-13 should be interpreted as the full 64-bit width of each register.

2.4 INTEL® ADVANCED MATRIX EXTENSIONS (INTEL® AMX)


Intel® AMX instructions follow the general documentation convention established in previous sections. Additionally,
Intel® Advanced Matrix Extensions use notation conventions as described below.
In the instruction encoding boxes, sibmem is used to denote an encoding where a ModR/M byte and SIB byte are
used to indicate a memory operation where the base and displacement are used to point to memory, and the index
register (if present) is used to denote a stride between memory rows. The index register is scaled by the sib.scale
field as usual. The base register is added to the displacement, if present.
In the instruction encoding, the ModR/M byte is represented several ways depending on the role it plays. The
ModR/M byte has three fields: a 2-bit ModR/M.mod field, a 3-bit ModR/M.reg field, and a 3-bit ModR/M.r/m field. When all
bits of the ModR/M byte have fixed values for an instruction, the 2-hex nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the ModR/M byte
must contain fixed values, those values are specified as follows:
• If only ModR/M.mod must be 0b11, and the ModR/M.reg and ModR/M.r/m fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr bits correspond to the 3 bits of the ModR/M.reg field and the bbb bits correspond
to the 3 bits of the ModR/M.r/m field.
• If the ModR/M.mod field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then the notation !(11) is used.
• If the ModR/M.reg field had a specific required value, e.g., 0b101, that would be denoted as mm:101:bbb.

NOTE
Historically this document only specified the ModR/M.reg field restrictions with the notation /0 ... /7
and did not specify restrictions on the ModR/M.mod and ModR/M.r/m fields in the encoding boxes.

2.5 INTEL® AVX AND INTEL® SSE INSTRUCTION EXCEPTION CLASSIFICATION


To look up the exceptions of legacy 128-bit SIMD instructions, 128-bit VEX-encoded instructions, and 256-bit VEX-
encoded instructions, Table 2-14 summarizes the exception behavior into separate classes, with detailed exception
conditions defined in Sections 2.5.1 through 2.6.1. For example, ADDPS contains the entry:
“See Exceptions Type 2.”
In this entry, “Type 2” can be looked up in Table 2-19.
The instruction’s corresponding CPUID feature flag can be identified in the fourth column of the instruction
summary table.
Note: #UD when a CPUID feature flag is 0 is not guaranteed in a virtualized environment if the hardware supports the
feature flag.

NOTE
Instructions that operate only with MMX, X87, or general-purpose registers are not covered by the
exception classes defined in this section. For instructions that operate on MMX registers, see
Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

Table 2-14. Exception Class Description

Exception Class | Instruction Set | Mem Arg | Floating-Point Exceptions (#XM)
Type 1 | AVX, Legacy SSE | 16/32 byte explicitly aligned | No
Type 2 | AVX, Legacy SSE | 16/32 byte not explicitly aligned | Yes
Type 3 | AVX, Legacy SSE | < 16 byte | Yes
Type 4 | AVX, Legacy SSE | 16/32 byte not explicitly aligned | No
Type 5 | AVX, Legacy SSE | < 16 byte | No
Type 6 | AVX (no Legacy SSE) | Varies | (At present, none do)
Type 7 | AVX, Legacy SSE | None | No
Type 8 | AVX | None | No
Type 11 | F16C | 8 or 16 byte, not explicitly aligned, no AC# | Yes
Type 12 | AVX2 Gathers | Not explicitly aligned, no AC# | No

See Table 2-15 for lists of instructions in each exception class.

Table 2-15. Instructions in Each Exception Class


Exception Class Instruction
Type 1 (V)MOVAPD, (V)MOVAPS, (V)MOVDQA, (V)MOVNTDQ, (V)MOVNTDQA, (V)MOVNTPD, (V)MOVNTPS
Type 2 (V)ADDPD, (V)ADDPS, (V)ADDSUBPD, (V)ADDSUBPS, (V)CMPPD, (V)CMPPS, (V)CVTDQ2PS, (V)CVTPD2DQ,
(V)CVTPD2PS, (V)CVTPS2DQ, (V)CVTTPD2DQ, (V)CVTTPS2DQ, (V)DIVPD, (V)DIVPS, (V)DPPD*, (V)DPPS*,
VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS,
VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS,
VFMADDSUB231PS, VFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS,
VFMSUBADD213PS, VFMSUBADD231PS, VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS,
VFMSUB213PS, VFMSUB231PS, VFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS,
VFNMADD213PS, VFNMADD231PS, VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS,
VFNMSUB213PS, VFNMSUB231PS, (V)HADDPD, (V)HADDPS, (V)HSUBPD, (V)HSUBPS, (V)MAXPD, (V)MAXPS,
(V)MINPD, (V)MINPS, (V)MULPD, (V)MULPS, (V)ROUNDPD, (V)ROUNDPS, (V)SQRTPD, (V)SQRTPS, (V)SUBPD,
(V)SUBPS
Type 3 (V)ADDSD, (V)ADDSS, (V)CMPSD, (V)CMPSS, (V)COMISD, (V)COMISS, (V)CVTPS2PD, (V)CVTSD2SI, (V)CVTSD2SS,
(V)CVTSI2SD, (V)CVTSI2SS, (V)CVTSS2SD, (V)CVTSS2SI, (V)CVTTSD2SI, (V)CVTTSS2SI, (V)DIVSD, (V)DIVSS,
VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SS,
VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS,
VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS,
VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS,
(V)MAXSD, (V)MAXSS, (V)MINSD, (V)MINSS, (V)MULSD, (V)MULSS, (V)ROUNDSD, (V)ROUNDSS, (V)SQRTSD,
(V)SQRTSS, (V)SUBSD, (V)SUBSS, (V)UCOMISD, (V)UCOMISS
Type 4 (V)AESDEC, (V)AESDECLAST, (V)AESENC, (V)AESENCLAST, (V)AESIMC, (V)AESKEYGENASSIST, (V)ANDPD,
(V)ANDPS, (V)ANDNPD, (V)ANDNPS, (V)BLENDPD, (V)BLENDPS, VBLENDVPD, VBLENDVPS, (V)LDDQU***,
(V)MASKMOVDQU, (V)PTEST, VTESTPS, VTESTPD, (V)MOVDQU*, (V)MOVSHDUP, (V)MOVSLDUP, (V)MOVUPD*,
(V)MOVUPS*, (V)MPSADBW, (V)ORPD, (V)ORPS, (V)PABSB, (V)PABSW, (V)PABSD, (V)PACKSSWB, (V)PACKSSDW,
(V)PACKUSWB, (V)PACKUSDW, (V)PADDB, (V)PADDW, (V)PADDD, (V)PADDQ, (V)PADDSB, (V)PADDSW,
(V)PADDUSB, (V)PADDUSW, (V)PALIGNR, (V)PAND, (V)PANDN, (V)PAVGB, (V)PAVGW, (V)PBLENDVB, (V)PBLENDW,
(V)PCMP(E/I)STRI/M***, (V)PCMPEQB, (V)PCMPEQW, (V)PCMPEQD, (V)PCMPEQQ, (V)PCMPGTB, (V)PCMPGTW,
(V)PCMPGTD, (V)PCMPGTQ, (V)PCLMULQDQ, (V)PHADDW, (V)PHADDD, (V)PHADDSW, (V)PHMINPOSUW,
(V)PHSUBD, (V)PHSUBW, (V)PHSUBSW, (V)PMADDWD, (V)PMADDUBSW, (V)PMAXSB, (V)PMAXSW, (V)PMAXSD,
(V)PMAXUB, (V)PMAXUW, (V)PMAXUD, (V)PMINSB, (V)PMINSW, (V)PMINSD, (V)PMINUB, (V)PMINUW, (V)PMINUD,
(V)PMULHUW, (V)PMULHRSW, (V)PMULHW, (V)PMULLW, (V)PMULLD, (V)PMULUDQ, (V)PMULDQ, (V)POR,
(V)PSADBW, (V)PSHUFB, (V)PSHUFD, (V)PSHUFHW, (V)PSHUFLW, (V)PSIGNB, (V)PSIGNW, (V)PSIGND, (V)PSLLW,
(V)PSLLD, (V)PSLLQ, (V)PSRAW, (V)PSRAD, (V)PSRLW, (V)PSRLD, (V)PSRLQ, (V)PSUBB, (V)PSUBW, (V)PSUBD,
(V)PSUBQ, (V)PSUBSB, (V)PSUBSW, (V)PSUBUSB, (V)PSUBUSW, (V)PUNPCKHBW, (V)PUNPCKHWD,
(V)PUNPCKHDQ, (V)PUNPCKHQDQ, (V)PUNPCKLBW, (V)PUNPCKLWD, (V)PUNPCKLDQ, (V)PUNPCKLQDQ, (V)PXOR,
(V)RCPPS, (V)RSQRTPS, (V)SHUFPD, (V)SHUFPS, (V)UNPCKHPD, (V)UNPCKHPS, (V)UNPCKLPD, (V)UNPCKLPS,
(V)XORPD, (V)XORPS, VPBLENDD, VPERMD, VPERMPS, VPERMPD, VPERMQ, VPSLLVD, VPSLLVQ, VPSRAVD,
VPSRLVD, VPSRLVQ, VPERMILPD, VPERMILPS, VPERM2F128

Type 5 (V)CVTDQ2PD, (V)EXTRACTPS, (V)INSERTPS, (V)MOVD, (V)MOVQ, (V)MOVDDUP, (V)MOVLPD, (V)MOVLPS,
(V)MOVHPD, (V)MOVHPS, (V)MOVSD, (V)MOVSS, (V)PEXTRB, (V)PEXTRD, (V)PEXTRW, (V)PEXTRQ, (V)PINSRB,
(V)PINSRD, (V)PINSRW, (V)PINSRQ, PMOVSXBW, (V)RCPSS, (V)RSQRTSS, (V)PMOVSX/ZX, VLDMXCSR*,
VSTMXCSR
Type 6 VEXTRACTF128/VEXTRACTFxxxx, VBROADCASTSS, VBROADCASTSD, VBROADCASTF128, VINSERTF128,
VMASKMOVPS**, VMASKMOVPD**, VPMASKMOVD, VPMASKMOVQ, VBROADCASTI128, VPBROADCASTB,
VPBROADCASTD, VPBROADCASTW, VPBROADCASTQ, VEXTRACTI128, VINSERTI128, VPERM2I128
Type 7 (V)MOVLHPS, (V)MOVHLPS, (V)MOVMSKPD, (V)MOVMSKPS, (V)PMOVMSKB, (V)PSLLDQ, (V)PSRLDQ, (V)PSLLW,
(V)PSLLD, (V)PSLLQ, (V)PSRAW, (V)PSRAD, (V)PSRLW, (V)PSRLD, (V)PSRLQ
Type 8 VZEROALL, VZEROUPPER
Type 11 VCVTPH2PS, VCVTPS2PH
Type 12 VGATHERDPS, VGATHERDPD, VGATHERQPS, VGATHERQPD, VPGATHERDD, VPGATHERDQ, VPGATHERQD,
VPGATHERQQ

(*) - Additional exception restrictions are present - see the instruction description for details.
(**) - Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s, i.e., no
alignment checks are performed.
(***) - PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, and LDDQU instructions do not cause #GP if the memory operand is not
aligned to a 16-byte boundary.

Table 2-15 classifies exception behaviors for Intel AVX instructions. Within each class of exception conditions that
are listed in Table 2-18 through Table 2-27, certain subsets of Intel AVX instructions may be subject to a #UD
exception depending on the encoded value of the VEX.L field. Table 2-16 and Table 2-17 provide supplemental
information on Intel AVX instructions that may be subject to a #UD exception if encoded with incorrect values in the
VEX.W or VEX.L field.

Table 2-16. #UD Exception and VEX.W=1 Encoding

Exception Class | #UD If VEX.W = 1 in All Modes | #UD If VEX.W = 1 in Non-64-Bit Modes
Type 1 | |
Type 2 | |
Type 3 | |
Type 4 | VBLENDVPD, VBLENDVPS, VPBLENDVB, VTESTPD, VTESTPS, VPBLENDD, VPERMD, VPERMPS, VPERM2I128, VPSRAVD, VPERMILPD, VPERMILPS, VPERM2F128 |
Type 5 | |
Type 6 | VEXTRACTF128, VBROADCASTSS, VBROADCASTSD, VBROADCASTF128, VINSERTF128, VMASKMOVPS, VMASKMOVPD, VBROADCASTI128, VPBROADCASTB/W/D, VEXTRACTI128, VINSERTI128 |
Type 7 | |
Type 8 | |
Type 11 | VCVTPH2PS, VCVTPS2PH |
Type 12 | |


Table 2-17. #UD Exception and VEX.L Field Encoding

Exception Class | #UD If VEX.L = 0 | #UD If (VEX.L = 1 && AVX2 not present && AVX present) | #UD If (VEX.L = 1 && AVX2 present)
Type 1 | | VMOVNTDQA |
Type 2 | | VDPPD | VDPPD
Type 3 | | |
Type 4 | | VMASKMOVDQU, VMPSADBW, VPABSB/W/D, VPACKSSWB/DW, VPACKUSWB/DW, VPADDB/W/D, VPADDQ, VPADDSB/W, VPADDUSB/W, VPALIGNR, VPAND, VPANDN, VPAVGB/W, VPBLENDVB, VPBLENDW, VPCMP(E/I)STRI/M, VPCMPEQB/W/D/Q, VPCMPGTB/W/D/Q, VPHADDW/D, VPHADDSW, VPHMINPOSUW, VPHSUBD/W, VPHSUBSW, VPMADDWD, VPMADDUBSW, VPMAXSB/W/D, VPMAXUB/W/D, VPMINSB/W/D, VPMINUB/W/D, VPMULHUW, VPMULHRSW, VPMULHW/LW, VPMULLD, VPMULUDQ, VPMULDQ, VPOR, VPSADBW, VPSHUFB/D, VPSHUFHW/LW, VPSIGNB/W/D, VPSLLW/D/Q, VPSRAW/D, VPSRLW/D/Q, VPSUBB/W/D/Q, VPSUBSB/W, VPUNPCKHBW/WD/DQ, VPUNPCKHQDQ, VPUNPCKLBW/WD/DQ, VPUNPCKLQDQ, VPXOR | VPCMP(E/I)STRI/M, PHMINPOSUW
Type 5 | | VEXTRACTPS, VINSERTPS, VMOVD, VMOVQ, VMOVLPD, VMOVLPS, VMOVHPD, VMOVHPS, VPEXTRB, VPEXTRD, VPEXTRW, VPEXTRQ, VPINSRB, VPINSRD, VPINSRW, VPINSRQ, VPMOVSX/ZX, VLDMXCSR, VSTMXCSR | Same as column 3
Type 6 | VEXTRACTF128, VPERM2F128, VBROADCASTSD, VBROADCASTF128, VINSERTF128 | |
Type 7 | | VMOVLHPS, VMOVHLPS, VPMOVMSKB, VPSLLDQ, VPSRLDQ, VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD, VPSRLW, VPSRLD, VPSRLQ | VMOVLHPS, VMOVHLPS
Type 8 | | |
Type 11 | | |
Type 12 | | |


2.5.1 Exceptions Type 1 (Aligned Memory Reference)

Table 2-18. Type 1 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X | X | VEX.256: Memory operand is not 32-byte aligned. VEX.128: Memory operand is not 16-byte aligned.
| X | X | X | X | Legacy SSE: Memory operand is not 16-byte aligned.
|   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.


2.5.2 Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned)

Table 2-19. Type 2 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
| X | X | X | X | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | X | X | X | X | Legacy SSE: Memory operand is not 16-byte aligned.
|   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
SIMD Floating-point Exception, #XM | X | X | X | X | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1.


2.5.3 Exceptions Type 3 (<16 Byte Memory Argument)

Table 2-20. Type 3 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
| X | X | X | X | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
Alignment Check, #AC(0) |   | X | X | X | For 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point Exception, #XM | X | X | X | X | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1.


2.5.4 Exceptions Type 4 (>=16 Byte Mem Arg, No Alignment, No Floating-point Exceptions)

Table 2-21. Type 4 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | X | X | X | X | Legacy SSE: Memory operand is not 16-byte aligned. (Note 1)
|   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
NOTES:
1. LDDQU, MOVUPD, MOVUPS, PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM instructions do not cause #GP if the memory
operand is not aligned to a 16-byte boundary.


2.5.5 Exceptions Type 5 (<16 Byte Mem Arg and No FP Exceptions)

Table 2-22. Type 5 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
Alignment Check, #AC(0) |   | X | X | X | For 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.5.6 Exceptions Type 6 (VEX-Encoded Instructions without Legacy SSE Analogues)


Note: At present, the AVX instructions in this category do not generate floating-point exceptions.

Table 2-23. Type 6 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
|   |   | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
|   |   | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM |   |   | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
Page Fault, #PF(fault-code) |   |   | X | X | For a page fault.
Alignment Check, #AC(0) |   |   | X | X | For 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.5.7 Exceptions Type 7 (No FP Exceptions, No Memory Arg)

Table 2-24. Type 7 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM |   |   | X | X | If CR0.TS[bit 3] = 1.

2.5.8 Exceptions Type 8 (AVX and No Memory Argument)

Table 2-25. Type 8 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | Always in Real or Virtual-8086 mode.
|   |   | X | X | If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0. If CPUID.01H.ECX.AVX[bit 28] = 0. If VEX.vvvv ≠ 1111B.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
Device Not Available, #NM |   |   | X | X | If CR0.TS[bit 3] = 1.


2.5.9 Exceptions Type 11 (VEX-only, Mem Arg, No AC, Floating-point Exceptions)

Table 2-26. Type 11 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
SIMD Floating-Point Exception, #XM | X | X | X | X | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1.


2.5.10 Exceptions Type 12 (VEX-only, VSIB Mem Arg, No AC, No Floating-point Exceptions)

Table 2-27. Type 12 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X |   |   | VEX prefix.
|   |   | X | X | VEX prefix: If XCR0[2:1] ≠ ‘11b’. If CR4.OSXSAVE[bit 18] = 0.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
| X | X | X | NA | If address size attribute is 16 bit.
| X | X | X | X | If ModR/M.mod = ‘11b’.
| X | X | X | X | If ModR/M.rm ≠ ‘100b’.
| X | X | X | X | If any corresponding CPUID feature flag is ‘0’.
| X | X | X | X | If any vector register is used more than once between the destination register, mask register, and the index register in VSIB addressing.
Device Not Available, #NM | X | X | X | X | If CR0.TS[bit 3] = 1.
Stack, #SS(0) |   |   | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.

2.6 VEX ENCODING SUPPORT FOR GPR INSTRUCTIONS


The VEX prefix may be used to encode instructions that operate on neither YMM nor XMM registers. VEX-encoded
general-purpose-register instructions have the following properties:
• Instruction syntax support for three encodable operands.
• Encoding support for instruction syntax of non-destructive source operand, destination operand encoded via
VEX.vvvv, and destructive three-operand syntax.
• Elimination of escape opcode byte (0FH), two-byte escape via a compact bit field representation within the VEX
prefix.
• Elimination of the need to use REX prefix to encode the extended half of general-purpose register sets (R8-R15)
for direct register access or memory addressing.
• Flexible and more compact bit fields are provided in the VEX prefix to retain the full functionality provided by
REX prefix. REX.W, REX.X, REX.B functionalities are provided in the three-byte VEX prefix only.
• VEX-encoded GPR instructions are encoded with VEX.L=0.


Any VEX-encoded GPR instruction with a 66H, F2H, or F3H prefix preceding VEX will #UD.
Any VEX-encoded GPR instruction with a REX prefix preceding VEX will #UD.
VEX-encoded GPR instructions are not supported in real and virtual-8086 modes.
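As a usage illustration (assuming a BMI2-capable 64-bit target, e.g., compiled with -mbmi2): PEXT and PDEP are VEX-encoded GPR instructions with the non-destructive multi-operand syntax described above:

#include <immintrin.h>
#include <stdint.h>

/* PEXT: gather the bits of word selected by mask into the low bits. */
uint64_t extract_bits(uint64_t word, uint64_t mask)
{
    return _pext_u64(word, mask);
}

/* PDEP: scatter the low bits of bits into the positions set in mask. */
uint64_t deposit_bits(uint64_t bits, uint64_t mask)
{
    return _pdep_u64(bits, mask);
}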

2.6.1 Exceptions Type 13 (VEX-Encoded GPR Instructions)


The exception conditions applicable to VEX-encoded GPR instructions differ from those of legacy GPR instructions.
Table 2-28 lists VEX-encoded GPR instructions. The exception conditions for VEX-encoded GPR instructions are
found in Table 2-29 for those instructions that have a default operand size of 32 bits and for which a 16-bit operand
size is not encodable.

Table 2-28. VEX-Encoded GPR Instructions


Exception Class Instruction
Type 13 ANDN, BEXTR, BLSI, BLSMSK, BLSR, BZHI, MULX, PDEP, PEXT, RORX, SARX, SHLX, SHRX

(*) - Additional exception restrictions are present - see the Instruction description for details.

Table 2-29. Type 13 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X | X | X | If BMI1/BMI2 CPUID feature flag is ‘0’.
| X | X |   |   | If a VEX prefix is present.
| X | X | X | X | If VEX.L = 1.
| X | X | X | X | If preceded by a LOCK prefix (F0H).
|   |   | X | X | If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
Stack, #SS(0) | X | X | X |   | For an illegal address in the SS segment.
|   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   | X |   | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
|   |   | X |   | If the DS, ES, FS, or GS register is used to access memory and it contains a null segment selector.
|   |   |   | X | If the memory address is in a non-canonical form.
| X | X |   |   | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) |   | X | X | X | For a page fault.
Alignment Check, #AC(0) |   | X | X | X | For 2, 4, or 8 byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.

2.6.2 Exceptions Type 14 (CMPCCXADD)


The exception conditions applicable to the CMPCCXADD instruction differ from those of other VEX-encoded GPR
instructions. The exception conditions for the CMPCCXADD instruction are found in Table 2-31.


Table 2-30. Exceptions Type 14 Instructions


Exception Class Instruction
Type 14 CMPCCXADD

Table 2-31. Type 14 Class Exception Conditions

(Columns: R = Real, V86 = Virtual-8086, P/C = Protected and Compatibility, 64 = 64-bit; an X marks the modes in which the cause applies.)

Exception | R | V86 | P/C | 64 | Cause of Exception
Invalid Opcode, #UD | X | X | X |   | Only supported in 64-bit mode.
|   |   |   | X | If any LOCK, REX, F2, F3, or 66 prefixes precede a VEX prefix.
|   |   |   | X | If any corresponding CPUID feature flag is ‘0’.
Stack, #SS(0) |   |   |   | X | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) |   |   |   | X | If the memory address is in a non-canonical form.
Page Fault, #PF(fault-code) |   |   |   | X | If a page fault occurs.
Alignment Check, #AC(0) |   |   |   | X | If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

2.7 INTEL® AVX-512 ENCODING


The majority of the Intel AVX-512 family of instructions (operating on 512/256/128-bit vector register operands)
are encoded using a new prefix (called EVEX). Opmask instructions (operating on opmask register operands) are
encoded using the VEX prefix. The EVEX prefix has some parts resembling the instruction encoding scheme using
the VEX prefix, and many other capabilities not available with the VEX prefix.
The significant feature differences between EVEX and VEX are summarized below.
• EVEX is a 4-Byte prefix (the first byte must be 62H); VEX is either a 2-Byte (C5H is the first byte) or 3-Byte
(C4H is the first byte) prefix.
• EVEX prefix can encode 32 vector registers (XMM/YMM/ZMM) in 64-bit mode.
• EVEX prefix can encode an opmask register for conditional processing or selection control in EVEX-encoded
vector instructions. Opmask instructions, whose source/destination operands are opmask registers and treat
the content of an opmask register as a single value, are encoded using the VEX prefix.
• EVEX memory addressing with disp8 form uses a compressed disp8 encoding scheme to improve the encoding
density of the instruction byte stream.
• EVEX prefix can encode functionality that is specific to instruction classes (e.g., packed instruction with
“load+op” semantic can support embedded broadcast functionality, floating-point instruction with rounding
semantic can support static rounding functionality, floating-point instruction with non-rounding arithmetic
semantic can support “suppress all exceptions” functionality).

2.7.1 Instruction Format and EVEX


The placement of the EVEX prefix in an IA instruction is represented in Figure 2-10. Note that the values contained
within brackets are optional.


# of bytes:    4      1        1        1       2, 4 (or 1 for Disp8*N)      1
[Prefixes]   EVEX   Opcode   ModR/M   [SIB]   [Disp16,32] / [Disp8*N]   [Immediate]

Figure 2-10. Intel® AVX-512 Instruction Format and the EVEX Prefix

The EVEX prefix is a 4-byte prefix, with the first two bytes derived from unused encoding form of the 32-bit-mode-
only BOUND instruction. The layout of the EVEX prefix is shown in Figure 2-11. The first byte must be 62H, followed
by three payload bytes, denoted as P0, P1, and P2 individually or collectively as P[23:0] (see Figure 2-11).

EVEX 62H P0 P1 P2

7 6 5 4 3 2 1 0
P0 R X B R’ 0 m m m P[7:0]

7 6 5 4 3 2 1 0
P1 W v v v v 1 p p P[15:8]

7 6 5 4 3 2 1 0
P2 z L’ L b V’ a a a P[23:16]

Figure 2-11. Bit Field Layout of the EVEX Prefix1


NOTES:
1. See Table 2-32 for additional details on bit fields.


Table 2-32. EVEX Prefix Bit Field Functional Grouping


Notation Bit field Group Position Comment
EVEX.mmm Access to up to eight decoding maps P[2:0] Currently, only the following decoding maps are supported: 1,
2, 3, 5, and 6.
-- Reserved P[3] Must be 0.
EVEX.R’ High-16 register specifier modifier P[4] Combine with EVEX.R and ModR/M.reg. This bit is stored in
inverted format.
EVEX.RXB Next-8 register specifier modifier P[7:5] Combine with ModR/M.reg, ModR/M.rm (base, index/vidx). This
field is encoded in bit inverted format.
EVEX.X High-16 register specifier modifier P[6] Combine with EVEX.B and ModR/M.rm, when SIB/VSIB absent.
EVEX.pp Compressed legacy prefix P[9:8] Identical to VEX.pp.
-- Fixed Value P[10] Must be 1.
EVEX.vvvv VVVV register specifier P[14:11] Same as VEX.vvvv. This field is encoded in bit inverted format.
EVEX.W Operand size promotion/Opcode P[15]
extension
EVEX.aaa Embedded opmask register specifier P[18:16]
EVEX.V’ High-16 VVVV/VIDX register specifier P[19] Combine with EVEX.vvvv or when VSIB present. This bit is
stored in inverted format.
EVEX.b Broadcast/RC/SAE Context P[20]
EVEX.L’L Vector length/RC P[22:21]
EVEX.z Zeroing/Merging P[23]

The bit fields in P[23:0] are divided into the following functional groups (Table 2-32 provides a tabular summary):
• Reserved bits: P[3] must be 0, otherwise #UD.
• Fixed-value bit: P[10] must be 1, otherwise #UD.
• Compressed legacy prefix/escape bytes: P[1:0] is identical to the lowest 2 bits of VEX.mmmmm; P[9:8] is
identical to VEX.pp.
• EVEX.mmm: P[2:0] provides access to up to eight decoding maps. Currently, only the following decoding maps
are supported: 1, 2, 3, 5, and 6. Map ids 1, 2, and 3 are denoted by 0F, 0F38, and 0F3A, respectively, in the
instruction encoding descriptions.
• Operand specifier modifier bits for vector register, general purpose register, memory addressing: P[7:5] allows
access to the next set of 8 registers beyond the low 8 registers when combined with ModR/M register specifiers.
• Operand specifier modifier bit for vector register: P[4] (or EVEX.R’) allows access to the high 16 vector register
set when combined with P[7] and ModR/M.reg specifier; P[6] can also provide access to a high 16 vector
register when SIB or VSIB addressing are not needed.
• Non-destructive source /vector index operand specifier: P[19] and P[14:11] encode the second source vector
register operand in a non-destructive source syntax, vector index register operand can access an upper 16
vector register using P[19].
• Op-mask register specifiers: P[18:16] encodes op-mask register set k0-k7 in instructions operating on vector
registers.
• EVEX.W: P[15] is similar to VEX.W which serves either as opcode extension bit or operand size promotion to
64-bit in 64-bit mode.
• Vector destination merging/zeroing: P[23] encodes the destination result behavior, which either zeroes the
masked elements or leaves masked elements unchanged.
• Broadcast/Static-rounding/SAE context bit: P[20] encodes multiple functionality, which differs across different
classes of instructions and can affect the meaning of the remaining field (EVEX.L’L). The functionality for the
following instruction classes are:

— Broadcasting a single element across the destination vector register: this applies to the instruction class
with Load+Op semantic where one of the source operands is from memory.
— Redirecting the L’L field (P[22:21]) as static rounding control for floating-point instructions with rounding
semantic. Static rounding control overrides the MXCSR.RC field and implies “Suppress all exceptions” (SAE).
— Enabling SAE for floating-point instructions with arithmetic semantic that is not rounding.
— For instruction classes outside of the aforementioned three classes, setting EVEX.b will cause #UD.
• Vector length/rounding control specifier: P[22:21] can serve one of three options.
— Vector length information for packed vector instructions.
— Ignored for instructions operating on vector register content as a single data element.
— Rounding control for floating-point instructions that have a rounding semantic and whose source and
destination operands are all vector registers.
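The following C sketch (illustrative only; the struct and field names are local conveniences, not SDM identifiers) pulls the payload fields of Figure 2-11 and Table 2-32 out of the three payload bytes:

#include <stdint.h>

typedef struct {
    uint8_t mmm, Rp, B, X, R;    /* P0: map id and register-specifier modifiers */
    uint8_t pp, vvvv, W;         /* P1 */
    uint8_t aaa, Vp, b, LL, z;   /* P2 */
} evex_fields;

static evex_fields decode_evex_payload(uint8_t p0, uint8_t p1, uint8_t p2)
{
    evex_fields f;
    f.mmm  = p0 & 7;             /* P[2:0]   decoding map */
    f.Rp   = (p0 >> 4) & 1;      /* P[4]     EVEX.R', stored inverted */
    f.B    = (p0 >> 5) & 1;      /* P[5]     EVEX.B,  stored inverted */
    f.X    = (p0 >> 6) & 1;      /* P[6]     EVEX.X,  stored inverted */
    f.R    = (p0 >> 7) & 1;      /* P[7]     EVEX.R,  stored inverted */
    f.pp   = p1 & 3;             /* P[9:8]   compressed legacy prefix */
    f.vvvv = (p1 >> 3) & 0xF;    /* P[14:11], stored inverted */
    f.W    = (p1 >> 7) & 1;      /* P[15]    osize promotion/opcode ext. */
    f.aaa  = p2 & 7;             /* P[18:16] opmask register */
    f.Vp   = (p2 >> 3) & 1;      /* P[19]    EVEX.V', stored inverted */
    f.b    = (p2 >> 4) & 1;      /* P[20]    broadcast/RC/SAE context */
    f.LL   = (p2 >> 5) & 3;      /* P[22:21] vector length / RC */
    f.z    = (p2 >> 7) & 1;      /* P[23]    zeroing/merging */
    return f;
}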

2.7.2 Register Specifier Encoding and EVEX


EVEX-encoded instructions can access 8 opmask registers, 16 general-purpose registers, and 32 vector registers in
64-bit mode (8 general-purpose registers and 8 vector registers in non-64-bit modes). EVEX-encoding can support
instruction syntax that access up to 4 instruction operands. Normal memory addressing modes and VSIB memory
addressing are supported with EVEX prefix encoding. The mapping of register operands used by various instruction
syntax and memory addressing in 64-bit mode are shown in Table 2-33. Opmask register encoding is described in
Section 2.7.3.

Table 2-33. 32-Register Support in 64-bit Mode Using EVEX with Embedded REX Bits
Position: 4(1) | 3 | [2:0] | Reg. Type | Common Usages
REG EVEX.R’ REX.R modrm.reg GPR, Vector Destination or Source
VVVV EVEX.V’ EVEX.vvvv GPR, Vector 2nd Source or Destination
RM EVEX.X EVEX.B modrm.r/m GPR, Vector 1st Source or Destination
BASE 0 EVEX.B modrm.r/m GPR memory addressing
INDEX 0 EVEX.X sib.index GPR memory addressing
VIDX EVEX.V’ EVEX.X sib.index Vector VSIB memory addressing
NOTES:
1. Not applicable for accessing general purpose registers.

The mapping of register operands used by various instruction syntax and memory addressing in 32-bit modes are
shown in Table 2-34.

Table 2-34. EVEX Encoding Register Specifiers in 32-bit Mode


[2:0] Reg. Type Common Usages
REG modrm.reg GPR, Vector Destination or Source
VVVV EVEX.vvvv GPR, Vector 2nd Source or Destination
RM modrm.r/m GPR, Vector 1st Source or Destination
BASE modrm.r/m GPR Memory Addressing
INDEX sib.index GPR Memory Addressing
VIDX sib.index Vector VSIB Memory Addressing


2.7.3 Opmask Register Encoding


There are eight opmask registers, k0-k7. Opmask register encoding falls into two categories:
• Opmask registers that are the source or destination operands of an instruction treating the content of opmask
register as a scalar value, are encoded using the VEX prefix scheme. It can support up to three operands using
standard modR/M byte’s reg field and rm field and VEX.vvvv. Such a scalar opmask instruction does not support
conditional update of the destination operand.
• An opmask register providing conditional processing and/or conditional update of the destination register of a
vector instruction is encoded using EVEX.aaa field (see Section 2.7.4).
• An opmask register serving as the destination or source operand of a vector instruction is encoded using
standard modR/M byte’s reg field and rm fields.

Table 2-35. Opmask Register Specifier Encoding


[2:0] Register Access Common Usages
REG modrm.reg k0-k7 Source
VVVV VEX.vvvv k0-k7 2nd Source
RM modrm.r/m k0-k7 1st Source

{k1} EVEX.aaa k0(1)-k7 Opmask

NOTES:
1. Instructions that overwrite the conditional mask in opmask do not permit using k0 as the embedded mask.

2.7.4 Masking Support in EVEX


EVEX can encode an opmask register to conditionally control per-element computational operation and updating of
result of an instruction to the destination operand. The predicate operand is known as the opmask register. The
EVEX.aaa field, P[18:16] of the EVEX prefix, is used to encode one out of a set of eight 64-bit architectural regis-
ters. Note that from this set of 8 architectural registers, only k1 through k7 can be addressed as predicate oper-
ands. k0 can be used as a regular source or destination but cannot be encoded as a predicate operand.
AVX-512 instructions support two types of masking with EVEX.z bit (P[23]) controlling the type of masking:
• Merging-masking, which is the default type of masking for EVEX-encoded vector instructions, preserves the old
value of each element of the destination where the corresponding mask bit has a 0. It corresponds to the case
of EVEX.z = 0.
• Zeroing-masking, is enabled by having the EVEX.z bit set to 1. In this case, an element of the destination is set
to 0 when the corresponding mask bit has a 0 value.
AVX-512 Foundation instructions can be divided into the following groups:
• Instructions which support “zeroing-masking”.
— Also allow merging-masking.
• Instructions which require aaa = 000.
— Do not allow any form of masking.
• Instructions which allow merging-masking but do not allow zeroing-masking.
— Require EVEX.z to be set to 0.
— This group is mostly composed of instructions that write to memory.
• Instructions which require aaa ≠ 000 do not allow EVEX.z to be set to 1.
— Allow merging-masking and do not allow zeroing-masking, e.g., gather instructions.
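As a usage illustration (AVX-512F intrinsics; not SDM material), the two masking types map onto the _mask_ and _maskz_ intrinsic forms:

#include <immintrin.h>

/* Merging-masking (EVEX.z = 0): elements of src whose mask bit is 0 are preserved. */
__m512 merge_add(__m512 src, __mmask16 k, __m512 a, __m512 b)
{
    return _mm512_mask_add_ps(src, k, a, b);
}

/* Zeroing-masking (EVEX.z = 1): elements whose mask bit is 0 are set to 0. */
__m512 zero_add(__mmask16 k, __m512 a, __m512 b)
{
    return _mm512_maskz_add_ps(k, a, b);
}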


2.7.5 Compressed Displacement (disp8*N) Support in EVEX


For memory addressing using the disp8 form, EVEX-encoded instructions always use a compressed displacement
scheme: disp8 is multiplied by a scaling factor N that is determined by the vector length, the value of the EVEX.b bit
(embedded broadcast), and the input element size of the instruction. In general, the factor N corresponds to the
number of bytes characterizing the internal memory operation of the input operand (e.g., 64 when accessing a full
512-bit memory vector). The scale factor N is listed in Table 2-36 and Table 2-37 below, where EVEX-encoded
instructions are classified using the tupletype attribute. The scale factor N of each tupletype is listed based on the
vector length (VL) and other factors affecting it.
Table 2-36 covers EVEX-encoded instructions which have a load semantic in conjunction with an additional
computational or data element movement operation, operating either on the full vector or half vector (due to
conversion of numerical precision from a wider format to a narrower format). EVEX.b is supported for such
instructions for data element sizes which are either dword or qword (see Section 2.7.11).
EVEX-encoded instructions that are pure load/store, and “Load+op” instructions that operate on data element sizes
less than dword, do not support broadcasting using EVEX.b; these are listed in Table 2-37. Table 2-37 also includes
many broadcast instructions which perform broadcast using a subset of data elements without using EVEX.b, as
well as a few data element size conversion instructions. Instructions classified in Table 2-37 do not use EVEX.b;
EVEX.b must be 0, otherwise #UD will occur.
The tupletype is referenced in the instruction operand encoding table in the reference page of each instruction,
providing the cross-reference for the scaling factor N used in encoding the memory addressing operand.
Note that the disp8*N rules still apply when using 16-bit addressing.
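A worked sketch of the rule for the “Full” tupletype of Table 2-36 (an illustrative helper, not SDM material): the encoded disp8 byte is sign-extended and multiplied by N:

#include <stdint.h>

/* N for the "Full" tupletype: the whole vector when EVEX.b = 0, one
   element (4 or 8 bytes) when EVEX.b = 1 ({1tox} broadcast). */
static int64_t disp8xN_full(int8_t disp8, int vl_bytes /* 16, 32, 64 */,
                            int evex_b, int elem_bytes /* 4 or 8 */)
{
    int N = evex_b ? elem_bytes : vl_bytes;
    return (int64_t)disp8 * N;
}
/* Example: a 512-bit, non-broadcast access at [rax+64] encodes
   disp8 = 1, because 1 * 64 = 64. */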

Table 2-36. Compressed Displacement (DISP8*N) Affected by Embedded Broadcast

TupleType | EVEX.b | InputSize | EVEX.W | Broadcast | N (VL=128) | N (VL=256) | N (VL=512) | Comment
Full | 0 | 32bit | 0 | none | 16 | 32 | 64 | Load+Op (Full Vector Dword/Qword)
Full | 1 | 32bit | 0 | {1tox} | 4 | 4 | 4 |
Full | 0 | 64bit | 1 | none | 16 | 32 | 64 |
Full | 1 | 64bit | 1 | {1tox} | 8 | 8 | 8 |
Half | 0 | 32bit | 0 | none | 8 | 16 | 32 | Load+Op (Half Vector)
Half | 1 | 32bit | 0 | {1tox} | 4 | 4 | 4 |

Table 2-37. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast

TupleType | InputSize | EVEX.W | N (VL=128) | N (VL=256) | N (VL=512) | Comment
Full Mem | N/A | N/A | 16 | 32 | 64 | Load/store or subDword full vector
Tuple1 Scalar | 8bit | N/A | 1 | 1 | 1 | 1Tuple
Tuple1 Scalar | 16bit | N/A | 2 | 2 | 2 |
Tuple1 Scalar | 32bit | 0 | 4 | 4 | 4 |
Tuple1 Scalar | 64bit | 1 | 8 | 8 | 8 |
Tuple1 Fixed | 32bit | N/A | 4 | 4 | 4 | 1Tuple, memsize not affected by EVEX.W
Tuple1 Fixed | 64bit | N/A | 8 | 8 | 8 |
Tuple2 | 32bit | 0 | 8 | 8 | 8 | Broadcast (2 elements)
Tuple2 | 64bit | 1 | NA | 16 | 16 |
Tuple4 | 32bit | 0 | NA | 16 | 16 | Broadcast (4 elements)
Tuple4 | 64bit | 1 | NA | NA | 32 |
Tuple8 | 32bit | 0 | NA | NA | 32 | Broadcast (8 elements)
Half Mem | N/A | N/A | 8 | 16 | 32 | SubQword Conversion
Quarter Mem | N/A | N/A | 4 | 8 | 16 | SubDword Conversion
Eighth Mem | N/A | N/A | 2 | 4 | 8 | SubWord Conversion
Mem128 | N/A | N/A | 16 | 16 | 16 | Shift count from memory
MOVDDUP | N/A | N/A | 8 | 32 | 64 | VMOVDDUP

2.7.6 EVEX Encoding of Broadcast/Rounding/SAE Support


EVEX.b can provide three types of encoding context, depending on the instruction classes:
• Embedded broadcasting of one data element from a source memory operand to the destination for vector
instructions with “load+op” semantic.
• Static rounding control overriding MXCSR.RC for floating-point instructions with rounding semantic.
• “Suppress All exceptions” (SAE) overriding MXCSR mask control for floating-point arithmetic instructions that
do not have rounding semantic.

2.7.7 Embedded Broadcast Support in EVEX


EVEX encodes an embedded broadcast functionality that is supported on many vector instructions with 32-bit
(doubleword or single precision floating-point) and 64-bit data elements, when the source operand is from
memory. The EVEX.b (P[20]) bit is used to enable broadcast on load-op instructions. When enabled, only one element
is loaded from memory and broadcast to all other elements, instead of loading the full memory size.
The following instruction classes do not support embedded broadcasting:
• Instructions where only one scalar result is written to the vector destination.
• Instructions with explicit broadcast functionality provided by the opcode.
• Instructions whose semantic is a pure load or a pure store operation.
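As an illustration of compiler behavior (not SDM material): compilers commonly fold a set1 of a memory scalar feeding a load+op instruction into an embedded-broadcast operand, i.e., EVEX.b = 1:

#include <immintrin.h>

/* Typically compiles to: vaddps zmm0, zmm0, dword ptr [rdi]{1to16} */
__m512 add_broadcast(__m512 v, const float *p)
{
    return _mm512_add_ps(v, _mm512_set1_ps(*p));
}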

2.7.8 Static Rounding Support in EVEX


Static rounding control embedded in the EVEX encoding system applies only to register-to-register flavor of
floating-point instructions with rounding semantic at two distinct vector lengths: (i) scalar, (ii) 512-bit. In both
cases, the field EVEX.L’L expresses rounding mode control overriding MXCSR.RC if EVEX.b is set. When EVEX.b is
set, “suppress all exceptions” is implied. The processor behaves as if all MXCSR masking controls are set.

2.7.9 SAE Support in EVEX


The EVEX encoding system allows arithmetic floating-point instructions without rounding semantic to be encoded
with the SAE attribute. This capability applies to scalar and 512-bit vector lengths, register-to-register only, by
setting EVEX.b. When EVEX.b is set, “suppress all exceptions” is implied. The processor behaves as if all MXCSR
masking controls are set.
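As a usage illustration (AVX-512F intrinsics; not SDM material), the {er} intrinsic forms set EVEX.b = 1 and place the rounding mode in EVEX.L'L; _MM_FROUND_NO_EXC expresses the implied SAE:

#include <immintrin.h>

/* VADDPD zmm, zmm, zmm, {rd-sae}: round toward negative infinity,
   MXCSR.RC ignored, all floating-point exceptions suppressed. */
__m512d add_round_down(__m512d a, __m512d b)
{
    return _mm512_add_round_pd(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}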

2.7.10 Vector Length Orthogonality


The EVEX encoding scheme can support SIMD instructions operating at multiple vector lengths.
Many AVX-512 Foundation instructions operate at 512-bit vector length. The vector length of EVEX-encoded vector
instructions is generally determined using the L’L field in the EVEX prefix, except for 512-bit floating-point, reg-reg
instructions with rounding semantic. The table below shows the vector length corresponding to various values of
the L’L bits. When EVEX is used to encode scalar instructions, L’L is generally ignored.
When the EVEX.b bit is set for a register-to-register instruction with floating-point rounding semantic, the same two bits
P2[6:5] specify the rounding mode for the instruction, with implied SAE behavior. The mapping of different instruction
classes relative to the embedded broadcast/rounding/SAE control and the EVEX.L’L fields is summarized in
Table 2-38.


Table 2-38. EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions

Broadcast/Rounding/SAE Context | EVEX.b (P2[4]) | EVEX.L’L (P2[6:5]) | EVEX.RC (P2[6:5])
Reg-reg, FP instructions w/ rounding semantic or SAE | Enable static rounding control (SAE implied) | Vector length implied (512 bit or scalar) | 00b: SAE + RNE; 01b: SAE + RD; 10b: SAE + RU; 11b: SAE + RZ
Load+op instructions w/ memory source | Broadcast control | 00b: 128-bit; 01b: 256-bit; 10b: 512-bit; 11b: Reserved (#UD) | NA
Other instructions (explicit Load/Store/Broadcast/Gather/Scatter) | Must be 0 (otherwise #UD) | 00b: 128-bit; 01b: 256-bit; 10b: 512-bit; 11b: Reserved (#UD) | NA

2.7.11 #UD Equations for EVEX


Instructions encoded using EVEX can face three types of #UD conditions: state-dependent, opcode-independent,
and opcode-dependent.

2.7.11.1 State Dependent #UD


In general, an attempt to execute an instruction that requires OS support for an incremental extended state
component will #UD if the required state components were not enabled by the OS. Table 2-39 lists instruction
categories with respect to required processor state components. Attempting to execute a given category of
instructions when the enabled states are less than the required bit vector in XCR0 shown in Table 2-39 will cause #UD.

Table 2-39. OS XSAVE Enabling Requirements of Instruction Categories


Instruction Categories Vector Register State Access Required XCR0 Bit Vector [7:0]
Legacy SIMD prefix encoded instructions (e.g., SSE) XMM xxxxxx11b
VEX-encoded instructions operating on YMM YMM xxxxx111b
EVEX-encoded 128-bit instructions ZMM 111xx111b
EVEX-encoded 256-bit instructions ZMM 111xx111b
EVEX-encoded 512-bit instructions ZMM 111xx111b
VEX-encoded instructions operating on opmask k-reg 111xxx11b

2.7.11.2 Opcode Independent #UD


A number of bit fields in EVEX-encoded instructions must obey mode-specific but opcode-independent patterns,
listed in Table 2-40.

Table 2-40. Opcode Independent, State Dependent EVEX Bit Fields


Position Notation 64-bit #UD Non-64-bit #UD
P[3] -- if > 0 if > 0
P[10] -- if 0 if 0
P[2:0] EVEX.mmm if 000b, 100b, or 111b if 000b, 100b, or 111b
P[7 : 6] EVEX.RX None (valid) None (BOUND if EVEX.RX != 11b)


2.7.11.3 Opcode Dependent #UD


This section describes legal values for the rest of the EVEX bit fields. Table 2-41 lists the #UD conditions of EVEX
prefix bit fields which encodes or modifies register operands.

Table 2-41. #UD Conditions of Operand-Encoding EVEX Prefix Bit Fields


Notation Position Operand Encoding 64-bit #UD Non-64-bit #UD
EVEX.R P[7] ModRM.reg encodes k-reg If EVEX.R = 0 None (BOUND if
EVEX.RX != 11b)
ModRM.reg is opcode extension None (ignored)
ModRM.reg encodes all other registers None (valid)
EVEX.X P[6] ModRM.r/m encodes ZMM/YMM/XMM None (valid)
ModRM.r/m encodes k-reg or GPR None (ignored)
ModRM.r/m without SIB/VSIB None (ignored)
ModRM.r/m with SIB/VSIB None (valid)
EVEX.B P[5] ModRM.r/m encodes k-reg None (ignored) None (ignored)
ModRM.r/m encodes other registers None (valid)
ModRM.r/m base present None (valid)
ModRM.r/m base not present None (ignored)
EVEX.R’ P[4] ModRM.reg encodes k-reg or GPR If 0 None (ignored)
ModRM.reg is opcode extension None (ignored)
ModRM.reg encodes ZMM/YMM/XMM None (valid)
EVEX.vvvv P[14:11] vvvv encodes ZMM/YMM/XMM None (valid) None (valid)
P[14] ignored
Otherwise If != 1111b If != 1111b
EVEX.V’ P[19] Encodes ZMM/YMM/XMM None (valid) If 0
Otherwise If 0 If 0

Table 2-42 lists the #UD conditions of instruction encoding of opmask register using EVEX.aaa and EVEX.z

Table 2-42. #UD Conditions of Opmask Related Encoding Field


Notation Position Operand Encoding 64-bit #UD Non-64-bit #UD
EVEX.aaa P[18:16] Instructions do not use opmask for conditional processing1. If aaa != 000b If aaa != 000b
Opmask used as conditional processing mask and updated If aaa = 000b If aaa = 000b;
at completion2.
Opmask used as conditional processing. None (valid3) None (valid1)
EVEX.z P[23] Vector instruction using opmask as source or destination4. If EVEX.z != 0 If EVEX.z != 0
Store instructions or gather/scatter instructions. If EVEX.z != 0 If EVEX.z != 0
Instructions with EVEX.aaa = 000b. If EVEX.z != 0 If EVEX.z != 0
VEX.vvvv Varies K-regs are instruction operands not mask control. If vvvv = 0xxxb None
NOTES:
1. E.g., VPBROADCASTMxxx, VPMOVM2x, VPMOVx2M.
2. E.g., Gather/Scatter family.
3. aaa can take any value. A value of 000 indicates that there is no masking on the instruction; in this case, all elements will be pro-
cessed as if there was a mask of ‘all ones’ regardless of the actual value in K0.
4. E.g., VFPCLASSPD/PS, VCMPB/D/Q/W family, VPMOVM2x, VPMOVx2M.


Table 2-43 lists the #UD conditions of EVEX bit fields that depends on the context of EVEX.b.

Table 2-43. #UD Conditions Dependent on EVEX.b Context


Notation Position Operand Encoding 64-bit #UD Non-64-bit #UD
EVEX.L’Lb P[22 : 20] Reg-reg, FP instructions with rounding semantic. None (valid1 ) None (valid1)
Other reg-reg, FP instructions that can cause #XM. None (valid2) None (valid2)
Other reg-mem instructions in Table 2-36. None (valid3) None (valid3)
Other instruction classes4 in Table 2-37. If EVEX.b = 1 If EVEX.b = 1
NOTES:
1. L’L specifies rounding control; see Table 2-38. Supports {er} syntax.
2. L’L is ignored.
3. L’L specifies vector length; see Table 2-38. Supports embedded broadcast syntax.
4. L’L specifies either vector length or is ignored.

2.7.12 Device Not Available


EVEX-encoded instructions follow the same rules as VEX-encoded instructions for generating #NM (Device Not
Available). In particular, #NM is generated when CR0.TS[bit 3] = 1.

2.7.13 Scalar Instructions


EVEX-encoded scalar SIMD instructions can access up to 32 registers in 64-bit mode. Scalar instructions support
masking (using the least significant bit of the opmask register), but broadcasting is not supported.

2.8 EXCEPTION CLASSIFICATIONS OF EVEX-ENCODED INSTRUCTIONS


The exception behavior of EVEX-encoded instructions can be classified into the classes shown in the rest of this
section. The classification of EVEX-encoded instructions follows a framework similar to that of the AVX and AVX2
instructions using the VEX prefix. Exception types for EVEX-encoded instructions are named in the style of
“E##” or with a suffix, “E##XX”. The “##” designation generally follows that of AVX/AVX2 instructions. The
majority of EVEX-encoded instructions with “Load+op” semantics support memory fault suppression, which is
represented by E##. Instructions with “Load+op” semantics that do not support fault suppression are named
“E##NF”. A summary table of exception classes by class name is shown below.

Table 2-44. EVEX-Encoded Instruction Exception Class Summary

Exception Class | Instruction Set | Mem Arg | #XM
Type E1 | Vector Moves/Load/Stores | Explicitly aligned, w/ fault suppression | None
Type E1NF | Vector Non-temporal Stores | Explicitly aligned, no fault suppression | None
Type E2 | FP Vector Load+op | Support fault suppression | Yes
Type E2NF | FP Vector Load+op | No fault suppression | Yes
Type E3 | FP Scalar/Partial Vector, Load+Op | Support fault suppression | Yes
Type E3NF | FP Scalar/Partial Vector, Load+Op | No fault suppression | Yes
Type E4 | Integer Vector Load+op | Support fault suppression | No
Type E4NF | Integer Vector Load+op | No fault suppression | No
Type E5 | Legacy-like Promotion | Varies, support fault suppression | No
Type E5NF | Legacy-like Promotion | Varies, no fault suppression | No
Type E6 | Post-AVX Promotion | Varies, w/ fault suppression | No
Type E6NF | Post-AVX Promotion | Varies, no fault suppression | No
Type E7NM | Register-to-register op | None | None
Type E9NF | Miscellaneous 128-bit | Vector-length specific, no fault suppression | None
Type E10 | Non-XF Scalar | Vector length ignored, w/ fault suppression | None
Type E10NF | Non-XF Scalar | Vector length ignored, no fault suppression | None
Type E11 | VCVTPH2PS, VCVTPS2PH | Half vector length, w/ fault suppression | Yes
Type E12 | Gather and Scatter Family | VSIB addressing, w/ fault suppression | None
Type E12NP | Gather and Scatter Prefetch Family | VSIB addressing, w/o page fault | None
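The practical effect of memory fault suppression can be seen with a masked load. A sketch (AVX-512F assumed; elements whose mask bit is 0 are never read, so a fault they alone would have caused is suppressed):

#include <immintrin.h>

/* Reads only the first n doubles (n <= 8 assumed); masked-off elements
   do not fault even if they fall in an unmapped page. */
__m512d load_first_n(const double *p, unsigned n)
{
    __mmask8 k = (__mmask8)((1u << n) - 1);
    return _mm512_maskz_loadu_pd(k, p);
}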

Table 2-45 lists EVEX-encoded instruction mnemonics by exception class.

Table 2-45. EVEX Instructions in Each Exception Class

Exception Class | Instruction
Type E1 | VMOVAPD, VMOVAPS, VMOVDQA32, VMOVDQA64
Type E1NF | VMOVNTDQ, VMOVNTDQA, VMOVNTPD, VMOVNTPS
Type E2 | VADDPD, VADDPH, VADDPS, VCMPPD, VCMPPH, VCMPPS, VCVTDQ2PH, VCVTDQ2PS, VCVTPD2DQ, VCVTPD2PH, VCVTPD2PS, VCVTPD2QQ, VCVTPD2UQQ, VCVTPD2UDQ, VCVTPH2DQ, VCVTPH2PD, VCVTPH2QQ, VCVTPH2UDQ, VCVTPH2UQQ, VCVTPH2UW, VCVTPH2W, VCVTPS2DQ, VCVTPS2UDQ, VCVTQQ2PD, VCVTQQ2PH, VCVTQQ2PS, VCVTTPD2DQ, VCVTTPD2QQ, VCVTTPD2UDQ, VCVTTPD2UQQ, VCVTTPH2DQ, VCVTTPH2QQ, VCVTTPH2UDQ, VCVTTPH2UQQ, VCVTTPH2UW, VCVTTPH2W, VCVTTPS2DQ, VCVTTPS2UDQ, VCVTUDQ2PH, VCVTUDQ2PS, VCVTUQQ2PD, VCVTUQQ2PH, VCVTUQQ2PS, VCVTUW2PH, VCVTW2PH, VDIVPD, VDIVPH, VDIVPS, VEXP2PD, VEXP2PS, VFIXUPIMMPD, VFIXUPIMMPS, VFMADDxxxPD, VFMADDxxxPH, VFMADDxxxPS, VFMADDSUBxxxPD, VFMADDSUBxxxPH, VFMADDSUBxxxPS, VFMSUBADDxxxPD, VFMSUBADDxxxPH, VFMSUBADDxxxPS, VFMSUBxxxPD, VFMSUBxxxPH, VFMSUBxxxPS, VFNMADDxxxPD, VFNMADDxxxPH, VFNMADDxxxPS, VFNMSUBxxxPD, VFNMSUBxxxPH, VFNMSUBxxxPS, VGETEXPPD, VGETEXPPH, VGETEXPPS, VGETMANTPD, VGETMANTPH, VGETMANTPS, VGETMANTSH, VMAXPD, VMAXPH, VMAXPS, VMINPD, VMINPH, VMINPS, VMULPD, VMULPH, VMULPS, VRANGEPD, VRANGEPS, VREDUCEPD, VREDUCEPH, VREDUCEPS, VRNDSCALEPD, VRNDSCALEPH, VRNDSCALEPS, VRCP28PD, VRCP28PS, VRSQRT28PD, VRSQRT28PS, VSCALEFPD, VSCALEFPS, VSQRTPD, VSQRTPH, VSQRTPS, VSUBPD, VSUBPH, VSUBPS
Type E3 | VADDSD, VADDSH, VADDSS, VCMPSD, VCMPSH, VCMPSS, VCVTPS2QQ, VCVTPS2UQQ, VCVTPS2PD, VCVTSD2SH, VCVTSD2SS, VCVTSH2SD, VCVTSH2SS, VCVTSS2SD, VCVTSS2SH, VCVTTPS2QQ, VCVTTPS2UQQ, VDIVSD, VDIVSH, VDIVSS, VFMADDxxxSD, VFMADDxxxSH, VFMADDxxxSS, VFMSUBxxxSD, VFMSUBxxxSH, VFMSUBxxxSS, VFNMADDxxxSD, VFNMADDxxxSH, VFNMADDxxxSS, VFNMSUBxxxSD, VFNMSUBxxxSH, VFNMSUBxxxSS, VFIXUPIMMSD, VFIXUPIMMSS, VGETEXPSD, VGETEXPSH, VGETEXPSS, VGETMANTSD, VGETMANTSH, VGETMANTSS, VMAXSD, VMAXSH, VMAXSS, VMINSD, VMINSH, VMINSS, VMULSD, VMULSH, VMULSS, VRANGESD, VRANGESS, VREDUCESD, VREDUCESH, VREDUCESS, VRNDSCALESD, VRNDSCALESH, VRNDSCALESS, VSCALEFSD, VSCALEFSH, VSCALEFSS, VRCP28SD, VRCP28SS, VRSQRT28SD, VRSQRT28SS, VSQRTSD, VSQRTSH, VSQRTSS, VSUBSD, VSUBSH, VSUBSS
Type E3NF | VCOMISD, VCOMISH, VCOMISS, VCVTSD2SI, VCVTSD2USI, VCVTSH2SI, VCVTSH2USI, VCVTSI2SD, VCVTSI2SH, VCVTSI2SS, VCVTSS2SI, VCVTSS2USI, VCVTTSD2SI, VCVTTSD2USI, VCVTTSH2SI, VCVTTSH2USI, VCVTTSS2SI, VCVTTSS2USI, VCVTUSI2SD, VCVTUSI2SH, VCVTUSI2SS, VUCOMISD, VUCOMISH, VUCOMISS
Type E4 | VANDPD, VANDPS, VANDNPD, VANDNPS, VBLENDMPD, VBLENDMPS, VFCMADDCPH, VFCMULCPH, VFMADDCPH, VFMULCPH, VFPCLASSPD, VFPCLASSPH, VFPCLASSPS, VORPD, VORPS, VPABSD, VPABSQ, VPADDD, VPADDQ, VPANDD, VPANDQ, VPANDND, VPANDNQ, VPBLENDMB, VPBLENDMD, VPBLENDMQ, VPBLENDMW, VPCMPD, VPCMPEQD, VPCMPEQQ, VPCMPGTD, VPCMPGTQ, VPCMPQ, VPCMPUD, VPCMPUQ, VPLZCNTD, VPLZCNTQ, VPMADD52LUQ, VPMADD52HUQ, VPMAXSD, VPMAXSQ, VPMAXUD, VPMAXUQ, VPMINSD, VPMINSQ, VPMINUD, VPMINUQ, VPMULLD, VPMULLQ, VPMULUDQ, VPMULDQ, VPORD, VPORQ, VPROLD, VPROLQ, VPROLVD, VPROLVQ, VPRORD, VPRORQ, VPRORVD, VPRORVQ, (VPSLLD, VPSLLQ, VPSRAD, VPSRAQ, VPSRAVW, VPSRAVD, VPSRAVQ, VPSRLD, VPSRLQ) (note 1), VPSUBD, VPSUBQ, VPSUBUSB, VPSUBUSW, VPTERNLOGD, VPTERNLOGQ, VPTESTMD, VPTESTMQ, VPTESTNMD, VPTESTNMQ, VPXORD, VPXORQ, VPSLLVD, VPSLLVQ, VRCP14PD, VRCP14PS, VRCPPH, VRSQRT14PD, VRSQRT14PS, VRSQRTPH, VXORPD, VXORPS
E4.nb (note 2) | VCOMPRESSPD, VCOMPRESSPS, VEXPANDPD, VEXPANDPS, VMOVDQU8, VMOVDQU16, VMOVDQU32, VMOVDQU64, VMOVUPD, VMOVUPS, VPABSB, VPABSW, VPADDB, VPADDW, VPADDSB, VPADDSW, VPADDUSB, VPADDUSW, VPAVGB, VPAVGW, VPCMPB, VPCMPEQB, VPCMPEQW, VPCMPGTB, VPCMPGTW, VPCMPW, VPCMPUB, VPCMPUW, VPCOMPRESSD, VPCOMPRESSQ, VPEXPANDD, VPEXPANDQ, VPMAXSB, VPMAXSW, VPMAXUB, VPMAXUW, VPMINSB, VPMINSW, VPMINUB, VPMINUW, VPMULHRSW, VPMULHUW, VPMULHW, VPMULLW, VPSLLVW, VPSLLW, VPSRAW, VPSRLVW, VPSRLW, VPSUBB, VPSUBW, VPSUBSB, VPSUBSW, VPTESTMB, VPTESTMW, VPTESTNMB, VPTESTNMW
Type E4NF | VALIGND, VALIGNQ, VPACKSSDW, VPACKUSDW, VPCONFLICTD, VPCONFLICTQ, VPERMD, VPERMI2D, VPERMI2PS, VPERMI2PD, VPERMI2Q, VPERMPD, VPERMPS, VPERMQ, VPERMT2D, VPERMT2PS, VPERMT2Q, VPERMT2PD, VPERMILPD, VPERMILPS, VPMULTISHIFTQB, VPSHUFD, VPUNPCKHDQ, VPUNPCKHQDQ, VPUNPCKLDQ, VPUNPCKLQDQ, VSHUFF32X4, VSHUFF64X2, VSHUFI32X4, VSHUFI64X2, VSHUFPD, VSHUFPS, VUNPCKHPD, VUNPCKHPS, VUNPCKLPD, VUNPCKLPS
E4NF.nb (note 2) | VDBPSADBW, VPACKSSWB, VPACKUSWB, VPALIGNR, VPMADDWD, VPMADDUBSW, VMOVSHDUP, VMOVSLDUP, VPSADBW, VPSHUFB, VPSHUFHW, VPSHUFLW, VPSLLDQ, VPSRLDQ, VPSLLW, VPSRAW, VPSRLW, (VPSLLD, VPSLLQ, VPSRAD, VPSRAQ, VPSRLD, VPSRLQ) (note 3), VPUNPCKHBW, VPUNPCKHWD, VPUNPCKLBW, VPUNPCKLWD, VPERMW, VPERMI2W, VPERMT2W
Type E5 | PMOVSXBW, PMOVSXBD, PMOVSXBQ, PMOVSXWD, PMOVSXWQ, PMOVSXDQ, PMOVZXBW, PMOVZXBD, PMOVZXBQ, PMOVZXWD, PMOVZXWQ, PMOVZXDQ, VCVTDQ2PD, VCVTUDQ2PD, VMOVSH, VPMOVSXxx, VPMOVZXxx
Type E5NF | VMOVDDUP
Type E6 | VBROADCASTF32X2, VBROADCASTF32X4, VBROADCASTF64X2, VBROADCASTF32X8, VBROADCASTF64X4, VBROADCASTI32X2, VBROADCASTI32X4, VBROADCASTI64X2, VBROADCASTI32X8, VBROADCASTI64X4, VBROADCASTSD, VBROADCASTSS, VFPCLASSSD, VFPCLASSSS, VPBROADCASTB, VPBROADCASTD, VPBROADCASTW, VPBROADCASTQ, VPMOVQB, VPMOVSQB, VPMOVUSQB, VPMOVQW, VPMOVSQW, VPMOVUSQW, VPMOVQD, VPMOVSQD, VPMOVUSQD, VPMOVDB, VPMOVSDB, VPMOVUSDB, VPMOVDW, VPMOVSDW, VPMOVUSDW, VPMOVWB, VPMOVSWB, VPMOVUSWB
Type E6NF | VEXTRACTF32X4, VEXTRACTF32X8, VEXTRACTF64X2, VEXTRACTF64X4, VEXTRACTI32X4, VEXTRACTI32X8, VEXTRACTI64X2, VEXTRACTI64X4, VINSERTF32X4, VINSERTF32X8, VINSERTF64X2, VINSERTF64X4, VINSERTI32X4, VINSERTI32X8, VINSERTI64X2, VINSERTI64X4, VPBROADCASTMB2Q, VPBROADCASTMW2D
Type E7NM.128 (note 4) | VMOVHLPS, VMOVLHPS
Type E7NM | (VPBROADCASTD, VPBROADCASTQ, VPBROADCASTB, VPBROADCASTW) (note 5), VPMOVB2M, VPMOVD2M, VPMOVM2B, VPMOVM2D, VPMOVM2Q, VPMOVM2W, VPMOVQ2M, VPMOVW2M
Type E9NF | VEXTRACTPS, VINSERTPS, VMOVHPD, VMOVHPS, VMOVLPD, VMOVLPS, VMOVD, VMOVQ, VMOVW, VPEXTRB, VPEXTRD, VPEXTRW, VPEXTRQ, VPINSRB, VPINSRD, VPINSRW, VPINSRQ
Type E10 | VFCMADDCSH, VFMADDCSH, VFCMULCSH, VFMULCSH, VFPCLASSSH, VMOVSD, VMOVSS, VRCP14SD, VRCP14SS, VRCPSH, VRSQRT14SD, VRSQRT14SS, VRSQRTSH
Type E10NF | (VCVTSI2SD, VCVTUSI2SD) (note 6)
Type E11 | VCVTPH2PS, VCVTPS2PH
Type E12 | VGATHERDPS, VGATHERDPD, VGATHERQPS, VGATHERQPD, VPGATHERDD, VPGATHERDQ, VPGATHERQD, VPGATHERQQ, VPSCATTERDD, VPSCATTERDQ, VPSCATTERQD, VPSCATTERQQ, VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS
Type E12NP | VGATHERPF0DPD, VGATHERPF0DPS, VGATHERPF0QPD, VGATHERPF0QPS, VGATHERPF1DPD, VGATHERPF1DPS, VGATHERPF1QPD, VGATHERPF1QPS, VSCATTERPF0DPD, VSCATTERPF0DPS, VSCATTERPF0QPD, VSCATTERPF0QPS, VSCATTERPF1DPD, VSCATTERPF1DPS, VSCATTERPF1QPD, VSCATTERPF1QPS

NOTES:
1. Operand encoding Full tupletype with immediate.
2. Embedded broadcast is not supported with the “.nb” suffix.
3. Operand encoding Mem128 tupletype.
4. #UD raised if EVEX.L’L != 00b (VL=128).
5. The source operand is a general purpose register.
6. W0 encoding only.


2.8.1 Exceptions Type E1 and E1NF of EVEX-Encoded Instructions


EVEX-encoded instructions that have memory alignment restrictions and support memory fault suppression follow
exception class E1.

Table 2-46. Type E1 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility, 64-bit | EVEX.512: memory operand is not 64-byte aligned. EVEX.256: memory operand is not 32-byte aligned. EVEX.128: memory operand is not 16-byte aligned.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.


EVEX-encoded instructions that have memory alignment restrictions but do not support memory fault suppression
follow exception class E1NF.

Table 2-47. Type E1NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility, 64-bit | EVEX.512: memory operand is not 64-byte aligned. EVEX.256: memory operand is not 32-byte aligned. EVEX.128: memory operand is not 16-byte aligned.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.


2.8.2 Exceptions Type E2 of EVEX-Encoded Instructions


EVEX-encoded vector instructions with arithmetic semantics follow exception class E2.

Table 2-48. Type E2 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | All modes | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point Exception, #XM | All modes | If an unmasked SIMD floating-point exception, {sae} or {er} not set, and CR4.OSXMMEXCPT[bit 10] = 1.


2.8.3 Exceptions Type E3 and E3NF of EVEX-Encoded Instructions


EVEX-encoded scalar instructions with arithmetic semantics that support memory fault suppression follow exception
class E3.

Table 2-49. Type E3 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | All modes | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point Exception, #XM | All modes | If an unmasked SIMD floating-point exception, {sae} or {er} not set, and CR4.OSXMMEXCPT[bit 10] = 1.


EVEX-encoded scalar instructions with arithmetic semantics that do not support memory fault suppression follow
exception class E3NF.

Table 2-50. Type E3NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | All modes | If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point Exception, #XM | All modes | If an unmasked SIMD floating-point exception, {sae} or {er} not set, and CR4.OSXMMEXCPT[bit 10] = 1.


2.8.4 Exceptions Type E4 and E4NF of EVEX-Encoded Instructions


EVEX-encoded vector instructions that cause no SIMD FP exception and support memory fault suppression follow
exception class E4.

Table 2-51. Type E4 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43 and in the E4.nb subclass (see E4.nb entries in Table 2-45); instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


EVEX-encoded vector instructions that neither cause a SIMD FP exception nor support memory fault suppression
follow exception class E4NF.

Table 2-52. Type E4NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43 and in the E4NF.nb subclass (see E4NF.nb entries in Table 2-45); instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.


2.8.5 Exceptions Type E5 and E5NF


EVEX-encoded scalar/partial-vector instructions that cause no SIMD FP exception and support memory fault
suppression follow exception class E5.

Table 2-53. Type E5 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.

EVEX-encoded scalar/partial-vector instructions that neither cause a SIMD FP exception nor support memory fault
suppression follow exception class E5NF.


Table 2-54. Type E5NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.8.6 Exceptions Type E6 and E6NF

Table 2-55. Type E6 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | Protected/Compatibility, 64-bit | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
Page Fault, #PF(fault-code) | Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


EVEX-encoded instructions that neither cause a SIMD FP exception nor support memory fault suppression follow
exception class E6NF.

Table 2-56. Type E6NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | Protected/Compatibility, 64-bit | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
Page Fault, #PF(fault-code) | Protected/Compatibility, 64-bit | For a page fault.
Alignment Check, #AC(0) | Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.8.7 Exceptions Type E7NM


EVEX-encoded instructions that cause no SIMD FP exception and do not reference memory follow exception class
E7NM.

Table 2-57. Type E7NM Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | Protected/Compatibility, 64-bit | If CR0.TS[bit 3] = 1.


2.8.8 Exceptions Type E9 and E9NF


EVEX-encoded vector or partial-vector instructions that do not cause a SIMD FP exception and that support memory
fault suppression follow exception class E9.

Table 2-58. Type E9 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


EVEX-encoded vector or partial-vector instructions that must be encoded with EVEX.L’L = 0 and that neither cause
a SIMD FP exception nor support memory fault suppression follow exception class E9NF.

Table 2-59. Type E9NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.8.9 Exceptions Type E10 and E10NF


EVEX-encoded scalar instructions that ignore EVEX.L’L vector length encoding, do not cause a SIMD FP exception,
and support memory fault suppression follow exception class E10.

Table 2-60. Type E10 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


EVEX-encoded scalar instructions that ignore EVEX.L’L vector length encoding, do not cause a SIMD FP exception,
and do not support memory fault suppression follow exception class E10NF.

Table 2-61. Type E10NF Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.8.10 Exceptions Type E11 (EVEX-only, Mem Arg, No AC, Floating-point Exceptions)
EVEX-encoded instructions that can cause a SIMD FP exception and whose memory operand supports fault
suppression, but that do not cause #AC, follow exception class E11.

Table 2-62. Type E11 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | If fault suppression not set, and an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If fault suppression not set, and a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | If fault suppression not set, and an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If fault suppression not set, and the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If fault suppression not set, and any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | If fault suppression not set, and a page fault.
SIMD Floating-Point Exception, #XM | All modes | If an unmasked SIMD floating-point exception, {sae} not set, and CR4.OSXMMEXCPT[bit 10] = 1.


2.8.11 Exceptions Type E12 and E12NP (VSIB Mem Arg, No AC, No Floating-point Exceptions)

Table 2-63. Type E12 Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met; vvvv != 1111b.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | Real, Virtual-8086, Protected/Compatibility (N/A in 64-bit) | If address size attribute is 16 bit.
Invalid Opcode, #UD | All modes | If ModR/M.mod = ‘11b’.
Invalid Opcode, #UD | All modes | If ModR/M.rm != ‘100b’.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Invalid Opcode, #UD | All modes | If k0 is used (gather or scatter operation).
Invalid Opcode, #UD | All modes | If index = destination register (gather operation).
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.
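To illustrate the VSIB constraints above, a gather through intrinsics; a sketch (AVX-512F assumed; the mask is a true k-register operand, and k0 cannot be encoded as the mask — with intrinsics the compiler also keeps the index register distinct from the destination):

#include <immintrin.h>

/* Gathers eight doubles from base[idx[i]] under mask k; masked-off
   elements are taken from the zeroed src operand. */
__m512d gather8(const double *base, __m256i idx, __mmask8 k)
{
    return _mm512_mask_i32gather_pd(_mm512_setzero_pd(), k, idx, base, 8);
}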


EVEX-encoded prefetch instructions that do not cause #PF follow exception class E12NP.

Table 2-64. Type E12NP Class Exception Conditions

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | Real, Virtual-8086 | If EVEX prefix present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41; opmask-encoding #UD condition of Table 2-42; EVEX.b-encoding #UD condition of Table 2-43; instruction-specific EVEX.L’L restriction not met.
Invalid Opcode, #UD | All modes | If preceded by a LOCK prefix (F0H).
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the EVEX prefix.
Invalid Opcode, #UD | Real, Virtual-8086, Protected/Compatibility (N/A in 64-bit) | If address size attribute is 16 bit.
Invalid Opcode, #UD | All modes | If ModR/M.mod = ‘11b’.
Invalid Opcode, #UD | All modes | If ModR/M.rm != ‘100b’.
Invalid Opcode, #UD | All modes | If any corresponding CPUID feature flag is ‘0’.
Invalid Opcode, #UD | All modes | If k0 is used (gather or scatter operation).
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.


2.9 EXCEPTION CLASSIFICATIONS OF OPMASK INSTRUCTIONS, TYPE K20 AND TYPE K21

The exception behavior of VEX-encoded opmask instructions is listed below.

2.9.1 Exceptions Type K20


Exception conditions of opmask instructions that do not address memory are listed as Type K20.

Table 2-65. TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg)

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | All modes | If relevant CPUID feature flag is ‘0’.
Invalid Opcode, #UD | Real, Virtual-8086 | If a VEX prefix is present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the VEX prefix.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If ModRM:[7:6] != 11b.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
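As a usage sketch for these register-only opmask forms (AVX-512F assumed; _mm512_kand and _mm512_knot compile to KANDW and KNOTW, which have no memory operand and therefore fall under Type K20):

#include <immintrin.h>

/* Pure k-register arithmetic: a AND (NOT b). */
__mmask16 mask_andnot(__mmask16 a, __mmask16 b)
{
    return _mm512_kand(a, _mm512_knot(b));
}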


2.9.2 Exceptions Type K21


Exception conditions of opmask instructions that address memory are listed as Type K21.

Table 2-66. TYPE K21 Exception Definition (VEX-Encoded OpMask Instructions Addressing Memory)

Exception | Modes | Cause of Exception
Invalid Opcode, #UD | All modes | If relevant CPUID feature flag is ‘0’.
Invalid Opcode, #UD | Real, Virtual-8086 | If a VEX prefix is present.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If CR4.OSXSAVE[bit 18] = 0, or if any one of the following conditions applies: state requirement of Table 2-39 not met; opcode-independent #UD condition in Table 2-40; operand-encoding #UD conditions in Table 2-41.
Invalid Opcode, #UD | Protected/Compatibility, 64-bit | If any REX, F2, F3, or 66 prefix precedes the VEX prefix.
Device Not Available, #NM | All modes | If CR0.TS[bit 3] = 1.
Stack, #SS(0) | Real, Virtual-8086, Protected/Compatibility | For an illegal address in the SS segment.
Stack, #SS(0) | 64-bit | If a memory address referencing the SS segment is in a non-canonical form.
General Protection, #GP(0) | Protected/Compatibility | For an illegal memory operand effective address in the CS, DS, ES, FS, or GS segments.
General Protection, #GP(0) | Protected/Compatibility | If the DS, ES, FS, or GS register is used to access memory and it contains a null segment selector.
General Protection, #GP(0) | 64-bit | If the memory address is in a non-canonical form.
General Protection, #GP(0) | Real, Virtual-8086 | If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault, #PF(fault-code) | Virtual-8086, Protected/Compatibility, 64-bit | For a page fault.
Alignment Check, #AC(0) | Virtual-8086, Protected/Compatibility, 64-bit | For 2-, 4-, or 8-byte memory access if alignment checking is enabled and an unaligned memory access is made while the current privilege level is 3.


2.10 INTEL® AMX INSTRUCTION EXCEPTION CLASSES


Alignment exceptions: The Intel AMX instructions that access memory will never generate #AC exceptions.

Table 2-67. Intel® AMX Exception Classes

Class AMX-E1:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if VVVV ≠ 0b1111.
• #GP based on palette and configuration checks (see pseudocode).
• #GP if the memory address is in a non-canonical form.
• #SS(0) if the memory address referencing the SS segment is in a non-canonical form.
• #PF if a page fault occurs.

Class AMX-E2:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if VVVV ≠ 0b1111.
• #GP if the memory address is in a non-canonical form.
• #SS(0) if the memory address referencing the SS segment is in a non-canonical form.
• #PF if a page fault occurs.

Class AMX-E3:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if VVVV ≠ 0b1111.
• #UD if not using SIB addressing.
• #UD if TILES_CONFIGURED == 0.
• #UD if tsrc or tdest are not valid tiles.
• #UD if tsrc/tdest are ≥ palette_table[tilecfg.palette_id].max_names.
• #UD if tsrc.colbytes mod 4 ≠ 0 OR tdest.colbytes mod 4 ≠ 0.
• #UD if tilecfg.start_row ≥ tsrc.rows OR tilecfg.start_row ≥ tdest.rows.
• #GP if the memory address is in a non-canonical form.
• #SS(0) if the memory address referencing the SS segment is in a non-canonical form.
• #PF if any memory operand causes a page fault.
• #NM if XFD[18] == 1.

Class AMX-E4:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if srcdest == src1 OR src1 == src2 OR srcdest == src2.
• #UD if TILES_CONFIGURED == 0.
• #UD if srcdest.colbytes mod 4 ≠ 0.
• #UD if src1.colbytes mod 4 ≠ 0.
• #UD if src2.colbytes mod 4 ≠ 0.
• #UD if srcdest/src1/src2 are not valid tiles.
• #UD if srcdest/src1/src2 are ≥ palette_table[tilecfg.palette_id].max_names.
• #UD if srcdest.colbytes ≠ src2.colbytes.
• #UD if srcdest.rows ≠ src1.rows.
• #UD if src1.colbytes / 4 ≠ src2.rows.
• #UD if srcdest.colbytes > tmul_maxn.
• #UD if src2.colbytes > tmul_maxn.
• #UD if src1.colbytes/4 > tmul_maxk.
• #UD if src2.rows > tmul_maxk.
• #NM if XFD[18] == 1.

Class AMX-E5:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if VVVV ≠ 0b1111.
• #UD if TILES_CONFIGURED == 0.
• #UD if tdest is not a valid tile.
• #UD if tdest is ≥ palette_table[tilecfg.palette_id].max_names.
• #NM if XFD[18] == 1.

Class AMX-E6:
• #UD if preceded by LOCK, 66H, F2H, F3H or REX prefixes.
• #UD if CR4.OSXSAVE ≠ 1.
• #UD if XCR0[18:17] ≠ 0b11.
• #UD if IA32_EFER.LMA ≠ 1 OR CS.L ≠ 1.
• #UD if VVVV ≠ 0b1111.
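Before any Intel AMX instruction can execute without faulting, XCR0[18:17] must be 0b11 and, on Linux, user space must first ask the kernel for tile-data permission. A minimal sketch (ARCH_REQ_XCOMP_PERM = 0x1023 and feature number 18 are assumed Linux ABI values; verify them against your kernel headers):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023  /* arch_prctl: request dynamic XSTATE permission */
#define XFEATURE_XTILEDATA  18      /* AMX tile data state component */

int request_amx(void)
{
    /* On success the OS will manage XCR0[18:17] for this process;
       otherwise AMX instructions raise #UD (or #NM via XFD). */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("arch_prctl(ARCH_REQ_XCOMP_PERM)");
        return -1;
    }
    return 0;
}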

6. Updates to Chapter 3, Volume 2A
Change bars and violet text show changes to Chapter 3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A: Instruction Set Reference, A-L.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the BSF and BSR instructions to indicate that BSF and BSR leave the destination operand unmodified
if the source operand is zero. Added a footnote to confirm that using a 32-bit operand size on some older
processors may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.
• Added the CMPccXADD instruction.
• Updated the CPUID instruction to add enumeration for the AVX10, AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-INT8,
AVX-VNNI-INT16, CMPCCXADD, LAM, LASS, MSRLIST, PREFETCHI, WRMSRNS, SHA512, SM3, SM4, UC-lock disable,
and UIRET_UIF features. Added CPUID Leaf 24H. Corrected typos as needed.
• Added Intel® AVX10.1 information to the following instructions:
— ADDPD
— ADDPS
— ADDSD
— ADDSS
— AESDEC
— AESDECLAST
— AESENC
— AESENCLAST
— ANDNPD
— ANDNPS
— ANDPD
— ANDPS
— CMPPD
— CMPPS
— CMPSD
— CMPSS
— COMISD
— COMISS
— CVTDQ2PD
— CVTDQ2PS
— CVTPD2DQ
— CVTPD2PS
— CVTPS2DQ
— CVTPS2PD
— CVTSD2SI
— CVTSD2SS
— CVTSI2SD
— CVTSI2SS
— CVTSS2SD
— CVTSS2SI
— CVTTPD2DQ
— CVTTPS2DQ
— CVTTSD2SI



— CVTTSS2SI
— DIVPD
— DIVPS
— DIVSD
— DIVSS
— EXTRACTPS
— GF2P8AFFINEINVQB
— GF2P8AFFINEQB
— GF2P8MULB
— INSERTPS
— KADDW/KADDB/KADDQ/KADDD
— KANDW/KANDB/KANDQ/KANDD
— KANDNW/KANDNB/KANDNQ/KANDND
— KMOVW/KMOVB/KMOVQ/KMOVD
— KNOTW/KNOTB/KNOTQ/KNOTD
— KORW/KORB/KORQ/KORD
— KORTESTW/KORTESTB/KORTESTQ/KORTESTD
— KSHIFTLW/KSHIFTLB/KSHIFTLQ/KSHIFTLD
— KSHIFTRW/KSHIFTRB/KSHIFTRQ/KSHIFTRD
— KTESTW/KTESTB/KTESTQ/KTESTD
— KUNPCKBW/KUNPCKWD/KUNPCKDQ
— KXNORW/KXNORB/KXNORQ/KXNORD
— KXORW/KXORB/KXORQ/KXORD



ADDPD—Add Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 58 /r ADDPD xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed double precision floating-point values from xmm2/mem to xmm1 and store result in xmm1.
VEX.128.66.0F.WIG 58 /r VADDPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed double precision floating-point values from xmm3/mem to xmm2 and store result in xmm1.
VEX.256.66.0F.WIG 58 /r VADDPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Add packed double precision floating-point values from ymm3/mem to ymm2 and store result in ymm1.
EVEX.128.66.0F.W1 58 /r VADDPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Add packed double precision floating-point values from xmm3/m128/m64bcst to xmm2 and store result in xmm1 with writemask k1.
EVEX.256.66.0F.W1 58 /r VADDPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Add packed double precision floating-point values from ymm3/m256/m64bcst to ymm2 and store result in ymm1 with writemask k1.
EVEX.512.66.0F.W1 58 /r VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst {er} | C | V/V | AVX512F OR AVX10.1 (note 1) | Add packed double precision floating-point values from zmm3/m512/m64bcst to zmm2 and store result in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
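A run-time check of that leaf might look like the following sketch (GCC/Clang __get_cpuid_count assumed; the bit positions — AVX10 enumerated in CPUID.(EAX=07H,ECX=01H):EDX[19], version in CPUID.(EAX=24H,ECX=0):EBX[7:0], and 512-bit vector support in EBX[18] — should be verified against the current CPUID documentation):

#include <cpuid.h>
#include <stdbool.h>

/* Returns true if AVX10 with 512-bit vector support is enumerated. */
static bool has_avx10_512(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 19)))
        return false;                      /* AVX10 not enumerated */
    if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx))
        return false;                      /* Leaf 24H not implemented */
    return (ebx & 0xffu) >= 1 && (ebx & (1u << 18)) != 0;
}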

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Adds two, four or eight packed double precision floating-point values from the first source operand to the second
source operand, and stores the packed double precision floating-point result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The desti-
nation is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.



Operation
VADDPD (EVEX Encoded Versions) When SRC2 Operand is a Vector Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC1[i+63:i] + SRC2[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VADDPD (EVEX Encoded Versions) When SRC2 Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] := SRC1[i+63:i] + SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] + SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VADDPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[127:64] := SRC1[127:64] + SRC2[127:64]
DEST[191:128] := SRC1[191:128] + SRC2[191:128]
DEST[255:192] := SRC1[255:192] + SRC2[255:192]
DEST[MAXVL-1:256] := 0



VADDPD (VEX.128 Encoded Version)
DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[127:64] := SRC1[127:64] + SRC2[127:64]
DEST[MAXVL-1:128] := 0

ADDPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] + SRC[63:0]
DEST[127:64] := DEST[127:64] + SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VADDPD __m512d _mm512_add_pd (__m512d a, __m512d b);
VADDPD __m512d _mm512_mask_add_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
VADDPD __m512d _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b);
VADDPD __m256d _mm256_mask_add_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VADDPD __m256d _mm256_maskz_add_pd (__mmask8 k, __m256d a, __m256d b);
VADDPD __m128d _mm_mask_add_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VADDPD __m128d _mm_maskz_add_pd (__mmask8 k, __m128d a, __m128d b);
VADDPD __m512d _mm512_add_round_pd (__m512d a, __m512d b, int);
VADDPD __m512d _mm512_mask_add_round_pd (__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VADDPD __m512d _mm512_maskz_add_round_pd (__mmask8 k, __m512d a, __m512d b, int);
ADDPD __m256d _mm256_add_pd (__m256d a, __m256d b);
ADDPD __m128d _mm_add_pd (__m128d a, __m128d b);
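For instance, a masked use of the EVEX form through the intrinsics above (a sketch; with {z} semantics, masked-off elements of the result are zeroed):

#include <immintrin.h>

/* dst[i] = a[i] + c for elements selected by k; other elements are 0. */
__m512d add_const_masked(__m512d a, double c, __mmask8 k)
{
    return _mm512_maskz_add_pd(k, a, _mm512_set1_pd(c));
}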

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



ADDPS—Add Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 58 /r ADDPS xmm1, xmm2/m128 | A | V/V | SSE | Add packed single precision floating-point values from xmm2/m128 to xmm1 and store result in xmm1.
VEX.128.0F.WIG 58 /r VADDPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed single precision floating-point values from xmm3/m128 to xmm2 and store result in xmm1.
VEX.256.0F.WIG 58 /r VADDPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Add packed single precision floating-point values from ymm3/m256 to ymm2 and store result in ymm1.
EVEX.128.0F.W0 58 /r VADDPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Add packed single precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1 with writemask k1.
EVEX.256.0F.W0 58 /r VADDPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Add packed single precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1 with writemask k1.
EVEX.512.0F.W0 58 /r VADDPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er} | C | V/V | AVX512F OR AVX10.1 (note 1) | Add packed single precision floating-point values from zmm3/m512/m32bcst to zmm2 and store result in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Adds four, eight or sixteen packed single precision floating-point values from the first source operand with the
second source operand, and stores the packed single precision floating-point result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: the first source operand is a XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.



Operation
VADDPS (EVEX Encoded Versions) When SRC2 Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC1[i+31:i] + SRC2[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VADDPS (EVEX Encoded Versions) When SRC2 Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] := SRC1[i+31:i] + SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] + SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0



VADDPS (VEX.256 Encoded Version)
DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[63:32] := SRC1[63:32] + SRC2[63:32]
DEST[95:64] := SRC1[95:64] + SRC2[95:64]
DEST[127:96] := SRC1[127:96] + SRC2[127:96]
DEST[159:128] := SRC1[159:128] + SRC2[159:128]
DEST[191:160]:= SRC1[191:160] + SRC2[191:160]
DEST[223:192] := SRC1[223:192] + SRC2[223:192]
DEST[255:224] := SRC1[255:224] + SRC2[255:224]
DEST[MAXVL-1:256] := 0

VADDPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[63:32] := SRC1[63:32] + SRC2[63:32]
DEST[95:64] := SRC1[95:64] + SRC2[95:64]
DEST[127:96] := SRC1[127:96] + SRC2[127:96]
DEST[MAXVL-1:128] := 0

ADDPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[63:32] := SRC1[63:32] + SRC2[63:32]
DEST[95:64] := SRC1[95:64] + SRC2[95:64]
DEST[127:96] := SRC1[127:96] + SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VADDPS __m512 _mm512_add_ps (__m512 a, __m512 b);
VADDPS __m512 _mm512_mask_add_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VADDPS __m512 _mm512_maskz_add_ps (__mmask16 k, __m512 a, __m512 b);
VADDPS __m256 _mm256_mask_add_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VADDPS __m256 _mm256_maskz_add_ps (__mmask8 k, __m256 a, __m256 b);
VADDPS __m128 _mm_mask_add_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VADDPS __m128 _mm_maskz_add_ps (__mmask8 k, __m128 a, __m128 b);
VADDPS __m512 _mm512_add_round_ps (__m512 a, __m512 b, int);
VADDPS __m512 _mm512_mask_add_round_ps (__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VADDPS __m512 _mm512_maskz_add_round_ps (__mmask16 k, __m512 a, __m512 b, int);
ADDPS __m256 _mm256_add_ps (__m256 a, __m256 b);
ADDPS __m128 _mm_add_ps (__m128 a, __m128 b);
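
For illustration only, a hedged C sketch of an array sum built on the intrinsics above; the function and array names are illustrative, and an AVX512F/AVX10 target with <immintrin.h> is assumed. The tail iteration uses a writemask so the masked VADDPS form touches no out-of-bounds lanes:

#include <immintrin.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i], sixteen floats per iteration. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));   /* VADDPS */
    }
    if (i < n) {
        __mmask16 k = (__mmask16)((1u << (n - i)) - 1);      /* low n-i lanes */
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(dst + i, k, _mm512_maskz_add_ps(k, va, vb));
    }
}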

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



ADDSD—Add Scalar Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 58 /r A V/V SSE2 Add the low double precision floating-point value from
ADDSD xmm1, xmm2/m64 xmm2/mem to xmm1 and store the result in xmm1.
VEX.LIG.F2.0F.WIG 58 /r B V/V AVX Add the low double precision floating-point value from
VADDSD xmm1, xmm2, xmm3/mem to xmm2 and store the result in xmm1.
xmm3/m64
EVEX.LLIG.F2.0F.W1 58 /r C V/V AVX512F Add the low double precision floating-point value from
VADDSD xmm1 {k1}{z}, xmm2, OR AVX10.11 xmm3/m64 to xmm2 and store the result in xmm1 with
xmm3/m64{er} writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Adds the low double precision floating-point values from the second source operand and the first source operand
and stores the double precision floating-point result in the destination operand.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
EVEX and VEX.128 encoded version: The first source operand is encoded by EVEX.vvvv/VEX.vvvv. Bits (127:64) of
the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low quadword element of the destination is updated according to the writemask.
Software should ensure VADDSD is encoded with VEX.L=0. Encoding VADDSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.



Operation
VADDSD (EVEX Encoded Version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC1[63:0] + SRC2[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VADDSD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

ADDSD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] + SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VADDSD __m128d _mm_mask_add_sd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VADDSD __m128d _mm_maskz_add_sd (__mmask8 k, __m128d a, __m128d b);
VADDSD __m128d _mm_add_round_sd (__m128d a, __m128d b, int);
VADDSD __m128d _mm_mask_add_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VADDSD __m128d _mm_maskz_add_round_sd (__mmask8 k, __m128d a, __m128d b, int);
ADDSD __m128d _mm_add_sd (__m128d a, __m128d b);
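
For illustration only, a brief C sketch: _mm_add_sd maps to (V)ADDSD, and with the VEX/EVEX encodings bits 127:64 of the destination come from the first source operand. The second function assumes an AVX512F (or AVX10) target:

#include <immintrin.h>

/* dst[63:0] = a[63:0] + b[63:0]; dst[127:64] is copied from a. */
__m128d add_low(__m128d a, __m128d b)
{
    return _mm_add_sd(a, b);
}

#ifdef __AVX512F__
/* EVEX {er} form: round toward -inf for this instruction only,
   independent of MXCSR.RC. */
__m128d add_low_rd(__m128d a, __m128d b)
{
    return _mm_add_round_sd(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}
#endif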

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”



ADDSS—Add Scalar Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 58 /r A V/V SSE Add the low single precision floating-point value from
ADDSS xmm1, xmm2/m32 xmm2/mem to xmm1 and store the result in xmm1.
VEX.LIG.F3.0F.WIG 58 /r B V/V AVX Add the low single precision floating-point value from
VADDSS xmm1,xmm2, xmm3/mem to xmm2 and store the result in xmm1.
xmm3/m32
EVEX.LLIG.F3.0F.W0 58 /r C V/V AVX512F Add the low single precision floating-point value from
VADDSS xmm1{k1}{z}, xmm2, OR AVX10.11 xmm3/m32 to xmm2 and store the result in xmm1 with
xmm3/m32{er} writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Adds the low single precision floating-point values from the second source operand and the first source operand,
and stores the single precision floating-point result in the destination operand.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:32) of the
corresponding destination register remain unchanged.
EVEX and VEX.128 encoded version: The first source operand is encoded by EVEX.vvvv/VEX.vvvv. Bits (127:32) of
the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low doubleword element of the destination is updated according to the writemask.
Software should ensure VADDSS is encoded with VEX.L=0. Encoding VADDSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
VADDSS (EVEX Encoded Versions)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC1[31:0] + SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VADDSS DEST, SRC1, SRC2 (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

ADDSS DEST, SRC (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] + SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VADDSS __m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b);
VADDSS __m128 _mm_maskz_add_ss (__mmask8 k, __m128 a, __m128 b);
VADDSS __m128 _mm_add_round_ss (__m128 a, __m128 b, int);
VADDSS __m128 _mm_mask_add_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VADDSS __m128 _mm_maskz_add_round_ss (__mmask8 k, __m128 a, __m128 b, int);
ADDSS __m128 _mm_add_ss (__m128 a, __m128 b);
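
For illustration only, a small C sketch of the writemask semantics described above, assuming an AVX512F (or AVX10) target; when bit 0 of k is clear, the low element is taken from s (merge-masking) or zeroed (zeroing-masking):

#include <immintrin.h>

/* Merge-masking: dst[31:0] = k[0] ? a[31:0]+b[31:0] : s[31:0];
   dst[127:32] is copied from a. */
__m128 add_ss_merge(__m128 s, __mmask8 k, __m128 a, __m128 b)
{
    return _mm_mask_add_ss(s, k, a, b);
}

/* Zeroing-masking: the low element becomes 0.0f when k[0] is clear. */
__m128 add_ss_zero(__mmask8 k, __m128 a, __m128 b)
{
    return _mm_maskz_add_ss(k, a, b);
}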

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”



AESDEC—Perform One Round of an AES Decryption Flow
Opcode/ Op/ 64/32-bit CPUID Description
Instruction En Mode Feature Flag
66 0F 38 DE /r A V/V AES Perform one round of an AES decryption flow, using
AESDEC xmm1, xmm2/m128 the Equivalent Inverse Cipher, using one 128-bit data
(state) from xmm1 with one 128-bit round key from
xmm2/m128.
VEX.128.66.0F38.WIG DE /r B V/V AES Perform one round of an AES decryption flow, using
VAESDEC xmm1, xmm2, xmm3/m128 AVX the Equivalent Inverse Cipher, using one 128-bit data
(state) from xmm2 with one 128-bit round key from
xmm3/m128; store the result in xmm1.
VEX.256.66.0F38.WIG DE /r B V/V VAES Perform one round of an AES decryption flow, using
VAESDEC ymm1, ymm2, ymm3/m256 the Equivalent Inverse Cipher, using two 128-bit data
(state) from ymm2 with two 128-bit round keys from
ymm3/m256; store the result in ymm1.
EVEX.128.66.0F38.WIG DE /r C V/V VAES Perform one round of an AES decryption flow, using
VAESDEC xmm1, xmm2, xmm3/m128 (AVX512VL the Equivalent Inverse Cipher, using one 128-bit data
OR AVX10.11) (state) from xmm2 with one 128-bit round key from
xmm3/m128; store the result in xmm1.
EVEX.256.66.0F38.WIG DE /r C V/V VAES Perform one round of an AES decryption flow, using
VAESDEC ymm1, ymm2, ymm3/m256 (AVX512VL the Equivalent Inverse Cipher, using two 128-bit data
OR AVX10.11) (state) from ymm2 with two 128-bit round keys from
ymm3/m256; store the result in ymm1.
EVEX.512.66.0F38.WIG DE /r C V/V VAES Perform one round of an AES decryption flow, using
VAESDEC zmm1, zmm2, zmm3/m512 (AVX512F OR the Equivalent Inverse Cipher, using four 128-bit data
AVX10.11) (state) from zmm2 with four 128-bit round keys from
zmm3/m512; store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a single round of the AES decryption flow using the Equivalent Inverse Cipher, using
one/two/four (depending on vector length) 128-bit data (state) from the first source operand with one/two/four
(depending on vector length) round key(s) from the second source operand, and stores the result in the destina-
tion operand.
Use the AESDEC instruction for all but the last decryption round. For the last decryption round, use the AESDE-
CLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.



Operation
AESDEC
STATE := SRC1;
RoundKey := SRC2;
STATE := InvShiftRows( STATE );
STATE := InvSubBytes( STATE );
STATE := InvMixColumns( STATE );
DEST[127:0] := STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)

VAESDEC (128b and 256b VEX Encoded Versions)


(KL,VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := InvShiftRows( STATE )
STATE := InvSubBytes( STATE )
STATE := InvMixColumns( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

VAESDEC (EVEX Encoded Version)


(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := InvShiftRows( STATE )
STATE := InvSubBytes( STATE )
STATE := InvMixColumns( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


(V)AESDEC __m128i _mm_aesdec (__m128i, __m128i)
VAESDEC __m256i _mm256_aesdec_epi128(__m256i, __m256i);
VAESDEC __m512i _mm512_aesdec_epi128(__m512i, __m512i);
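
For illustration only, a C sketch of a full AES-128 decryption flow built on this instruction; in current compiler headers the 128-bit intrinsic is spelled _mm_aesdec_si128 (declared in <wmmintrin.h>). The decryption key schedule dk[0..10] is assumed to have been derived from the encryption schedule with AESIMC (see the sketch under AESDECLAST):

#include <wmmintrin.h>

/* Equivalent Inverse Cipher: initial AddRoundKey, nine AESDEC rounds,
   one AESDECLAST round. */
__m128i aes128_decrypt_block(__m128i ct, const __m128i dk[11])
{
    __m128i s = _mm_xor_si128(ct, dk[0]);
    for (int i = 1; i < 10; i++)
        s = _mm_aesdec_si128(s, dk[i]);
    return _mm_aesdeclast_si128(s, dk[10]);
}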

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



AESDECLAST—Perform Last Round of an AES Decryption Flow
Opcode/ Op/ 64/32-bit CPUID Description
Instruction En Mode Feature Flag
66 0F 38 DF /r A V/V AES Perform the last round of an AES decryption flow,
AESDECLAST xmm1, xmm2/m128 using the Equivalent Inverse Cipher, using one 128-
bit data (state) from xmm1 with one 128-bit round
key from xmm2/m128.
VEX.128.66.0F38.WIG DF /r B V/V AES Perform the last round of an AES decryption flow,
VAESDECLAST xmm1, xmm2, xmm3/m128 AVX using the Equivalent Inverse Cipher, using one 128-
bit data (state) from xmm2 with one 128-bit round
key from xmm3/m128; store the result in xmm1.
VEX.256.66.0F38.WIG DF /r B V/V VAES Perform the last round of an AES decryption flow,
VAESDECLAST ymm1, ymm2, ymm3/m256 using the Equivalent Inverse Cipher, using two 128-
bit data (state) from ymm2 with two 128-bit round
keys from ymm3/m256; store the result in ymm1.
EVEX.128.66.0F38.WIG DF /r C V/V VAES Perform the last round of an AES decryption flow,
VAESDECLAST xmm1, xmm2, xmm3/m128 (AVX512VL using the Equivalent Inverse Cipher, using one 128-
OR AVX10.11) bit data (state) from xmm2 with one 128-bit round
key from xmm3/m128; store the result in xmm1.
EVEX.256.66.0F38.WIG DF /r C V/V VAES Perform the last round of an AES decryption flow,
VAESDECLAST ymm1, ymm2, ymm3/m256 (AVX512VL using the Equivalent Inverse Cipher, using two 128-
OR AVX10.11) bit data (state) from ymm2 with two 128-bit round
keys from ymm3/m256; store the result in ymm1.
EVEX.512.66.0F38.WIG DF /r C V/V VAES Perform the last round of an AES decryption flow,
VAESDECLAST zmm1, zmm2, zmm3/m512 (AVX512F OR using the Equivalent Inverse Cipher, using four 128-
AVX10.11) bit data (state) from zmm2 with four 128-bit round
keys from zmm3/m512; store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs the last round of the AES decryption flow using the Equivalent Inverse Cipher, using
one/two/four (depending on vector length) 128-bit data (state) from the first source operand with one/two/four
(depending on vector length) round key(s) from the second source operand, and stores the result in the destina-
tion operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.



Operation
AESDECLAST
STATE := SRC1;
RoundKey := SRC2;
STATE := InvShiftRows( STATE );
STATE := InvSubBytes( STATE );
DEST[127:0] := STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)

VAESDECLAST (128b and 256b VEX Encoded Versions)


(KL,VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := InvShiftRows( STATE )
STATE := InvSubBytes( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

VAESDECLAST (EVEX Encoded Version)


(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := InvShiftRows( STATE )
STATE := InvSubBytes( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


(V)AESDECLAST __m128i _mm_aesdeclast (__m128i, __m128i)
VAESDECLAST __m256i _mm256_aesdeclast_epi128(__m256i, __m256i);
VAESDECLAST __m512i _mm512_aesdeclast_epi128(__m512i, __m512i);
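
For illustration only, a C sketch of preparing the decryption round keys consumed by the AESDEC/AESDECLAST flow shown under AESDEC; ek[0..10] is an AES-128 encryption key schedule, and _mm_aesimc_si128 (AESIMC) applies InvMixColumns to the middle round keys:

#include <wmmintrin.h>

/* dk[0] = ek[10]; dk[i] = InvMixColumns(ek[10-i]) for i = 1..9;
   dk[10] = ek[0]. */
void aes128_invert_schedule(__m128i dk[11], const __m128i ek[11])
{
    dk[0] = ek[10];
    for (int i = 1; i < 10; i++)
        dk[i] = _mm_aesimc_si128(ek[10 - i]);
    dk[10] = ek[0];
}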

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



AESENC—Perform One Round of an AES Encryption Flow
Opcode/ Op/ 64/32-bit CPUID Feature Description
Instruction En Mode Flag
66 0F 38 DC /r A V/V AES Perform one round of an AES encryption flow, using one
AESENC xmm1, xmm2/m128 128-bit data (state) from xmm1 with one 128-bit round
key from xmm2/m128.
VEX.128.66.0F38.WIG DC /r B V/V AES Perform one round of an AES encryption flow, using one
VAESENC xmm1, xmm2, xmm3/m128 AVX 128-bit data (state) from xmm2 with one 128-bit round
key from the xmm3/m128; store the result in xmm1.
VEX.256.66.0F38.WIG DC /r B V/V VAES Perform one round of an AES encryption flow, using two
VAESENC ymm1, ymm2, ymm3/m256 128-bit data (state) from ymm2 with two 128-bit round
keys from the ymm3/m256; store the result in ymm1.
EVEX.128.66.0F38.WIG DC /r C V/V VAES Perform one round of an AES encryption flow, using one
VAESENC xmm1, xmm2, xmm3/m128 (AVX512VL OR 128-bit data (state) from xmm2 with one 128-bit round
AVX10.11) key from the xmm3/m128; store the result in xmm1.
EVEX.256.66.0F38.WIG DC /r C V/V VAES Perform one round of an AES encryption flow, using two
VAESENC ymm1, ymm2, ymm3/m256 (AVX512VL OR 128-bit data (state) from ymm2 with two 128-bit round
AVX10.11) keys from the ymm3/m256; store the result in ymm1.
EVEX.512.66.0F38.WIG DC /r C V/V VAES Perform one round of an AES encryption flow, using
VAESENC zmm1, zmm2, zmm3/m512 (AVX512F OR four 128-bit data (state) from zmm2 with four 128-bit
AVX10.11) round keys from the zmm3/m512; store the result in
zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a single round of an AES encryption flow using one/two/four (depending on vector
length) 128-bit data (state) from the first source operand with one/two/four (depending on vector length) round
key(s) from the second source operand, and stores the result in the destination operand.
Use the AESENC instruction for all but the last encryption round. For the last encryption round, use the
AESENCLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.



Operation
AESENC
STATE := SRC1;
RoundKey := SRC2;
STATE := ShiftRows( STATE );
STATE := SubBytes( STATE );
STATE := MixColumns( STATE );
DEST[127:0] := STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)

VAESENC (128b and 256b VEX Encoded Versions)


(KL,VL) = (1,128), (2,256)
FOR i := 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := ShiftRows( STATE )
STATE := SubBytes( STATE )
STATE := MixColumns( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

VAESENC (EVEX Encoded Version)


(KL,VL) = (1,128), (2,256), (4,512)
FOR i := 0 to KL-1:
STATE := SRC1.xmm[i] // xmm[i] is the i’th xmm word in the SIMD register
RoundKey := SRC2.xmm[i]
STATE := ShiftRows( STATE )
STATE := SubBytes( STATE )
STATE := MixColumns( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


(V)AESENC __m128i _mm_aesenc (__m128i, __m128i)
VAESENC __m256i _mm256_aesenc_epi128(__m256i, __m256i);
VAESENC __m512i _mm512_aesenc_epi128(__m512i, __m512i);
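
For illustration only, a C sketch of the complete AES-128 encryption flow the text describes (the header name for the 128-bit intrinsic is _mm_aesenc_si128); rk[0..10] is assumed to hold the expanded key schedule:

#include <wmmintrin.h>

/* Initial AddRoundKey, nine AESENC rounds, one AESENCLAST round. */
__m128i aes128_encrypt_block(__m128i pt, const __m128i rk[11])
{
    __m128i s = _mm_xor_si128(pt, rk[0]);
    for (int i = 1; i < 10; i++)
        s = _mm_aesenc_si128(s, rk[i]);
    return _mm_aesenclast_si128(s, rk[10]);
}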

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



AESENCLAST—Perform Last Round of an AES Encryption Flow
Opcode/ Op/ 64/32-bit CPUID Description
Instruction En Mode Feature Flag
66 0F 38 DD /r A V/V AES Perform the last round of an AES encryption flow,
AESENCLAST xmm1, xmm2/m128 using one 128-bit data (state) from xmm1 with one
128-bit round key from xmm2/m128.
VEX.128.66.0F38.WIG DD /r B V/V AES Perform the last round of an AES encryption flow,
VAESENCLAST xmm1, xmm2, xmm3/m128 AVX using one 128-bit data (state) from xmm2 with one
128-bit round key from xmm3/m128; store the
result in xmm1.
VEX.256.66.0F38.WIG DD /r B V/V VAES Perform the last round of an AES encryption flow,
VAESENCLAST ymm1, ymm2, ymm3/m256 using two 128-bit data (state) from ymm2 with two
128-bit round keys from ymm3/m256; store the
result in ymm1.
EVEX.128.66.0F38.WIG DD /r C V/V VAES Perform the last round of an AES encryption flow,
VAESENCLAST xmm1, xmm2, xmm3/m128 (AVX512VL using one 128-bit data (state) from xmm2 with one
OR AVX10.11) 128-bit round key from xmm3/m128; store the
result in xmm1.
EVEX.256.66.0F38.WIG DD /r C V/V VAES Perform the last round of an AES encryption flow,
VAESENCLAST ymm1, ymm2, ymm3/m256 (AVX512VL using two 128-bit data (state) from ymm2 with two
OR AVX10.11) 128-bit round keys from ymm3/m256; store the
result in ymm1.
EVEX.512.66.0F38.WIG DD /r C V/V VAES Perform the last round of an AES encryption flow,
VAESENCLAST zmm1, zmm2, zmm3/m512 (AVX512F OR using four 128-bit data (state) from zmm2 with four
AVX10.11) 128-bit round keys from zmm3/m512; store the
result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs the last round of an AES encryption flow using one/two/four (depending on vector length)
128-bit data (state) from the first source operand with one/two/four (depending on vector length) round key(s)
from the second source operand, and stores the result in the destination operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.



Operation
AESENCLAST
STATE := SRC1;
RoundKey := SRC2;
STATE := ShiftRows( STATE );
STATE := SubBytes( STATE );
DEST[127:0] := STATE XOR RoundKey;
DEST[MAXVL-1:128] (Unmodified)

VAESENCLAST (128b and 256b VEX Encoded Versions)


(KL, VL) = (1,128), (2,256)
FOR i := 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := ShiftRows( STATE )
STATE := SubBytes( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

VAESENCLAST (EVEX Encoded Version)


(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
STATE := SRC1.xmm[i]
RoundKey := SRC2.xmm[i]
STATE := ShiftRows( STATE )
STATE := SubBytes( STATE )
DEST.xmm[i] := STATE XOR RoundKey
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


(V)AESENCLAST __m128i _mm_aesenclast (__m128i, __m128i)
VAESENCLAST __m256i _mm256_aesenclast_epi128(__m256i, __m256i);
VAESENCLAST __m512i _mm512_aesenclast_epi128(__m512i, __m512i);
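
For illustration only, a C sketch of the 512-bit VAES forms processing four independent AES-128 blocks per iteration; it assumes a target with VAES and AVX512F (or AVX10), and broadcasts each 128-bit round key to all four lanes:

#include <immintrin.h>

/* Four AES-128 blocks in parallel; rk[0..10] are 128-bit round keys. */
__m512i aes128_encrypt_x4(__m512i pt, const __m128i rk[11])
{
    __m512i s = _mm512_xor_si512(pt, _mm512_broadcast_i32x4(rk[0]));
    for (int i = 1; i < 10; i++)
        s = _mm512_aesenc_epi128(s, _mm512_broadcast_i32x4(rk[i]));
    return _mm512_aesenclast_epi128(s, _mm512_broadcast_i32x4(rk[10]));
}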

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



ANDNPD—Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 55 /r A V/V SSE2 Return the bitwise logical AND NOT of packed double
ANDNPD xmm1, xmm2/m128 precision floating-point values in xmm1 and
xmm2/mem.
VEX.128.66.0F 55 /r B V/V AVX Return the bitwise logical AND NOT of packed double
VANDNPD xmm1, xmm2, precision floating-point values in xmm2 and
xmm3/m128 xmm3/mem.
VEX.256.66.0F 55/r B V/V AVX Return the bitwise logical AND NOT of packed double
VANDNPD ymm1, ymm2, precision floating-point values in ymm2 and
ymm3/m256 ymm3/mem.
EVEX.128.66.0F.W1 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed double
VANDNPD xmm1 {k1}{z}, xmm2, AVX512DQ) OR precision floating-point values in xmm2 and
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst subject to writemask k1.
EVEX.256.66.0F.W1 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed double
VANDNPD ymm1 {k1}{z}, ymm2, AVX512DQ) OR precision floating-point values in ymm2 and
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst subject to writemask k1.
EVEX.512.66.0F.W1 55 /r C V/V AVX512DQ Return the bitwise logical AND NOT of packed double
VANDNPD zmm1 {k1}{z}, zmm2, OR AVX10.11 precision floating-point values in zmm2 and
zmm3/m512/m64bcst zmm3/m512/m64bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND NOT of the two, four or eight packed double precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.

Operation
VANDNPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := (NOT(SRC1[i+63:i])) BITWISE AND SRC2[63:0]
ELSE
DEST[i+63:i] := (NOT(SRC1[i+63:i])) BITWISE AND SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VANDNPD (VEX.256 Encoded Version)


DEST[63:0] := (NOT(SRC1[63:0])) BITWISE AND SRC2[63:0]
DEST[127:64] := (NOT(SRC1[127:64])) BITWISE AND SRC2[127:64]
DEST[191:128] := (NOT(SRC1[191:128])) BITWISE AND SRC2[191:128]
DEST[255:192] := (NOT(SRC1[255:192])) BITWISE AND SRC2[255:192]
DEST[MAXVL-1:256] := 0

VANDNPD (VEX.128 Encoded Version)


DEST[63:0] := (NOT(SRC1[63:0])) BITWISE AND SRC2[63:0]
DEST[127:64] := (NOT(SRC1[127:64])) BITWISE AND SRC2[127:64]
DEST[MAXVL-1:128] := 0

ANDNPD (128-bit Legacy SSE Version)


DEST[63:0] := (NOT(DEST[63:0])) BITWISE AND SRC[63:0]
DEST[127:64] := (NOT(DEST[127:64])) BITWISE AND SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VANDNPD __m512d _mm512_andnot_pd (__m512d a, __m512d b);
VANDNPD __m512d _mm512_mask_andnot_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
VANDNPD __m512d _mm512_maskz_andnot_pd (__mmask8 k, __m512d a, __m512d b);
VANDNPD __m256d _mm256_mask_andnot_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VANDNPD __m256d _mm256_maskz_andnot_pd (__mmask8 k, __m256d a, __m256d b);
VANDNPD __m128d _mm_mask_andnot_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VANDNPD __m128d _mm_maskz_andnot_pd (__mmask8 k, __m128d a, __m128d b);
VANDNPD __m256d _mm256_andnot_pd (__m256d a, __m256d b);
ANDNPD __m128d _mm_andnot_pd (__m128d a, __m128d b);
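
For illustration only, a common use of the AND NOT form: a branchless per-lane select, where mask would typically come from a compare such as _mm_cmplt_pd (all-ones in lanes where the predicate holds):

#include <immintrin.h>

/* Per lane: (mask AND x) OR ((NOT mask) AND y); ANDNPD supplies the
   (NOT mask) AND y term. */
__m128d select_pd(__m128d mask, __m128d x, __m128d y)
{
    return _mm_or_pd(_mm_and_pd(mask, x), _mm_andnot_pd(mask, y));
}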

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

ANDNPS—Bitwise Logical AND NOT of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 55 /r A V/V SSE Return the bitwise logical AND NOT of packed single
ANDNPS xmm1, xmm2/m128 precision floating-point values in xmm1 and xmm2/mem.
VEX.128.0F 55 /r B V/V AVX Return the bitwise logical AND NOT of packed single
VANDNPS xmm1, xmm2, precision floating-point values in xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F 55 /r B V/V AVX Return the bitwise logical AND NOT of packed single
VANDNPS ymm1, ymm2, precision floating-point values in ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed single
VANDNPS xmm1 {k1}{z}, xmm2, AVX512DQ) OR floating-point values in xmm2 and xmm3/m128/m32bcst
xmm3/m128/m32bcst AVX10.11 subject to writemask k1.
EVEX.256.0F.W0 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed single
VANDNPS ymm1 {k1}{z}, ymm2, AVX512DQ) OR floating-point values in ymm2 and ymm3/m256/m32bcst
ymm3/m256/m32bcst AVX10.11 subject to writemask k1.
EVEX.512.0F.W0 55 /r C V/V AVX512DQ Return the bitwise logical AND NOT of packed single
VANDNPS zmm1 {k1}{z}, zmm2, OR AVX10.11 floating-point values in zmm2 and zmm3/m512/m32bcst
zmm3/m512/m32bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND NOT of the four, eight or sixteen packed single precision floating-point values from
the first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
VANDNPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := (NOT(SRC1[i+31:i])) BITWISE AND SRC2[31:0]
ELSE
DEST[i+31:i] := (NOT(SRC1[i+31:i])) BITWISE AND SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VANDNPS (VEX.256 Encoded Version)


DEST[31:0] := (NOT(SRC1[31:0])) BITWISE AND SRC2[31:0]
DEST[63:32] := (NOT(SRC1[63:32])) BITWISE AND SRC2[63:32]
DEST[95:64] := (NOT(SRC1[95:64])) BITWISE AND SRC2[95:64]
DEST[127:96] := (NOT(SRC1[127:96])) BITWISE AND SRC2[127:96]
DEST[159:128] := (NOT(SRC1[159:128])) BITWISE AND SRC2[159:128]
DEST[191:160] := (NOT(SRC1[191:160])) BITWISE AND SRC2[191:160]
DEST[223:192] := (NOT(SRC1[223:192])) BITWISE AND SRC2[223:192]
DEST[255:224] := (NOT(SRC1[255:224])) BITWISE AND SRC2[255:224]
DEST[MAXVL-1:256] := 0

VANDNPS (VEX.128 Encoded Version)


DEST[31:0] := (NOT(SRC1[31:0])) BITWISE AND SRC2[31:0]
DEST[63:32] := (NOT(SRC1[63:32])) BITWISE AND SRC2[63:32]
DEST[95:64] := (NOT(SRC1[95:64])) BITWISE AND SRC2[95:64]
DEST[127:96] := (NOT(SRC1[127:96])) BITWISE AND SRC2[127:96]
DEST[MAXVL-1:128] := 0

ANDNPS (128-bit Legacy SSE Version)


DEST[31:0] := (NOT(DEST[31:0])) BITWISE AND SRC[31:0]
DEST[63:32] := (NOT(DEST[63:32])) BITWISE AND SRC[63:32]
DEST[95:64] := (NOT(DEST[95:64])) BITWISE AND SRC[95:64]
DEST[127:96] := (NOT(DEST[127:96])) BITWISE AND SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VANDNPS __m512 _mm512_andnot_ps (__m512 a, __m512 b);
VANDNPS __m512 _mm512_mask_andnot_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VANDNPS __m512 _mm512_maskz_andnot_ps (__mmask16 k, __m512 a, __m512 b);
VANDNPS __m256 _mm256_mask_andnot_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VANDNPS __m256 _mm256_maskz_andnot_ps (__mmask8 k, __m256 a, __m256 b);
VANDNPS __m128 _mm_mask_andnot_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VANDNPS __m128 _mm_maskz_andnot_ps (__mmask8 k, __m128 a, __m128 b);
VANDNPS __m256 _mm256_andnot_ps (__m256 a, __m256 b);
ANDNPS __m128 _mm_andnot_ps (__m128 a, __m128 b);
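
For illustration only, the classic absolute-value idiom built on ANDNPS: -0.0f has only the sign bit set, and (NOT sign_mask) AND x clears that bit in every element:

#include <immintrin.h>

/* |x| for four packed singles. */
__m128 abs_ps(__m128 x)
{
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), x);
}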

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

ANDPD—Bitwise Logical AND of Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 54 /r A V/V SSE2 Return the bitwise logical AND of packed double
ANDPD xmm1, xmm2/m128 precision floating-point values in xmm1 and
xmm2/mem.
VEX.128.66.0F 54 /r B V/V AVX Return the bitwise logical AND of packed double
VANDPD xmm1, xmm2, xmm3/m128 precision floating-point values in xmm2 and
xmm3/mem.
VEX.256.66.0F 54 /r B V/V AVX Return the bitwise logical AND of packed double
VANDPD ymm1, ymm2, ymm3/m256 precision floating-point values in ymm2 and
ymm3/mem.
EVEX.128.66.0F.W1 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed double
VANDPD xmm1 {k1}{z}, xmm2, AVX512DQ) OR precision floating-point values in xmm2 and
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst subject to writemask k1.
EVEX.256.66.0F.W1 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed double
VANDPD ymm1 {k1}{z}, ymm2, AVX512DQ) OR precision floating-point values in ymm2 and
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst subject to writemask k1.
EVEX.512.66.0F.W1 54 /r C V/V AVX512DQ Return the bitwise logical AND of packed double
VANDPD zmm1 {k1}{z}, zmm2, OR AVX10.11 precision floating-point values in zmm2 and
zmm3/m512/m64bcst zmm3/m512/m64bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND of the two, four or eight packed double precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.

Operation
VANDPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VANDPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE AND SRC2[127:64]
DEST[191:128] := SRC1[191:128] BITWISE AND SRC2[191:128]
DEST[255:192] := SRC1[255:192] BITWISE AND SRC2[255:192]
DEST[MAXVL-1:256] := 0

VANDPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE AND SRC2[127:64]
DEST[MAXVL-1:128] := 0

ANDPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] BITWISE AND SRC[63:0]
DEST[127:64] := DEST[127:64] BITWISE AND SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VANDPD __m512d _mm512_and_pd (__m512d a, __m512d b);
VANDPD __m512d _mm512_mask_and_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
VANDPD __m512d _mm512_maskz_and_pd (__mmask8 k, __m512d a, __m512d b);
VANDPD __m256d _mm256_mask_and_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VANDPD __m256d _mm256_maskz_and_pd (__mmask8 k, __m256d a, __m256d b);
VANDPD __m128d _mm_mask_and_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VANDPD __m128d _mm_maskz_and_pd (__mmask8 k, __m128d a, __m128d b);
VANDPD __m256d _mm256_and_pd (__m256d a, __m256d b);
ANDPD __m128d _mm_and_pd (__m128d a, __m128d b);
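
For illustration only, a C sketch using ANDPD to isolate sign bits, combined with ANDNPD and ORPD for a branchless per-lane copysign:

#include <immintrin.h>

/* copysign(mag, sgn) per lane: magnitude bits of mag, sign bit of sgn. */
__m128d copysign_pd(__m128d mag, __m128d sgn)
{
    const __m128d sign_mask = _mm_set1_pd(-0.0);   /* only the sign bit set */
    return _mm_or_pd(_mm_andnot_pd(sign_mask, mag),
                     _mm_and_pd(sign_mask, sgn));
}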

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

ANDPS—Bitwise Logical AND of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 54 /r A V/V SSE Return the bitwise logical AND of packed single precision
ANDPS xmm1, xmm2/m128 floating-point values in xmm1 and xmm2/mem.
VEX.128.0F 54 /r B V/V AVX Return the bitwise logical AND of packed single precision
VANDPS xmm1,xmm2, floating-point values in xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F 54 /r B V/V AVX Return the bitwise logical AND of packed single precision
VANDPS ymm1, ymm2, floating-point values in ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed single precision
VANDPS xmm1 {k1}{z}, xmm2, AVX512DQ) OR floating-point values in xmm2 and xmm3/m128/m32bcst
xmm3/m128/m32bcst AVX10.11 subject to writemask k1.
EVEX.256.0F.W0 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed single precision
VANDPS ymm1 {k1}{z}, ymm2, AVX512DQ) OR floating-point values in ymm2 and ymm3/m256/m32bcst
ymm3/m256/m32bcst AVX10.11 subject to writemask k1.
EVEX.512.0F.W0 54 /r C V/V AVX512DQ Return the bitwise logical AND of packed single precision
VANDPS zmm1 {k1}{z}, zmm2, OR AVX10.11 floating-point values in zmm2 and zmm3/m512/m32bcst
zmm3/m512/m32bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND of the four, eight or sixteen packed single precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
VANDPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;

VANDPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE AND SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE AND SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE AND SRC2[127:96]
DEST[159:128] := SRC1[159:128] BITWISE AND SRC2[159:128]
DEST[191:160] := SRC1[191:160] BITWISE AND SRC2[191:160]
DEST[223:192] := SRC1[223:192] BITWISE AND SRC2[223:192]
DEST[255:224] := SRC1[255:224] BITWISE AND SRC2[255:224]
DEST[MAXVL-1:256] := 0;

VANDPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE AND SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE AND SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE AND SRC2[127:96]
DEST[MAXVL-1:128] := 0;

ANDPS (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] BITWISE AND SRC[31:0]
DEST[63:32] := DEST[63:32] BITWISE AND SRC[63:32]
DEST[95:64] := DEST[95:64] BITWISE AND SRC[95:64]
DEST[127:96] := DEST[127:96] BITWISE AND SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VANDPS __m512 _mm512_and_ps (__m512 a, __m512 b);
VANDPS __m512 _mm512_mask_and_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VANDPS __m512 _mm512_maskz_and_ps (__mmask16 k, __m512 a, __m512 b);
VANDPS __m256 _mm256_mask_and_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VANDPS __m256 _mm256_maskz_and_ps (__mmask8 k, __m256 a, __m256 b);
VANDPS __m128 _mm_mask_and_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VANDPS __m128 _mm_maskz_and_ps (__mmask8 k, __m128 a, __m128 b);
VANDPS __m256 _mm256_and_ps (__m256 a, __m256 b);
ANDPS __m128 _mm_and_ps (__m128 a, __m128 b);
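
For illustration only, a brief sketch of the masked EVEX forms listed above (requires AVX512DQ with AVX512VL, or AVX10); lanes whose mask bit is clear are zeroed (maskz) or taken from s (mask):

#include <immintrin.h>

/* Zeroing-masked AND of packed singles. */
__m128 and_ps_maskz(__mmask8 k, __m128 a, __m128 b)
{
    return _mm_maskz_and_ps(k, a, b);
}

/* Merge-masked AND: unselected lanes come from s. */
__m128 and_ps_merge(__m128 s, __mmask8 k, __m128 a, __m128 b)
{
    return _mm_mask_and_ps(s, k, a, b);
}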

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

BSF—Bit Scan Forward
Opcode Instruction Op/ 64-bit Compat/ Description
En Mode Leg Mode
0F BC /r BSF r16, r/m16 RM Valid Valid Bit scan forward on r/m16.
0F BC /r BSF r32, r/m32 RM Valid Valid Bit scan forward on r/m32.
REX.W + 0F BC /r BSF r64, r/m64 RM Valid N.E. Bit scan forward on r/m64.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Searches the source operand (second operand) for the least significant set bit (1 bit). If a least significant 1 bit is
found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a
memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source
operand. If the content of the source operand is zero, the destination operand is unmodified.1
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See
the summary chart at the beginning of this section for encoding data and limits.
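
For illustration only, a hedged C sketch: on GCC and Clang, __builtin_ctz typically compiles to BSF (or TZCNT where BMI1 is enabled), and the zero case must be handled in software because BSF does not write a defined index when the source is 0:

#include <stdint.h>

/* Index of the least significant set bit, or -1 when x is 0. */
int bit_scan_forward(uint32_t x)
{
    return x ? __builtin_ctz(x) : -1;
}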

Operation
IF SRC = 0
THEN
ZF := 1;
DEST is undefined;
ELSE
ZF := 0;
temp := 0;
WHILE Bit(SRC, temp) = 0
DO
temp := temp + 1;
OD;
DEST := temp;
FI;

Flags Affected
The ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared. The CF, OF, SF, AF, and PF flags
are undefined.

Protected Mode Exceptions


#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
#UD If the LOCK prefix is used.

1. On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower
32 bits unmodified.



Real-Address Mode Exceptions
#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions


#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made.
#UD If the LOCK prefix is used.

Compatibility Mode Exceptions


Same exceptions as in protected mode.

64-Bit Mode Exceptions


#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
#UD If the LOCK prefix is used.



BSR—Bit Scan Reverse
Opcode Instruction Op/ 64-bit Compat/ Description
En Mode Leg Mode
0F BD /r BSR r16, r/m16 RM Valid Valid Bit scan reverse on r/m16.
0F BD /r BSR r32, r/m32 RM Valid Valid Bit scan reverse on r/m32.
REX.W + 0F BD /r BSR r64, r/m64 RM Valid N.E. Bit scan reverse on r/m64.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is
found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a
memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source
operand. If the content of the source operand is zero, the destination operand is unmodified.1
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See
the summary chart at the beginning of this section for encoding data and limits.
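
For illustration only, a hedged C sketch: BSR yields the index of the most significant set bit, i.e., floor(log2(x)) for nonzero x. GCC/Clang's __builtin_clz counts from the opposite end, so the BSR result for a 32-bit operand is 31 minus that count; the zero case must again be checked explicitly:

#include <stdint.h>

/* Index of the most significant set bit, or -1 when x is 0. */
int bit_scan_reverse(uint32_t x)
{
    return x ? 31 - __builtin_clz(x) : -1;
}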

Operation
IF SRC = 0
THEN
ZF := 1;
DEST is undefined;
ELSE
ZF := 0;
temp := OperandSize – 1;
WHILE Bit(SRC, temp) = 0
DO
temp := temp - 1;
OD;
DEST := temp;
FI;

Flags Affected
The ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared. The CF, OF, SF, AF, and PF flags
are undefined.

Protected Mode Exceptions


#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
#UD If the LOCK prefix is used.

1. On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower
32 bits unmodified.



Real-Address Mode Exceptions
#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions


#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made.
#UD If the LOCK prefix is used.

Compatibility Mode Exceptions


Same exceptions as in protected mode.

64-Bit Mode Exceptions


#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
#UD If the LOCK prefix is used.



CMPccXADD—Compare and Add if Condition is Met
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support

VEX.128.66.0F38.W0 E6 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPBEXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If below or equal (CF=1 or ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E6 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPBEXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If below or equal (CF=1 or ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E2 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPBXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If below (CF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E2 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPBXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If below (CF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 EE !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPLEXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If less or equal (ZF=1 or SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 EE !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPLEXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If less or equal (ZF=1 or SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 EC !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPLXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If less (SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 EC !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPLXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If less (SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E7 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNBEXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not below or equal (CF=0 and ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E7 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNBEXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not below or equal (CF=0 and ZF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E3 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNBXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not below (CF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E3 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNBXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not below (CF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 EF !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNLEXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not less or equal (ZF=0 and SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 EF !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNLEXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not less or equal (ZF=0 and SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 ED !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNLXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not less (SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 ED !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNLXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not less (SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E1 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNOXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not overflow (OF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E1 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNOXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not overflow (OF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 EB !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNPXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not parity (PF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 EB !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNPXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not parity (PF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E9 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNSXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not sign (SF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E9 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNSXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not sign (SF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E5 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNZXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If not zero (ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E5 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPNZXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If not zero (ZF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E0 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPOXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If overflow (OF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E0 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPOXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If overflow (OF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 EA !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPPXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If parity (PF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 EA !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPPXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If parity (PF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.


VEX.128.66.0F38.W0 E8 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPSXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If sign (SF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E8 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPSXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If sign (SF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

VEX.128.66.0F38.W0 E4 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPZXADD m32, r32, r32
Compare value in r32 (second operand) with value in m32. If zero (ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.

VEX.128.66.0F38.W1 E4 !(11):rrr:bbb A V/N.E. CMPCCXADD
CMPZXADD m64, r64, r64
Compare value in r64 (second operand) with value in m64. If zero (ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.

Instruction Operand Encoding1


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (r, w) ModRM:reg (r, w) VEX.vvvv (r) N/A

Description
This instruction compares the value from memory with the value of the second operand. If the specified condition
is met, then the processor will add the third operand to the memory operand and write it into memory, else the
memory is unchanged by this instruction.
This instruction must have MODRM.MOD equal to 0, 1, or 2. The value 3 for MODRM.MOD is reserved and will cause
an invalid opcode exception (#UD).
The second operand is always updated with the original value of the memory operand. The EFLAGS conditions are
updated from the results of the comparison.The instruction uses an implicit lock. This instruction does not permit
the use of an explicit lock prefix.

Operation
CMPCCXADD srcdest1, srcdest2, src3
tmp1 := load lock srcdest1
tmp2 := tmp1 + src3
EFLAGS.CS,OF,SF,ZF,AF,PF := CMP tmp1, srcdest2
IF <condition>:
srcdest1 := store unlock tmp2
ELSE
srcdest1 := store unlock tmp1
srcdest2 :=tmp1

1. ModRM.MOD != 11B



Flags Affected
The EFLAGS conditions are updated from the results of the comparison.
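
As an illustration of the Operation pseudocode, a minimal C model of the ZF=1 form (CMPZXADD) on 32-bit operands follows. This is a sketch of the logic only: in hardware, the load, compare, add, and store execute as a single implicitly locked read-modify-write, which plain C code cannot reproduce.

#include <stdint.h>
#include <stdbool.h>

/* Model of CMPZXADD m32, r32, r32 (condition ZF = 1). mem corresponds to
   srcdest1, reg2 to srcdest2, and src3 to the third operand. Not atomic. */
static void cmpzxadd32_model(int32_t *mem, int32_t *reg2, int32_t src3)
{
    int32_t tmp1 = *mem;             /* tmp1 := load lock srcdest1 */
    int32_t tmp2 = tmp1 + src3;      /* tmp2 := tmp1 + src3 */
    bool zf = (tmp1 == *reg2);       /* ZF from CMP tmp1, srcdest2 */
    *mem = zf ? tmp2 : tmp1;         /* store unlock (new or original value) */
    *reg2 = tmp1;                    /* srcdest2 := tmp1 */
}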

Intel C/C++ Compiler Intrinsic Equivalent


CMPCCXADD int _cmpccxadd_epi32 (void* __A, int __B, int __C, const int __D);
CMPCCXADD __int64 _cmpccxadd_epi64 (void* __A, __int64 __B, __int64 __C, const int __D);
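
A hypothetical usage sketch of the 32-bit intrinsic follows. The condition selector _CMPCCX_Z is assumed from compiler support headers for CMPCCXADD (it is not defined by this manual), and the compiler must be invoked with CMPCCXADD support enabled:

#include <immintrin.h>

/* Adds delta to *counter only if the original value of *counter is zero
   (ZF = 1 after the compare); always returns the original value.
   _CMPCCX_Z is an assumed compiler-header constant, not defined here. */
static int add_if_zero(int *counter, int delta)
{
    return _cmpccxadd_epi32((void *)counter, 0, delta, _CMPCCX_Z);
}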

SIMD Floating-Point Exceptions


None.

Exceptions
Exceptions Type 14; see Table 2-31.



CMPPD—Compare Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F C2 /r ib A V/V SSE2
CMPPD xmm1, xmm2/m128, imm8
Compare packed double precision floating-point values in xmm2/m128 and xmm1 using bits 2:0 of imm8 as a comparison predicate.

VEX.128.66.0F.WIG C2 /r ib B V/V AVX
VCMPPD xmm1, xmm2, xmm3/m128, imm8
Compare packed double precision floating-point values in xmm3/m128 and xmm2 using bits 4:0 of imm8 as a comparison predicate.

VEX.256.66.0F.WIG C2 /r ib B V/V AVX
VCMPPD ymm1, ymm2, ymm3/m256, imm8
Compare packed double precision floating-point values in ymm3/m256 and ymm2 using bits 4:0 of imm8 as a comparison predicate.

EVEX.128.66.0F.W1 C2 /r ib C V/V (AVX512VL AND AVX512F) OR AVX10.11
VCMPPD k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8
Compare packed double precision floating-point values in xmm3/m128/m64bcst and xmm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

EVEX.256.66.0F.W1 C2 /r ib C V/V (AVX512VL AND AVX512F) OR AVX10.11
VCMPPD k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8
Compare packed double precision floating-point values in ymm3/m256/m64bcst and ymm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

EVEX.512.66.0F.W1 C2 /r ib C V/V AVX512F OR AVX10.11
VCMPPD k1 {k2}, zmm2, zmm3/m512/m64bcst {sae}, imm8
Compare packed double precision floating-point values in zmm3/m512/m64bcst and zmm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed double precision floating-point values in the second source operand and
the first source operand and returns the result of the comparison to the destination operand. The comparison pred-
icate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the
two source operands.
EVEX encoded versions: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is an opmask register.
Comparison results are written to the destination operand under the writemask k2. Each comparison result is a
single mask bit of 1 (comparison true) or 0 (comparison false).
VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source
operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first
operand) is a YMM register. Four comparisons are performed with results written to the destination operand. The
result of each comparison is a quadword mask of all 1s (comparison true) or all 0s (comparison false).
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 128-bit memory location. Bits (MAXVL-1:128)
of the corresponding ZMM destination register remain unchanged. Two comparisons are performed with results
written to bits 127:0 of the destination operand. The result of each comparison is a quadword mask of all 1s
(comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destina-
tion ZMM register are zeroed. Two comparisons are performed with results written to bits 127:0 of the destination
operand.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX or EVEX prefix, bits 4:0 define the type of comparison to be performed
(see Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.

Table 3-8. Comparison Predicate for CMPPD and CMPPS Instructions


Predicate / imm8 Value / Description / A>B / A<B / A=B / Unordered1 / Signals #IA on QNAN (A is the 1st operand, B is the 2nd operand)

EQ_OQ (EQ) 0H Equal (ordered, non-signaling) False False True False No


LT_OS (LT) 1H Less-than (ordered, signaling) False True False False Yes
LE_OS (LE) 2H Less-than-or-equal (ordered, signaling) False True True False Yes
UNORD_Q (UNORD) 3H Unordered (non-signaling) False False False True No
NEQ_UQ (NEQ) 4H Not-equal (unordered, non-signaling) True True False True No
NLT_US (NLT) 5H Not-less-than (unordered, signaling) True False True True Yes
NLE_US (NLE) 6H Not-less-than-or-equal (unordered, signaling) True False False True Yes
ORD_Q (ORD) 7H Ordered (non-signaling) True True True False No
EQ_UQ 8H Equal (unordered, non-signaling) False False True True No
NGE_US (NGE) 9H Not-greater-than-or-equal (unordered, signaling) False True False True Yes
NGT_US (NGT) AH Not-greater-than (unordered, signaling) False True True True Yes
FALSE_OQ(FALSE) BH False (ordered, non-signaling) False False False False No
NEQ_OQ CH Not-equal (ordered, non-signaling) True True False False No
GE_OS (GE) DH Greater-than-or-equal (ordered, signaling) True False True False Yes
GT_OS (GT) EH Greater-than (ordered, signaling) True False False False Yes
TRUE_UQ(TRUE) FH True (unordered, non-signaling) True True True True No
EQ_OS 10H Equal (ordered, signaling) False False True False Yes
LT_OQ 11H Less-than (ordered, nonsignaling) False True False False No
LE_OQ 12H Less-than-or-equal (ordered, nonsignaling) False True True False No
UNORD_S 13H Unordered (signaling) False False False True Yes
NEQ_US 14H Not-equal (unordered, signaling) True True False True Yes
NLT_UQ 15H Not-less-than (unordered, nonsignaling) True False True True No
NLE_UQ 16H Not-less-than-or-equal (unordered, nonsignaling) True False False True No

ORD_S 17H Ordered (signaling) True True True False Yes

EQ_US 18H Equal (unordered, signaling) False False True True Yes
NGE_UQ 19H Not-greater-than-or-equal (unordered, nonsignaling) False True False True No
NGT_UQ 1AH Not-greater-than (unordered, nonsignaling) False True True True No
FALSE_OS 1BH False (ordered, signaling) False False False False Yes
NEQ_OS 1CH Not-equal (ordered, signaling) True True False False Yes
GE_OQ 1DH Greater-than-or-equal (ordered, nonsignaling) True False True False No
GT_OQ 1EH Greater-than (ordered, nonsignaling) True False False False No
TRUE_US 1FH True (unordered, signaling) True True True True Yes

NOTES:
1. If either operand A or B is a NaN.

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
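
For example, with AVX available, the ordered and unordered equality predicates from Table 3-8 differ exactly on NaN inputs. A small sketch (_CMP_EQ_OQ and _CMP_EQ_UQ are the standard immintrin.h constants for predicates 0H and 8H):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Lane 0 holds a QNaN, lane 1 holds 1.0 (_mm_set_pd lists the high
       lane first). Comparing the vector with itself isolates NaN handling. */
    __m128d a = _mm_set_pd(1.0, nan(""));
    int eq_oq = _mm_movemask_pd(_mm_cmp_pd(a, a, _CMP_EQ_OQ));
    int eq_uq = _mm_movemask_pd(_mm_cmp_pd(a, a, _CMP_EQ_UQ));
    printf("EQ_OQ mask=%x EQ_UQ mask=%x\n", eq_oq, eq_uq); /* 2 and 3 */
    return 0;
}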
A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
Note that processors with “CPUID.1H:ECX.AVX =0” do not implement the “greater-than”, “greater-than-or-equal”,
“not-greater-than”, and “not-greater-than-or-equal” relation predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPPD instruction, for processors with “CPUID.1H:ECX.AVX =0”. See Table 3-9. The compiler should treat
reserved imm8 values as illegal syntax.
Table 3-9. Pseudo-Op and CMPPD Implementation

Pseudo-Op CMPPD Implementation


CMPEQPD xmm1, xmm2 CMPPD xmm1, xmm2, 0
CMPLTPD xmm1, xmm2 CMPPD xmm1, xmm2, 1
CMPLEPD xmm1, xmm2 CMPPD xmm1, xmm2, 2
CMPUNORDPD xmm1, xmm2 CMPPD xmm1, xmm2, 3
CMPNEQPD xmm1, xmm2 CMPPD xmm1, xmm2, 4
CMPNLTPD xmm1, xmm2 CMPPD xmm1, xmm2, 5
CMPNLEPD xmm1, xmm2 CMPPD xmm1, xmm2, 6
CMPORDPD xmm1, xmm2 CMPPD xmm1, xmm2, 7

The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
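
As a concrete sketch of such an emulation on an SSE2-only processor, a greater-than compare can be produced by applying the less-than predicate with the operands swapped; the compiler's register allocation supplies the protective copy mentioned above (cmpgt_pd_sse2 is a hypothetical helper name):

#include <emmintrin.h>  /* SSE2 */

/* a > b computed as b < a, using only CMPPD predicate 1 (LT_OS). */
static __m128d cmpgt_pd_sse2(__m128d a, __m128d b)
{
    return _mm_cmplt_pd(b, a);
}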
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-10, so
software emulation is no longer needed. Compilers and assemblers may implement the following three-operand
pseudo-ops in addition to the four-operand VCMPPD instruction. See Table 3-10, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPPD
instructions in a similar fashion by extending the syntax listed in Table 3-10.
Table 3-10. Pseudo-Op and VCMPPD Implementation

Pseudo-Op CMPPD Implementation


VCMPEQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0
VCMPLTPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1
VCMPLEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 2
VCMPUNORDPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 3
VCMPNEQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 4
VCMPNLTPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 5
VCMPNLEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 6
VCMPORDPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 7
VCMPEQ_UQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 8
VCMPNGEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 9
VCMPNGTPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0AH
VCMPFALSEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0BH
VCMPNEQ_OQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0CH
VCMPGEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0DH
VCMPGTPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0EH
VCMPTRUEPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 0FH
VCMPEQ_OSPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 10H
VCMPLT_OQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 11H
VCMPLE_OQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 12H
VCMPUNORD_SPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 13H
VCMPNEQ_USPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 14H
VCMPNLT_UQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 15H
VCMPNLE_UQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 16H
VCMPORD_SPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 17H
VCMPEQ_USPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 18H
VCMPNGE_UQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 19H
VCMPNGT_UQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1AH
VCMPFALSE_OSPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1BH
VCMPNEQ_OSPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1CH
VCMPGE_OQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1DH
VCMPGT_OQPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1EH
VCMPTRUE_USPD reg1, reg2, reg3 VCMPPD reg1, reg2, reg3, 1FH



Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;
16: OP5 := EQ_OS;
17: OP5 := LT_OQ;
18: OP5 := LE_OQ;
19: OP5 := UNORD_S;
20: OP5 := NEQ_US;
21: OP5 := NLT_UQ;
22: OP5 := NLE_UQ;
23: OP5 := ORD_S;
24: OP5 := EQ_US;
25: OP5 := NGE_UQ;
26: OP5 := NGT_UQ;
27: OP5 := FALSE_OS;
28: OP5 := NEQ_OS;
29: OP5 := GE_OQ;
30: OP5 := GT_OQ;
31: OP5 := TRUE_US;
DEFAULT: Reserved;
ESAC;

VCMPPD (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
CMP := SRC1[i+63:i] OP5 SRC2[63:0]
ELSE
CMP := SRC1[i+63:i] OP5 SRC2[i+63:i]
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VCMPPD (VEX.256 Encoded Version)


CMP0 := SRC1[63:0] OP5 SRC2[63:0];
CMP1 := SRC1[127:64] OP5 SRC2[127:64];
CMP2 := SRC1[191:128] OP5 SRC2[191:128];
CMP3 := SRC1[255:192] OP5 SRC2[255:192];
IF CMP0 = TRUE
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0000000000000000H; FI;
IF CMP1 = TRUE
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0000000000000000H; FI;
IF CMP2 = TRUE
THEN DEST[191:128] := FFFFFFFFFFFFFFFFH;
ELSE DEST[191:128] := 0000000000000000H; FI;
IF CMP3 = TRUE
THEN DEST[255:192] := FFFFFFFFFFFFFFFFH;
ELSE DEST[255:192] := 0000000000000000H; FI;
DEST[MAXVL-1:256] := 0

VCMPPD (VEX.128 Encoded Version)


CMP0 := SRC1[63:0] OP5 SRC2[63:0];
CMP1 := SRC1[127:64] OP5 SRC2[127:64];
IF CMP0 = TRUE
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0000000000000000H; FI;
IF CMP1 = TRUE
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0000000000000000H; FI;
DEST[MAXVL-1:128] := 0

CMPPD (128-bit Legacy SSE Version)


CMP0 := SRC1[63:0] OP3 SRC2[63:0];
CMP1 := SRC1[127:64] OP3 SRC2[127:64];
IF CMP0 = TRUE
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0000000000000000H; FI;
IF CMP1 = TRUE
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0000000000000000H; FI;
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCMPPD __mmask8 _mm512_cmp_pd_mask( __m512d a, __m512d b, int imm);
VCMPPD __mmask8 _mm512_cmp_round_pd_mask( __m512d a, __m512d b, int imm, int sae);
VCMPPD __mmask8 _mm512_mask_cmp_pd_mask( __mmask8 k1, __m512d a, __m512d b, int imm);
VCMPPD __mmask8 _mm512_mask_cmp_round_pd_mask( __mmask8 k1, __m512d a, __m512d b, int imm, int sae);
VCMPPD __mmask8 _mm256_cmp_pd_mask( __m256d a, __m256d b, int imm);
VCMPPD __mmask8 _mm256_mask_cmp_pd_mask( __mmask8 k1, __m256d a, __m256d b, int imm);
VCMPPD __mmask8 _mm_cmp_pd_mask( __m128d a, __m128d b, int imm);
VCMPPD __mmask8 _mm_mask_cmp_pd_mask( __mmask8 k1, __m128d a, __m128d b, int imm);
VCMPPD __m256d _mm256_cmp_pd(__m256d a, __m256d b, int imm)
(V)CMPPD __m128d _mm_cmp_pd(__m128d a, __m128d b, int imm)
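
A usage sketch of the opmask form listed above (assumes AVX-512F and POPCNT; count_lt_lanes is a hypothetical helper name):

#include <immintrin.h>

/* Compares eight double precision lanes with the LT_OS predicate and counts
   how many compared true. The opmask holds one bit per lane, per the EVEX
   operation above, so a population count yields the number of true lanes. */
static int count_lt_lanes(const double x[8], const double y[8])
{
    __m512d a = _mm512_loadu_pd(x);
    __m512d b = _mm512_loadu_pd(y);
    __mmask8 m = _mm512_cmp_pd_mask(a, b, _CMP_LT_OS); /* predicate 1H */
    return (int)_mm_popcnt_u32((unsigned)m);
}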

SIMD Floating-Point Exceptions


Invalid if SNaN operand; Invalid if QNaN and the predicate signals #IA on QNaN, as listed in Table 3-8; Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



CMPPS—Compare Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F C2 /r ib A V/V SSE
CMPPS xmm1, xmm2/m128, imm8
Compare packed single precision floating-point values in xmm2/m128 and xmm1 using bits 2:0 of imm8 as a comparison predicate.

VEX.128.0F.WIG C2 /r ib B V/V AVX
VCMPPS xmm1, xmm2, xmm3/m128, imm8
Compare packed single precision floating-point values in xmm3/m128 and xmm2 using bits 4:0 of imm8 as a comparison predicate.

VEX.256.0F.WIG C2 /r ib B V/V AVX
VCMPPS ymm1, ymm2, ymm3/m256, imm8
Compare packed single precision floating-point values in ymm3/m256 and ymm2 using bits 4:0 of imm8 as a comparison predicate.

EVEX.128.0F.W0 C2 /r ib C V/V (AVX512VL AND AVX512F) OR AVX10.11
VCMPPS k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8
Compare packed single precision floating-point values in xmm3/m128/m32bcst and xmm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

EVEX.256.0F.W0 C2 /r ib C V/V (AVX512VL AND AVX512F) OR AVX10.11
VCMPPS k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8
Compare packed single precision floating-point values in ymm3/m256/m32bcst and ymm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

EVEX.512.0F.W0 C2 /r ib C V/V AVX512F OR AVX10.11
VCMPPS k1 {k2}, zmm2, zmm3/m512/m32bcst {sae}, imm8
Compare packed single precision floating-point values in zmm3/m512/m32bcst and zmm2 using bits 4:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed single precision floating-point values in the second source operand and
the first source operand and returns the result of the comparison to the destination operand. The comparison pred-
icate operand (immediate byte) specifies the type of comparison performed on each of the pairs of packed values.
EVEX encoded versions: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is an opmask register.
Comparison results are written to the destination operand under the writemask k2. Each comparison result is a
single mask bit of 1 (comparison true) or 0 (comparison false).
VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source
operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first
operand) is a YMM register. Eight comparisons are performed with results written to the destination operand. The
result of each comparison is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 128-bit memory location. Bits (MAXVL-1:128)
of the corresponding ZMM destination register remain unchanged. Four comparisons are performed with results
written to bits 127:0 of the destination operand. The result of each comparison is a doubleword mask of all 1s
(comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destina-
tion ZMM register are zeroed. Four comparisons are performed with results written to bits 127:0 of the destination
operand.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix and EVEX prefix, bits 4:0 define the type of comparison to be
performed (see Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
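
This property makes branchless selection idiomatic. For example, a per-lane maximum can be built from a CMPPS mask with plain logic operations, none of which fault on the all-1s (QNaN) bit pattern (a sketch; max_ps_select is a hypothetical helper name):

#include <xmmintrin.h>  /* SSE */

/* Branchless per-lane select using the all-1s/all-0s lane masks produced by
   CMPPS: where a < b take b, otherwise take a. */
static __m128 max_ps_select(__m128 a, __m128 b)
{
    __m128 m = _mm_cmplt_ps(a, b);            /* lane mask: a < b */
    return _mm_or_ps(_mm_and_ps(m, b),        /* b where the mask is all 1s */
                     _mm_andnot_ps(m, a));    /* a where the mask is all 0s */
}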
Note that processors with “CPUID.1H:ECX.AVX =0” do not implement the “greater-than”, “greater-than-or-equal”,
“not-greater-than”, and “not-greater-than-or-equal” relation predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPPS instruction, for processors with “CPUID.1H:ECX.AVX =0”. See Table 3-11. The compiler should treat
reserved imm8 values as illegal syntax.
Table 3-11. Pseudo-Op and CMPPS Implementation

Pseudo-Op CMPPS Implementation


CMPEQPS xmm1, xmm2 CMPPS xmm1, xmm2, 0
CMPLTPS xmm1, xmm2 CMPPS xmm1, xmm2, 1
CMPLEPS xmm1, xmm2 CMPPS xmm1, xmm2, 2
CMPUNORDPS xmm1, xmm2 CMPPS xmm1, xmm2, 3
CMPNEQPS xmm1, xmm2 CMPPS xmm1, xmm2, 4
CMPNLTPS xmm1, xmm2 CMPPS xmm1, xmm2, 5
CMPNLEPS xmm1, xmm2 CMPPS xmm1, xmm2, 6
CMPORDPS xmm1, xmm2 CMPPS xmm1, xmm2, 7

The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
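
With AVX available the greater-than family can instead be encoded directly, as the next paragraph notes. A sketch using the standard immintrin.h constant _CMP_GT_OQ (predicate 1EH):

#include <immintrin.h>  /* AVX */

/* Eight single precision lanes compared as a > b with an explicit 5-bit
   predicate; no operand swap or software emulation is required. */
static __m256 cmpgt_ps_avx(__m256 a, __m256 b)
{
    return _mm256_cmp_ps(a, b, _CMP_GT_OQ);
}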
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-12, so
software emulation is no longer needed. Compilers and assemblers may implement the following three-operand
pseudo-ops in addition to the four-operand VCMPPS instruction. See Table 3-12, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPPS
instructions in a similar fashion by extending the syntax listed in Table 3-12.


Table 3-12. Pseudo-Op and VCMPPS Implementation
Pseudo-Op CMPPS Implementation
VCMPEQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0
VCMPLTPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1
VCMPLEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 2
VCMPUNORDPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 3
VCMPNEQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 4
VCMPNLTPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 5
VCMPNLEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 6
VCMPORDPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 7
VCMPEQ_UQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 8
VCMPNGEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 9
VCMPNGTPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0AH
VCMPFALSEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0BH
VCMPNEQ_OQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0CH
VCMPGEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0DH
VCMPGTPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0EH
VCMPTRUEPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 0FH
VCMPEQ_OSPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 10H
VCMPLT_OQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 11H
VCMPLE_OQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 12H
VCMPUNORD_SPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 13H
VCMPNEQ_USPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 14H
VCMPNLT_UQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 15H
VCMPNLE_UQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 16H
VCMPORD_SPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 17H
VCMPEQ_USPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 18H
VCMPNGE_UQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 19H
VCMPNGT_UQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1AH
VCMPFALSE_OSPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1BH
VCMPNEQ_OSPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1CH
VCMPGE_OQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1DH
VCMPGT_OQPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1EH
VCMPTRUE_USPS reg1, reg2, reg3 VCMPPS reg1, reg2, reg3, 1FH



Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;
16: OP5 := EQ_OS;
17: OP5 := LT_OQ;
18: OP5 := LE_OQ;
19: OP5 := UNORD_S;
20: OP5 := NEQ_US;
21: OP5 := NLT_UQ;
22: OP5 := NLE_UQ;
23: OP5 := ORD_S;
24: OP5 := EQ_US;
25: OP5 := NGE_UQ;
26: OP5 := NGT_UQ;
27: OP5 := FALSE_OS;
28: OP5 := NEQ_OS;
29: OP5 := GE_OQ;
30: OP5 := GT_OQ;
31: OP5 := TRUE_US;
DEFAULT: Reserved
ESAC;

VCMPPS (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
CMP := SRC1[i+31:i] OP5 SRC2[31:0]
ELSE
CMP := SRC1[i+31:i] OP5 SRC2[i+31:i]
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VCMPPS (VEX.256 Encoded Version)


CMP0 := SRC1[31:0] OP5 SRC2[31:0];
CMP1 := SRC1[63:32] OP5 SRC2[63:32];
CMP2 := SRC1[95:64] OP5 SRC2[95:64];
CMP3 := SRC1[127:96] OP5 SRC2[127:96];
CMP4 := SRC1[159:128] OP5 SRC2[159:128];
CMP5 := SRC1[191:160] OP5 SRC2[191:160];
CMP6 := SRC1[223:192] OP5 SRC2[223:192];
CMP7 := SRC1[255:224] OP5 SRC2[255:224];
IF CMP0 = TRUE
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 00000000H; FI;
IF CMP1 = TRUE
THEN DEST[63:32] := FFFFFFFFH;
ELSE DEST[63:32] := 00000000H; FI;
IF CMP2 = TRUE
THEN DEST[95:64] := FFFFFFFFH;
ELSE DEST[95:64] := 00000000H; FI;
IF CMP3 = TRUE
THEN DEST[127:96] := FFFFFFFFH;
ELSE DEST[127:96] := 00000000H; FI;
IF CMP4 = TRUE
THEN DEST[159:128] := FFFFFFFFH;
ELSE DEST[159:128] := 00000000H; FI;
IF CMP5 = TRUE
THEN DEST[191:160] := FFFFFFFFH;
ELSE DEST[191:160] := 00000000H; FI;
IF CMP6 = TRUE
THEN DEST[223:192] := FFFFFFFFH;
ELSE DEST[223:192] := 00000000H; FI;
IF CMP7 = TRUE
THEN DEST[255:224] := FFFFFFFFH;
ELSE DEST[255:224] := 00000000H; FI;
DEST[MAXVL-1:256] := 0

VCMPPS (VEX.128 Encoded Version)


CMP0 := SRC1[31:0] OP5 SRC2[31:0];
CMP1 := SRC1[63:32] OP5 SRC2[63:32];
CMP2 := SRC1[95:64] OP5 SRC2[95:64];
CMP3 := SRC1[127:96] OP5 SRC2[127:96];
IF CMP0 = TRUE
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 00000000H; FI;
IF CMP1 = TRUE
THEN DEST[63:32] := FFFFFFFFH;
ELSE DEST[63:32] := 00000000H; FI;
IF CMP2 = TRUE
THEN DEST[95:64] := FFFFFFFFH;
ELSE DEST[95:64] := 00000000H; FI;
IF CMP3 = TRUE
THEN DEST[127:96] := FFFFFFFFH;
ELSE DEST[127:96] := 00000000H; FI;
DEST[MAXVL-1:128] := 0

CMPPS (128-bit Legacy SSE Version)


CMP0 := SRC1[31:0] OP3 SRC2[31:0];
CMP1 := SRC1[63:32] OP3 SRC2[63:32];
CMP2 := SRC1[95:64] OP3 SRC2[95:64];
CMP3 := SRC1[127:96] OP3 SRC2[127:96];
IF CMP0 = TRUE
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 00000000H; FI;
IF CMP1 = TRUE
THEN DEST[63:32] := FFFFFFFFH;
ELSE DEST[63:32] := 00000000H; FI;
IF CMP2 = TRUE
THEN DEST[95:64] := FFFFFFFFH;
ELSE DEST[95:64] := 00000000H; FI;
IF CMP3 = TRUE
THEN DEST[127:96] := FFFFFFFFH;
ELSE DEST[127:96] := 00000000H; FI;
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCMPPS __mmask16 _mm512_cmp_ps_mask( __m512 a, __m512 b, int imm);
VCMPPS __mmask16 _mm512_cmp_round_ps_mask( __m512 a, __m512 b, int imm, int sae);
VCMPPS __mmask16 _mm512_mask_cmp_ps_mask( __mmask16 k1, __m512 a, __m512 b, int imm);
VCMPPS __mmask16 _mm512_mask_cmp_round_ps_mask( __mmask16 k1, __m512 a, __m512 b, int imm, int sae);
VCMPPS __mmask8 _mm256_cmp_ps_mask( __m256 a, __m256 b, int imm);
VCMPPS __mmask8 _mm256_mask_cmp_ps_mask( __mmask8 k1, __m256 a, __m256 b, int imm);
VCMPPS __mmask8 _mm_cmp_ps_mask( __m128 a, __m128 b, int imm);
VCMPPS __mmask8 _mm_mask_cmp_ps_mask( __mmask8 k1, __m128 a, __m128 b, int imm);
VCMPPS __m256 _mm256_cmp_ps(__m256 a, __m256 b, int imm)
CMPPS __m128 _mm_cmp_ps(__m128 a, __m128 b, int imm)
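
A usage sketch of the 512-bit mask form listed above (assumes AVX-512F; keep_nonnegative is a hypothetical helper name):

#include <immintrin.h>

/* Builds a 16-bit opmask with VCMPPS and consumes it with a zero-masking
   move: lanes where v >= 0.0 are kept, all other lanes become +0.0. */
static __m512 keep_nonnegative(__m512 v)
{
    __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GE_OQ);
    return _mm512_maskz_mov_ps(m, v);
}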

SIMD Floating-Point Exceptions


Invalid if SNaN operand; Invalid if QNaN and the predicate signals #IA on QNaN, as listed in Table 3-8; Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



CMPSD—Compare Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F C2 /r ib A V/V SSE2
CMPSD xmm1, xmm2/m64, imm8
Compare low double precision floating-point value in xmm2/m64 and xmm1 using bits 2:0 of imm8 as comparison predicate.

VEX.LIG.F2.0F.WIG C2 /r ib B V/V AVX
VCMPSD xmm1, xmm2, xmm3/m64, imm8
Compare low double precision floating-point value in xmm3/m64 and xmm2 using bits 4:0 of imm8 as comparison predicate.

EVEX.LLIG.F2.0F.W1 C2 /r ib C V/V AVX512F OR AVX10.11
VCMPSD k1 {k2}, xmm2, xmm3/m64{sae}, imm8
Compare low double precision floating-point value in xmm3/m64 and xmm2 using bits 4:0 of imm8 as comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Compares the low double precision floating-point values in the second source operand and the first source operand
and returns the result of the comparison to the destination operand. The comparison predicate operand (imme-
diate operand) specifies the type of comparison performed.
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 64-bit memory location. Bits (MAXVL-1:64) of
the corresponding YMM destination register remain unchanged. The comparison result is a quadword mask of all 1s
(comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 64-bit memory location. The result is stored in the low quad-
word of the destination operand; the high quadword is filled with the contents of the high quadword of the first
source operand. Bits (MAXVL-1:128) of the destination ZMM register are zeroed. The comparison result is a quad-
word mask of all 1s (comparison true) or all 0s (comparison false).
EVEX encoded version: The first source operand (second operand) is an XMM register. The second source operand
can be a XMM register or a 64-bit memory location. The destination operand (first operand) is an opmask register.
The comparison result is a single mask bit of 1 (comparison true) or 0 (comparison false), written to the destination
starting from the LSB according to the writemask k2. Bits (MAX_KL-1:1) of the destination register are cleared.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see
Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.



A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
Note that processors with “CPUID.1H:ECX.AVX =0” do not implement the “greater-than”, “greater-than-or-equal”,
“not-greater-than”, and “not-greater-than-or-equal” relation predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPSD instruction, for processors with “CPUID.1H:ECX.AVX =0”. See Table 3-13. The compiler should treat
reserved imm8 values as illegal syntax.
Table 3-13. Pseudo-Op and CMPSD Implementation

Pseudo-Op CMPSD Implementation


CMPEQSD xmm1, xmm2 CMPSD xmm1, xmm2, 0
CMPLTSD xmm1, xmm2 CMPSD xmm1, xmm2, 1
CMPLESD xmm1, xmm2 CMPSD xmm1, xmm2, 2
CMPUNORDSD xmm1, xmm2 CMPSD xmm1, xmm2, 3
CMPNEQSD xmm1, xmm2 CMPSD xmm1, xmm2, 4
CMPNLTSD xmm1, xmm2 CMPSD xmm1, xmm2, 5
CMPNLESD xmm1, xmm2 CMPSD xmm1, xmm2, 6
CMPORDSD xmm1, xmm2 CMPSD xmm1, xmm2, 7

The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-14, so
software emulation is no longer needed. Compilers and assemblers may implement the following three-operand
pseudo-ops in addition to the four-operand VCMPSD instruction. See Table 3-14, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPSD
instructions in a similar fashion by extending the syntax listed in Table 3-14.
Table 3-14. Pseudo-Op and VCMPSD Implementation

Pseudo-Op CMPSD Implementation


VCMPEQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0
VCMPLTSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1
VCMPLESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 2
VCMPUNORDSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 3
VCMPNEQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 4
VCMPNLTSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 5
VCMPNLESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 6
VCMPORDSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 7
VCMPEQ_UQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 8
VCMPNGESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 9

VCMPNGTSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0AH
VCMPFALSESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0BH
VCMPNEQ_OQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0CH
VCMPGESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0DH
VCMPGTSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0EH
VCMPTRUESD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 0FH
VCMPEQ_OSSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 10H
VCMPLT_OQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 11H
VCMPLE_OQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 12H
VCMPUNORD_SSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 13H
VCMPNEQ_USSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 14H
VCMPNLT_UQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 15H
VCMPNLE_UQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 16H
VCMPORD_SSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 17H
VCMPEQ_USSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 18H
VCMPNGE_UQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 19H
VCMPNGT_UQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1AH
VCMPFALSE_OSSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1BH
VCMPNEQ_OSSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1CH
VCMPGE_OQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1DH
VCMPGT_OQSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1EH
VCMPTRUE_USSD reg1, reg2, reg3 VCMPSD reg1, reg2, reg3, 1FH

Software should ensure VCMPSD is encoded with VEX.L=0. Encoding VCMPSD with VEX.L=1 may result in unpre-
dictable behavior across different processor generations.

Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;
16: OP5 := EQ_OS;
17: OP5 := LT_OQ;
18: OP5 := LE_OQ;
19: OP5 := UNORD_S;
20: OP5 := NEQ_US;
21: OP5 := NLT_UQ;
22: OP5 := NLE_UQ;
23: OP5 := ORD_S;
24: OP5 := EQ_US;
25: OP5 := NGE_UQ;
26: OP5 := NGT_UQ;
27: OP5 := FALSE_OS;
28: OP5 := NEQ_OS;
29: OP5 := GE_OQ;
30: OP5 := GT_OQ;
31: OP5 := TRUE_US;
DEFAULT: Reserved
ESAC;

VCMPSD (EVEX Encoded Version)


CMP0 := SRC1[63:0] OP5 SRC2[63:0];
IF k2[0] or *no writemask*
THEN IF CMP0 = TRUE
THEN DEST[0] := 1;
ELSE DEST[0] := 0; FI;
ELSE DEST[0] := 0 ; zeroing-masking only
FI;
DEST[MAX_KL-1:1] := 0

CMPSD (128-bit Legacy SSE Version)


CMP0 := DEST[63:0] OP3 SRC[63:0];
IF CMP0 = TRUE
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0000000000000000H; FI;
DEST[MAXVL-1:64] (Unmodified)

VCMPSD (VEX.128 Encoded Version)


CMP0 := SRC1[63:0] OP5 SRC2[63:0];
IF CMP0 = TRUE
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0000000000000000H; FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCMPSD __mmask8 _mm_cmp_sd_mask( __m128d a, __m128d b, int imm);
VCMPSD __mmask8 _mm_cmp_round_sd_mask( __m128d a, __m128d b, int imm, int sae);
VCMPSD __mmask8 _mm_mask_cmp_sd_mask( __mmask8 k1, __m128d a, __m128d b, int imm);
VCMPSD __mmask8 _mm_mask_cmp_round_sd_mask( __mmask8 k1, __m128d a, __m128d b, int imm, int sae);
(V)CMPSD __m128d _mm_cmp_sd(__m128d a, __m128d b, const int imm)
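
A usage sketch of the scalar form (assumes AVX for _mm_cmp_sd; sd_is_lt is a hypothetical helper name):

#include <immintrin.h>

/* The low quadword of the CMPSD result is an all-1s/all-0s mask; bit 0 of
   _mm_movemask_pd extracts it as 1 (predicate true) or 0 (false). */
static int sd_is_lt(__m128d a, __m128d b)
{
    __m128d m = _mm_cmp_sd(a, b, _CMP_LT_OS);  /* low lane only */
    return _mm_movemask_pd(m) & 1;
}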

SIMD Floating-Point Exceptions


Invalid if SNaN operand; Invalid if QNaN and the predicate signals #IA on QNaN, as listed in Table 3-8; Denormal.



Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



CMPSS—Compare Scalar Single Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F C2 /r ib A V/V SSE
CMPSS xmm1, xmm2/m32, imm8
Compare low single precision floating-point value in xmm2/m32 and xmm1 using bits 2:0 of imm8 as comparison predicate.

VEX.LIG.F3.0F.WIG C2 /r ib B V/V AVX
VCMPSS xmm1, xmm2, xmm3/m32, imm8
Compare low single precision floating-point value in xmm3/m32 and xmm2 using bits 4:0 of imm8 as comparison predicate.

EVEX.LLIG.F3.0F.W0 C2 /r ib C V/V AVX512F OR AVX10.11
VCMPSS k1 {k2}, xmm2, xmm3/m32{sae}, imm8
Compare low single precision floating-point value in xmm3/m32 and xmm2 using bits 4:0 of imm8 as comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Compares the low single precision floating-point values in the second source operand and the first source operand
and returns the result of the comparison to the destination operand. The comparison predicate operand (imme-
diate operand) specifies the type of comparison performed.
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 32-bit memory location. Bits (MAXVL-1:32) of
the corresponding YMM destination register remain unchanged. The comparison result is a doubleword mask of all
1s (comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 32-bit memory location. The result is stored in the low 32 bits
of the destination operand; bits 127:32 of the destination operand are copied from the first source operand. Bits
(MAXVL-1:128) of the destination ZMM register are zeroed. The comparison result is a doubleword mask of all 1s
(comparison true) or all 0s (comparison false).
EVEX encoded version: The first source operand (second operand) is an XMM register. The second source operand
can be an XMM register or a 32-bit memory location. The destination operand (first operand) is an opmask register.
The comparison result is a single mask bit of 1 (comparison true) or 0 (comparison false), written to the destination
starting from the LSB according to the writemask k2. Bits (MAX_KL-1:1) of the destination register are cleared.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see
Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
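
For example, the unordered and ordered predicates can be probed directly; a minimal sketch assuming AVX intrinsics, where _CMP_UNORD_Q is the conventional constant name for predicate 3:

#include <immintrin.h>

/* With a NaN in either operand, UNORD_Q (predicate 3) yields all 1s in the
   low doubleword and ORD_Q (predicate 7) yields all 0s. */
int is_unordered(__m128 a, __m128 b)
{
    __m128 m = _mm_cmp_ss(a, b, _CMP_UNORD_Q);
    return _mm_cvtsi128_si32(_mm_castps_si128(m)) != 0;
}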

A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
Note that processors with “CPUID.1H:ECX.AVX =0” do not implement the “greater-than”, “greater-than-or-equal”,
“not-greater-than”, and “not-greater-than-or-equal” relation predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPSS instruction, for processors with “CPUID.1H:ECX.AVX =0”. See Table 3-15. The compiler should treat
reserved imm8 values as illegal syntax.
Table 3-15. Pseudo-Op and CMPSS Implementation

Pseudo-Op CMPSS Implementation


CMPEQSS xmm1, xmm2 CMPSS xmm1, xmm2, 0
CMPLTSS xmm1, xmm2 CMPSS xmm1, xmm2, 1
CMPLESS xmm1, xmm2 CMPSS xmm1, xmm2, 2
CMPUNORDSS xmm1, xmm2 CMPSS xmm1, xmm2, 3
CMPNEQSS xmm1, xmm2 CMPSS xmm1, xmm2, 4
CMPNLTSS xmm1, xmm2 CMPSS xmm1, xmm2, 5
CMPNLESS xmm1, xmm2 CMPSS xmm1, xmm2, 6
CMPORDSS xmm1, xmm2 CMPSS xmm1, xmm2, 7

The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less-than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact, as in the sketch below.)
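
A minimal sketch of that operand-swap emulation in intrinsics form (SSE only; register allocation, and hence the destination-preservation concern above, is left to the compiler):

#include <xmmintrin.h>

/* Emulate "a > b" on SSE-only hardware: swap the operands and use the
   less-than predicate (predicate 1). Note that the upper three elements of
   the result come from b, not a, which is why the text above calls for move
   instructions when the full register contents matter. */
static inline __m128 cmpgt_ss_emulated(__m128 a, __m128 b)
{
    return _mm_cmplt_ss(b, a); /* operands swapped */
}
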
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-14, so
software emulation is no longer needed. Compilers and assemblers may implement the following three-operand
pseudo-ops in addition to the four-operand VCMPSS instruction. See Table 3-16, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPSS
instructions in a similar fashion by extending the syntax listed in Table 3-16.
Table 3-16. Pseudo-Op and VCMPSS Implementation

Pseudo-Op CMPSS Implementation


VCMPEQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0
VCMPLTSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1
VCMPLESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 2
VCMPUNORDSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 3
VCMPNEQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 4
VCMPNLTSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 5
VCMPNLESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 6
VCMPORDSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 7
VCMPEQ_UQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 8
VCMPNGESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 9

VCMPNGTSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0AH
VCMPFALSESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0BH
VCMPNEQ_OQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0CH
VCMPGESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0DH
VCMPGTSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0EH
VCMPTRUESS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 0FH
VCMPEQ_OSSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 10H
VCMPLT_OQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 11H
VCMPLE_OQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 12H
VCMPUNORD_SSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 13H
VCMPNEQ_USSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 14H
VCMPNLT_UQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 15H
VCMPNLE_UQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 16H
VCMPORD_SSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 17H
VCMPEQ_USSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 18H
VCMPNGE_UQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 19H
VCMPNGT_UQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1AH
VCMPFALSE_OSSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1BH
VCMPNEQ_OSSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1CH
VCMPGE_OQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1DH
VCMPGT_OQSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1EH
VCMPTRUE_USSS reg1, reg2, reg3 VCMPSS reg1, reg2, reg3, 1FH
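
With intrinsics, the pseudo-op suffixes are typically exposed as predefined immediate constants. A compile-time spot-check sketch, assuming the _CMP_* constant names defined by common compiler headers (a toolchain convention, not part of the ISA):

#include <immintrin.h>

/* Verify that the conventional constants match the imm8 encodings in
   Table 3-16; these asserts are evaluated entirely at compile time. */
_Static_assert(_CMP_EQ_OQ   == 0x00, "VCMPEQSS");
_Static_assert(_CMP_NGE_US  == 0x09, "VCMPNGESS");
_Static_assert(_CMP_GT_OQ   == 0x1E, "VCMPGT_OQSS");
_Static_assert(_CMP_TRUE_US == 0x1F, "VCMPTRUE_USSS");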

Software should ensure VCMPSS is encoded with VEX.L=0. Encoding VCMPSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.

Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;

16: OP5 := EQ_OS;
17: OP5 := LT_OQ;
18: OP5 := LE_OQ;
19: OP5 := UNORD_S;
20: OP5 := NEQ_US;
21: OP5 := NLT_UQ;
22: OP5 := NLE_UQ;
23: OP5 := ORD_S;
24: OP5 := EQ_US;
25: OP5 := NGE_UQ;
26: OP5 := NGT_UQ;
27: OP5 := FALSE_OS;
28: OP5 := NEQ_OS;
29: OP5 := GE_OQ;
30: OP5 := GT_OQ;
31: OP5 := TRUE_US;
DEFAULT: Reserved
ESAC;

VCMPSS (EVEX Encoded Version)


CMP0 := SRC1[31:0] OP5 SRC2[31:0];

IF k2[0] or *no writemask*


THEN IF CMP0 = TRUE
THEN DEST[0] := 1;
ELSE DEST[0] := 0; FI;
ELSE DEST[0] := 0 ; zeroing-masking only
FI;
DEST[MAX_KL-1:1] := 0

CMPSS (128-bit Legacy SSE Version)


CMP0 := DEST[31:0] OP3 SRC[31:0];
IF CMP0 = TRUE
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 00000000H; FI;
DEST[MAXVL-1:32] (Unmodified)

VCMPSS (VEX.128 Encoded Version)


CMP0 := SRC1[31:0] OP5 SRC2[31:0];
IF CMP0 = TRUE
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 00000000H; FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCMPSS __mmask8 _mm_cmp_ss_mask( __m128 a, __m128 b, int imm);
VCMPSS __mmask8 _mm_cmp_round_ss_mask( __m128 a, __m128 b, int imm, int sae);
VCMPSS __mmask8 _mm_mask_cmp_ss_mask( __mmask8 k1, __m128 a, __m128 b, int imm);
VCMPSS __mmask8 _mm_mask_cmp_round_ss_mask( __mmask8 k1, __m128 a, __m128 b, int imm, int sae);
(V)CMPSS __m128 _mm_cmp_ss(__m128 a, __m128 b, const int imm)
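
For the EVEX form, which writes a mask register rather than an XMM register, a minimal sketch (assumes an AVX512F-capable compiler and processor; _CMP_LE_OQ is the conventional name for predicate 12H):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(1.0f);
    __m128 b = _mm_set_ss(2.0f);
    /* VCMPSS k1, xmm, xmm, imm8: the result is a single mask bit. */
    __mmask8 k = _mm_cmp_ss_mask(a, b, _CMP_LE_OQ);
    printf("k[0] = %u\n", (unsigned)(k & 1)); /* 1, since 1.0f <= 2.0f */
    return 0;
}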

SIMD Floating-Point Exceptions


Invalid if SNaN operand, Invalid if QNaN and predicate as listed in Table 3-8, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

COMISD—Compare Scalar Ordered Double Precision Floating-Point Values and Set EFLAGS
66 0F 2F /r
COMISD xmm1, xmm2/m64
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: SSE2.
Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

VEX.LIG.66.0F.WIG 2F /r
VCOMISD xmm1, xmm2/m64
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX.
Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

EVEX.LLIG.66.0F.W1 2F /r
VCOMISD xmm1, xmm2/m64{sae}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1).
Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Compares the double precision floating-point values in the low quadwords of operand 1 (first operand) and operand
2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered,
greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unordered result
is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 64 bit memory location. The COMISD instruc-
tion differs from the UCOMISD instruction in that it signals a SIMD floating-point invalid operation exception (#I)
when a source operand is either a QNaN or SNaN. The UCOMISD instruction signals an invalid operation exception
only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCOMISD is encoded with VEX.L=0. Encoding VCOMISD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation
COMISD (All Versions)
RESULT :=OrderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }

Intel C/C++ Compiler Intrinsic Equivalent
VCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae);
VCOMISD int _mm_comieq_sd (__m128d a, __m128d b)
VCOMISD int _mm_comilt_sd (__m128d a, __m128d b)
VCOMISD int _mm_comile_sd (__m128d a, __m128d b)
VCOMISD int _mm_comigt_sd (__m128d a, __m128d b)
VCOMISD int _mm_comige_sd (__m128d a, __m128d b)
VCOMISD int _mm_comineq_sd (__m128d a, __m128d b)
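
The QNaN signaling difference between COMISD and UCOMISD can be observed through the MXCSR invalid-operation flag (IE, bit 0). A sketch, assuming SIMD exceptions are masked (the default, so #I only sets the flag) and that the compiler emits the instructions rather than folding the comparisons:

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m128d qnan = _mm_set_sd(NAN), one = _mm_set_sd(1.0);
    _mm_setcsr(_mm_getcsr() & ~0x3Fu);   /* clear the MXCSR exception flags */
    (void)_mm_ucomige_sd(qnan, one);     /* UCOMISD: QNaN does not signal #I */
    printf("IE after UCOMISD: %u\n", _mm_getcsr() & 1u);
    (void)_mm_comige_sd(qnan, one);      /* COMISD: QNaN signals #I */
    printf("IE after COMISD:  %u\n", _mm_getcsr() & 1u);
    return 0;
}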

SIMD Floating-Point Exceptions


Invalid (if SNaN or QNaN operands), Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

COMISS—Compare Scalar Ordered Single Precision Floating-Point Values and Set EFLAGS
NP 0F 2F /r
COMISS xmm1, xmm2/m32
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: SSE.
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

VEX.LIG.0F.WIG 2F /r
VCOMISS xmm1, xmm2/m32
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX.
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

EVEX.LLIG.0F.W0 2F /r
VCOMISS xmm1, xmm2/m32{sae}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1).
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Compares the single precision floating-point values in the low doublewords of operand 1 (first operand) and operand
2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered,
greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unordered result
is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location.
The COMISS instruction differs from the UCOMISS instruction in that it signals a SIMD floating-point invalid opera-
tion exception (#I) when a source operand is either a QNaN or SNaN. The UCOMISS instruction signals an invalid
operation exception only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCOMISS is encoded with VEX.L=0. Encoding VCOMISS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation
COMISS (All Versions)
RESULT :=OrderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }

Intel C/C++ Compiler Intrinsic Equivalent
VCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae);
VCOMISS int _mm_comieq_ss (__m128 a, __m128 b)
VCOMISS int _mm_comilt_ss (__m128 a, __m128 b)
VCOMISS int _mm_comile_ss (__m128 a, __m128 b)
VCOMISS int _mm_comigt_ss (__m128 a, __m128 b)
VCOMISS int _mm_comige_ss (__m128 a, __m128 b)
VCOMISS int _mm_comineq_ss (__m128 a, __m128 b)
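
Because an unordered COMISS result sets ZF, PF, and CF all to 1, predicates that require CF=0 report false for NaN inputs without a separate PF check; a brief sketch:

#include <immintrin.h>

/* Returns 1 only when a > b and the operands are ordered: the unordered
   outcome sets CF=1, so the COMISS-based "above" test fails for NaN inputs. */
static inline int scalar_gt(__m128 a, __m128 b)
{
    return _mm_comigt_ss(a, b);
}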

SIMD Floating-Point Exceptions


Invalid (if SNaN or QNaN operands), Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CPUID—CPU Identification
0F A2
CPUID
Op/En: ZO. 64-Bit Mode: Valid. Compat/Leg Mode: Valid.
Returns processor identification and feature information to the EAX, EBX, ECX, and EDX registers, as determined by input entered in EAX (in some cases, ECX as well).

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A

Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can
set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction
operates the same in non-64-bit modes and 64-bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers.1 The
instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well).
For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value
and the Vendor Identification String in the appropriate registers:
MOV EAX, 00H
CPUID
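
In C, the same query can be issued through a compiler helper. A minimal sketch; the <cpuid.h> header and __get_cpuid are GCC/Clang conventions (an assumption about the toolchain, not about the instruction):

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13];
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;                     /* CPUID not supported */
    memcpy(vendor + 0, &ebx, 4);      /* "Genu" */
    memcpy(vendor + 4, &edx, 4);      /* "ineI" */
    memcpy(vendor + 8, &ecx, 4);      /* "ntel" */
    vendor[12] = '\0';
    printf("max basic leaf: %u, vendor: %s\n", eax, vendor);
    return 0;
}
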
Table 3-17 shows information returned, depending on the initial value loaded into the EAX register.
Two types of information are returned: basic and extended function information. If a value entered for CPUID.EAX
is higher than the maximum input value for basic or extended function for that processor then the data for the
highest basic information leaf is returned. For example, using some Intel processors, the following is true:
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. *)
CPUID.EAX = 0AH (* Returns Architectural Performance Monitoring leaf. *)
CPUID.EAX = 0BH (* Returns Extended Topology Enumeration leaf. *)2
CPUID.EAX =1FH (* Returns V2 Extended Topology Enumeration leaf. *)2
CPUID.EAX = 80000008H (* Returns linear/physical address size data. *)
CPUID.EAX = 8000000AH (* INVALID: Returns same information as CPUID.EAX = 0BH. *)
If a value entered for CPUID.EAX is less than or equal to the maximum input value and the leaf is not supported on
that processor then 0 is returned in all the registers.
When CPUID returns the highest basic leaf information as a result of an invalid input EAX value, any dependence
on input ECX value in the basic leaf is honored.
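
A sketch of the recommended range check before querying a leaf, again using the GCC/Clang <cpuid.h> helpers as an assumed toolchain convention:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned a, b, c, d;
    unsigned max_basic = __get_cpuid_max(0, 0);           /* CPUID.0H:EAX */
    unsigned max_ext   = __get_cpuid_max(0x80000000u, 0); /* CPUID.80000000H:EAX */
    printf("max basic leaf: %XH, max extended leaf: %XH\n", max_basic, max_ext);
    if (max_basic >= 0x1F) {
        __cpuid_count(0x1F, 0, a, b, c, d); /* V2 topology leaf, sub-leaf 0 */
        printf("leaf 1FH sub-leaf 0 domain type: %u\n", (c >> 8) & 0xFF);
    }
    return 0;
}
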
CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution
guarantees that any modifications to flags, registers, and memory for previous instructions are completed before
the next instruction is fetched and executed.
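
One common use pairs CPUID's serializing property with RDTSC so that earlier instructions complete before the timestamp is read. A sketch in GNU inline assembly (a toolchain convention; on 32-bit PIC code the EBX clobber needs extra care):

#include <stdint.h>

static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi, eax = 0;
    /* CPUID serializes: prior instructions retire before RDTSC executes. */
    __asm__ volatile("cpuid"
                     : "+a"(eax)
                     :
                     : "ebx", "ecx", "edx", "memory");
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
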
See also:
“Serializing Instructions” in Chapter 10, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 3A.
“Caching Translation Information” in Chapter 4, “Linear-Address Pre-Processing,” in the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 3A.

1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
2. CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of CPUID leaf 1FH before
using leaf 0BH.

Table 3-17. Information Returned by CPUID Instruction
Initial EAX Value / Information Provided about the Processor
Basic CPUID Information
0H EAX Maximum Input Value for Basic CPUID Information.
EBX “Genu”
ECX “ntel”
EDX “ineI”
01H EAX Version Information: Type, Family, Model, and Stepping ID (see Figure 3-6).
EBX Bits 07-00: Brand Index.
Bits 15-08: CLFLUSH line size (Value ∗ 8 = cache line size in bytes; used also by CLFLUSHOPT).
Bits 23-16: Maximum number of addressable IDs for logical processors in this physical package*.
Bits 31-24: Initial APIC ID**.
ECX Feature Information (see Figure 3-7 and Table 3-19).
EDX Feature Information (see Figure 3-8 and Table 3-20).
NOTES:
* The nearest power-of-2 integer that is not smaller than EBX[23:16] is the number of unique initial APIC
IDs reserved for addressing different logical processors in a physical package. This field is only valid if
CPUID.1.EDX.HTT[bit 28]= 1.
** The 8-bit initial APIC ID in EBX[31:24] is replaced by the 32-bit x2APIC ID, available in Leaf 0BH and
Leaf 1FH.
02H EAX Cache and TLB Information (see Table 3-21).
EBX Cache and TLB Information.
ECX Cache and TLB Information.
EDX Cache and TLB Information.
03H EAX Reserved.
EBX Reserved.
ECX Bits 00-31 of 96-bit processor serial number. (Available in Pentium III processor only; otherwise, the value
in this register is reserved.)
EDX Bits 32-63 of 96-bit processor serial number. (Available in Pentium III processor only; otherwise, the value
in this register is reserved.)
NOTES:
Processor serial number (PSN) is not supported in the Pentium 4 processor or later. On all models, use
the PSN flag (returned using CPUID) to check for PSN support before accessing the feature.
CPUID leaves above 2 and below 80000000H are visible only when IA32_MISC_ENABLE[bit 22] has its default value of 0.
Deterministic Cache Parameters Leaf (Initial EAX Value = 04H)
04H NOTES:
Leaf 04H output depends on the initial value in ECX.*
See also: “INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level” on page 257.

EAX Bits 04-00: Cache Type Field.


0 = Null - No more caches.
1 = Data Cache.
2 = Instruction Cache.
3 = Unified Cache.
4-31 = Reserved.

Bits 07-05: Cache Level (starts at 1).
Bit 08: Self Initializing cache level (does not need SW initialization).
Bit 09: Fully Associative cache.
Bits 13-10: Reserved.
Bits 25-14: Maximum number of addressable IDs for logical processors sharing this cache**, ***.
Bits 31-26: Maximum number of addressable IDs for processor cores in the physical
package**, ****, *****.
EBX Bits 11-00: L = System Coherency Line Size**.
Bits 21-12: P = Physical Line partitions**.
Bits 31-22: W = Ways of associativity**.
ECX Bits 31-00: S = Number of Sets**.
EDX Bit 00: Write-Back Invalidate/Invalidate.
0 = WBINVD/INVD from threads sharing this cache acts upon lower level caches for threads sharing this
cache.
1 = WBINVD/INVD is not guaranteed to act upon lower level caches of non-originating threads sharing
this cache.
Bit 01: Cache Inclusiveness.
0 = Cache is not inclusive of lower cache levels.
1 = Cache is inclusive of lower cache levels.
Bit 02: Complex Cache Indexing.
0 = Direct mapped cache.
1 = A complex function is used to index the cache, potentially using all address bits.
Bits 31-03: Reserved = 0.
NOTES:
* If ECX contains an invalid sub leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n+1 is invalid if sub-
leaf n returns EAX[4:0] as 0.
** Add one to the return value to get the result.
***The nearest power-of-2 integer that is not smaller than (1 + EAX[25:14]) is the number of unique ini-
tial APIC IDs reserved for addressing different logical processors sharing this cache.
**** The nearest power-of-2 integer that is not smaller than (1 + EAX[31:26]) is the number of unique
Core_IDs reserved for addressing different processor cores in a physical package. Core ID is a subset of
bits of the initial APIC ID.
***** The returned value is constant for valid initial values in ECX. Valid ECX values start from 0.
MONITOR/MWAIT Leaf (Initial EAX Value = 05H)
05H EAX Bits 15-00: Smallest monitor-line size in bytes (default is processor's monitor granularity).
Bits 31-16: Reserved = 0.
EBX Bits 15-00: Largest monitor-line size in bytes (default is processor's monitor granularity).
Bits 31-16: Reserved = 0.
ECX Bit 00: Enumeration of Monitor-Mwait extensions (beyond EAX and EBX registers) supported.
Bit 01: Supports treating interrupts as break-event for MWAIT, even when interrupts disabled.
Bits 31-02: Reserved.
EDX Bits 03-00: Number of C0* sub C-states supported using MWAIT.
Bits 07-04: Number of C1* sub C-states supported using MWAIT.
Bits 11-08: Number of C2* sub C-states supported using MWAIT.
Bits 15-12: Number of C3* sub C-states supported using MWAIT.
Bits 19-16: Number of C4* sub C-states supported using MWAIT.
Bits 23-20: Number of C5* sub C-states supported using MWAIT.
Bits 27-24: Number of C6* sub C-states supported using MWAIT.
Bits 31-28: Number of C7* sub C-states supported using MWAIT.

NOTE:
* The definition of C0 through C7 states for MWAIT extension are processor-specific C-states, not ACPI C-
states.
Thermal and Power Management Leaf (Initial EAX Value = 06H)
06H EAX Bit 00: Digital temperature sensor is supported if set.
Bit 01: Intel Turbo Boost Technology available (see description of IA32_MISC_ENABLE[38]).
Bit 02: ARAT. APIC-Timer-always-running feature is supported if set.
Bit 03: Reserved.
Bit 04: PLN. Power limit notification controls are supported if set.
Bit 05: ECMD. Clock modulation duty cycle extension is supported if set.
Bit 06: PTM. Package thermal management is supported if set.
Bit 07: HWP. HWP base registers (IA32_PM_ENABLE[bit 0], IA32_HWP_CAPABILITIES, IA32_HWP_REQUEST,
IA32_HWP_STATUS) are supported if set.
Bit 08: HWP_Notification. IA32_HWP_INTERRUPT MSR is supported if set.
Bit 09: HWP_Activity_Window. IA32_HWP_REQUEST[bits 41:32] is supported if set.
Bit 10: HWP_Energy_Performance_Preference. IA32_HWP_REQUEST[bits 31:24] is supported if set.
Bit 11: HWP_Package_Level_Request. IA32_HWP_REQUEST_PKG MSR is supported if set.
Bit 12: Reserved.
Bit 13: HDC. HDC base registers IA32_PKG_HDC_CTL, IA32_PM_CTL1, IA32_THREAD_STALL MSRs are
supported if set.
Bit 14: Intel® Turbo Boost Max Technology 3.0 available.
Bit 15: HWP Capabilities. Highest Performance change is supported if set.
Bit 16: HWP PECI override is supported if set.
Bit 17: Flexible HWP is supported if set.
Bit 18: Fast access mode, low latency, and posted IA32_HWP_REQUEST MSR are supported if set.
Bit 19: HW_FEEDBACK. IA32_HW_FEEDBACK_PTR MSR, IA32_HW_FEEDBACK_CONFIG MSR,
IA32_PACKAGE_THERM_STATUS MSR bit 26, and IA32_PACKAGE_THERM_INTERRUPT MSR bit 25 are
supported if set.
Bit 20: Ignoring Idle Logical Processor HWP request is supported if set.
Bit 21: Reserved.
Bit 22: HWP Control MSR Support. The IA32_HWP_CTL MSR is supported if set.
Bit 23: Intel® Thread Director supported if set. The IA32_HW_FEEDBACK_CHAR and
IA32_HW_FEEDBACK_THREAD_CONFIG MSRs are supported if set.
Bit 24: IA32_THERM_INTERRUPT MSR bit 25 is supported if set.
Bits 31-25: Reserved.
EBX Bits 03-00: Number of Interrupt Thresholds in Digital Thermal Sensor.
Bits 31-04: Reserved.
ECX Bit 00: Hardware Coordination Feedback Capability (Presence of IA32_MPERF and IA32_APERF). The
capability to provide a measure of delivered processor performance (since last reset of the counters), as a
percentage of the expected processor performance when running at the TSC frequency.
Bits 02-01: Reserved = 0.
Bit 03: The processor supports performance-energy bias preference if CPUID.06H:ECX.SETBH[bit 3] is set
and it also implies the presence of a new architectural MSR called IA32_ENERGY_PERF_BIAS (1B0H).
Bits 07-04: Reserved = 0.
Bits 15-08: Number of Intel® Thread Director classes supported by the processor. Information for that
many classes is written into the Intel Thread Director Table by the hardware.
Bits 31-16: Reserved = 0.

EDX Bits 07-00: Bitmap of supported hardware feedback interface capabilities.
0 = When set to 1, indicates support for performance capability reporting.
1 = When set to 1, indicates support for energy efficiency capability reporting.
2-7 = Reserved
Bits 11-08: Enumerates the size of the hardware feedback interface structure in number of 4 KB pages;
add one to the return value to get the result.
Bits 31-16: Index (starting at 0) of this logical processor's row in the hardware feedback interface structure.
Note that on some parts the index may be the same for multiple logical processors. On some parts the
indices may not be contiguous, i.e., there may be unused rows in the hardware feedback interface structure.
NOTE:
Bits 0 and 1 will always be set together.
Structured Extended Feature Flags Enumeration Leaf (Initial EAX Value = 07H, ECX = 0)
07H EAX Bits 31-00: Reports the maximum input value for supported leaf 7 sub-leaves.
EBX Bit 00: FSGSBASE. Supports RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE if 1.
Bit 01: IA32_TSC_ADJUST MSR is supported if 1.
Bit 02: SGX. Supports Intel® Software Guard Extensions (Intel® SGX Extensions) if 1.
Bit 03: BMI1.
Bit 04: HLE.
Bit 05: AVX2. Supports Intel® Advanced Vector Extensions 2 (Intel® AVX2) if 1.
Bit 06: FDP_EXCPTN_ONLY. x87 FPU Data Pointer updated only on x87 exceptions if 1.
Bit 07: SMEP. Supports Supervisor-Mode Execution Prevention if 1.
Bit 08: BMI2.
Bit 09: Supports Enhanced REP MOVSB/STOSB if 1.
Bit 10: INVPCID. If 1, supports INVPCID instruction for system software that manages process-context
identifiers.
Bit 11: RTM.
Bit 12: RDT-M. Supports Intel® Resource Director Technology (Intel® RDT) Monitoring capability if 1.
Bit 13: Deprecates FPU CS and FPU DS values if 1.
Bit 14: MPX. Supports Intel® Memory Protection Extensions if 1.
Bit 15: RDT-A. Supports Intel® Resource Director Technology (Intel® RDT) Allocation capability if 1.
Bit 16: AVX512F.
Bit 17: AVX512DQ.
Bit 18: RDSEED.
Bit 19: ADX.
Bit 20: SMAP. Supports Supervisor-Mode Access Prevention (and the CLAC/STAC instructions) if 1.
Bit 21: AVX512_IFMA.
Bit 22: Reserved.
Bit 23: CLFLUSHOPT.
Bit 24: CLWB.
Bit 25: Intel Processor Trace.
Bit 26: AVX512PF. (Intel® Xeon Phi™ only.)
Bit 27: AVX512ER. (Intel® Xeon Phi™ only.)
Bit 28: AVX512CD.
Bit 29: SHA. supports Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) if 1.
Bit 30: AVX512BW.
Bit 31: AVX512VL.

ECX Bit 00: PREFETCHWT1. (Intel® Xeon Phi™ only.)
Bit 01: AVX512_VBMI.
Bit 02: UMIP. Supports user-mode instruction prevention if 1.
Bit 03: PKU. Supports protection keys for user-mode pages if 1.
Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable protection keys (and the RDPKRU/WRPKRU instructions).
Bit 05: WAITPKG.
Bit 06: AVX512_VBMI2.
Bit 07: CET_SS. Supports CET shadow stack features if 1. Processors that set this bit define bits 1:0 of the
IA32_U_CET and IA32_S_CET MSRs. Enumerates support for the following MSRs:
IA32_INTERRUPT_SPP_TABLE_ADDR, IA32_PL3_SSP, IA32_PL2_SSP, IA32_PL1_SSP, and IA32_PL0_SSP.
Bit 08: GFNI.
Bit 09: VAES.
Bit 10: VPCLMULQDQ.
Bit 11: AVX512_VNNI.
Bit 12: AVX512_BITALG.
Bit 13: TME_EN. If 1, the following MSRs are supported: IA32_TME_CAPABILITY, IA32_TME_ACTIVATE,
IA32_TME_EXCLUDE_MASK, and IA32_TME_EXCLUDE_BASE.
Bit 14: AVX512_VPOPCNTDQ.
Bit 15: Reserved.
Bit 16: LA57. Supports 57-bit linear addresses and five-level paging if 1.
Bits 21-17: The value of MAWAU used by the BNDLDX and BNDSTX instructions in 64-bit mode.
Bit 22: RDPID and IA32_TSC_AUX are available if 1.
Bit 23: KL. Supports Key Locker if 1.
Bit 24: BUS_LOCK_DETECT. If 1, indicates support for OS bus-lock detection.
Bit 25: CLDEMOTE. Supports cache line demote if 1.
Bit 26: Reserved.
Bit 27: MOVDIRI. Supports MOVDIRI if 1.
Bit 28: MOVDIR64B. Supports MOVDIR64B if 1.
Bit 29: ENQCMD. Supports Enqueue Stores if 1.
Bit 30: SGX_LC. Supports SGX Launch Configuration if 1.
Bit 31: PKS. Supports protection keys for supervisor-mode pages if 1.
EDX Bit 00: Reserved.
Bit 01: SGX-KEYS. If 1, Attestation Services for Intel® SGX is supported.
Bit 02: AVX512_4VNNIW. (Intel® Xeon Phi™ only.)
Bit 03: AVX512_4FMAPS. (Intel® Xeon Phi™ only.)
Bit 04: Fast Short REP MOV.
Bit 05: UINTR. If 1, the processor supports user interrupts.
Bits 07-06: Reserved.
Bit 08: AVX512_VP2INTERSECT.
Bit 09: SRBDS_CTRL. If 1, enumerates support for the IA32_MCU_OPT_CTRL MSR and indicates its bit 0
(RNGDS_MITG_DIS) is also supported.
Bit 10: MD_CLEAR supported.
Bit 11: RTM_ALWAYS_ABORT. If set, any execution of XBEGIN immediately aborts and transitions to the
specified fallback address.
Bit 12: Reserved.
Bit 13: If 1, RTM_FORCE_ABORT supported. Processors that set this bit support the
IA32_TSX_FORCE_ABORT MSR. They allow software to set IA32_TSX_FORCE_ABORT[0] (RTM_FORCE_ABORT).
Bit 14: SERIALIZE.
Bit 15: Hybrid. If 1, the processor is identified as a hybrid part. If CPUID.0.MAXLEAF ≥ 1AH and
CPUID.1A.EAX ≠ 0, then the Native Model ID Enumeration Leaf 1AH exists.
Bit 16: TSXLDTRK. If 1, the processor supports Intel TSX suspend/resume of load address tracking.

Bit 17: Reserved.
Bit 18: PCONFIG. Supports PCONFIG if 1.
Bit 19: Architectural LBRs. If 1, indicates support for architectural LBRs.
Bit 20: CET_IBT. Supports CET indirect branch tracking features if 1. Processors that set this bit define bits
5:2 and bits 63:10 of the IA32_U_CET and IA32_S_CET MSRs.
Bit 21: Reserved.
Bit 22: AMX-BF16. If 1, the processor supports tile computational operations on bfloat16 numbers.
Bit 23: AVX512_FP16.
Bit 24: AMX-TILE. If 1, the processor supports tile architecture.
Bit 25: AMX-INT8. If 1, the processor supports tile computational operations on 8-bit integers.
Bit 26: Enumerates support for indirect branch restricted speculation (IBRS) and the indirect branch
predictor barrier (IBPB). Processors that set this bit support the IA32_SPEC_CTRL MSR and the
IA32_PRED_CMD MSR. They allow software to set IA32_SPEC_CTRL[0] (IBRS) and IA32_PRED_CMD[0]
(IBPB).
Bit 27: Enumerates support for single thread indirect branch predictors (STIBP). Processors that set this
bit support the IA32_SPEC_CTRL MSR. They allow software to set IA32_SPEC_CTRL[1] (STIBP).
Bit 28: Enumerates support for L1D_FLUSH. Processors that set this bit support the IA32_FLUSH_CMD
MSR. They allow software to set IA32_FLUSH_CMD[0] (L1D_FLUSH).
Bit 29: Enumerates support for the IA32_ARCH_CAPABILITIES MSR.
Bit 30: Enumerates support for the IA32_CORE_CAPABILITIES MSR.
IA32_CORE_CAPABILITIES is an architectural MSR that enumerates model-specific features. A bit being
set in this MSR indicates that a model specific feature is supported; software must still consult CPUID
family/model/stepping to determine the behavior of the enumerated feature as features enumerated in
IA32_CORE_CAPABILITIES may have different behavior on different processor models. Some of these
features may have behavior that is consistent across processor models (and for which consultation of
CPUID family/model/stepping is not necessary); such features are identified explicitly where they are
documented in this manual.
Bit 31: Enumerates support for Speculative Store Bypass Disable (SSBD). Processors that set this bit sup-
port the IA32_SPEC_CTRL MSR. They allow software to set IA32_SPEC_CTRL[2] (SSBD).
NOTE:
* If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n
exceeds the value that sub-leaf 0 returns in EAX.
Structured Extended Feature Enumeration Sub-leaf (Initial EAX Value = 07H, ECX = 1)
07H NOTES:
Leaf 07H output depends on the initial value in ECX.
If ECX contains an invalid sub leaf index, EAX/EBX/ECX/EDX return 0.
EAX This field reports 0 if the sub-leaf index, 1, is invalid.
Bit 00: SHA512. If 1, supports the SHA512 instructions.
Bit 01: SM3. If 1, supports the SM3 instructions.
Bit 02: SM4. If 1, supports the SM4 instructions.
Bit 03: Reserved.
Bit 04: AVX-VNNI. AVX (VEX-encoded) versions of the Vector Neural Network Instructions.
Bit 05: AVX512_BF16. Vector Neural Network Instructions supporting BFLOAT16 inputs and conversion
instructions from IEEE single precision.
Bit 06: LASS. If 1, supports Linear Address Space Separation.
Bit 07: CMPCCXADD. If 1, supports the CMPccXADD instruction.
Bit 08: ArchPerfmonExt. If 1, supports ArchPerfmonExt. When set, indicates that the Architectural
Performance Monitoring Extended Leaf (EAX = 23H) is valid.
Bit 09: Reserved.
Bit 10: If 1, supports fast zero-length REP MOVSB.
Bit 11: If 1, supports fast short REP STOSB.
Bit 12: If 1, supports fast short REP CMPSB, REP SCASB.

Bits 18-13: Reserved.
Bit 19: WRMSRNS. If 1, supports the WRMSRNS instruction.
Bit 20: Reserved.
Bit 21: AMX-FP16. If 1, the processor supports tile computational operations on FP16 numbers.
Bit 22: HRESET. If 1, supports history reset via the HRESET instruction and the IA32_HRESET_ENABLE
MSR. When set, indicates that the Processor History Reset Leaf (EAX = 20H) is valid.
Bit 23: AVX-IFMA. If 1, supports the AVX-IFMA instructions.
Bits 25-24: Reserved.
Bit 26: LAM. If 1, supports Linear Address Masking.
Bit 27: MSRLIST. If 1, supports the RDMSRLIST and WRMSRLIST instructions and the IA32_BARRIER MSR.
Bits 29-28: Reserved.
Bit 30: INVD_DISABLE_POST_BIOS_DONE. If 1, supports INVD execution prevention after BIOS Done.
Bit 31: Reserved.
EBX This field reports 0 if the sub-leaf index, 1, is invalid.
Bit 00: Enumerates the presence of the IA32_PPIN and IA32_PPIN_CTL MSRs. If 1, these MSRs are supported.
Bits 02-01: Reserved.
Bit 03: CPUIDMAXVAL_LIM_RMV. If 1, IA32_MISC_ENABLE[bit 22] cannot be set to 1 to limit the value
returned by CPUID.00H:EAX[bits 7:0].
Bits 31-04: Reserved.
ECX This field reports 0 if the sub-leaf index, 1, is invalid; otherwise it is reserved.
EDX This field reports 0 if the sub-leaf index, 1, is invalid.
Bits 03-00: Reserved.
Bit 04: AVX-VNNI-INT8. If 1, supports the AVX-VNNI-INT8 instructions.
Bit 05: AVX-NE-CONVERT. If 1, supports the AVX-NE-CONVERT instructions.
Bits 09-06: Reserved.
Bit 10: AVX-VNNI-INT16. If 1, supports the AVX-VNNI-INT16 instructions.
Bits 13-11: Reserved.
Bit 14: PREFETCHI. If 1, supports the PREFETCHIT0/1 instructions.
Bits 16-15: Reserved.
Bit 17: UIRET_UIF. If 1, UIRET sets UIF to the value of bit 1 of the RFLAGS image loaded from the stack.
Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor shadow stacks as long as
it ensures that a supervisor shadow stack cannot become prematurely busy due to page faults (see
Section 17.2.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). When
emulating the CPUID instruction, a virtual-machine monitor (VMM) should return this bit as 1 only if it
ensures that VM exits cannot cause a guest supervisor shadow stack to appear to be prematurely busy.
Such a VMM could set the “prematurely busy shadow stack” VM-exit control and use the additional
information that it provides.
Bit 19: AVX10. If 1, supports the Intel® AVX10 instructions and indicates the presence of CPUID Leaf 24H,
which enumerates version number and supported vector lengths.
Bits 31-20: Reserved.
Structured Extended Feature Enumeration Sub-leaf (Initial EAX Value = 07H, ECX = 2)
07H NOTES:
Leaf 07H output depends on the initial value in ECX.
If ECX contains an invalid sub leaf index, EAX/EBX/ECX/EDX return 0.
EAX This field reports 0 if the sub-leaf index, 2, is invalid; otherwise it is reserved.
EBX This field reports 0 if the sub-leaf index, 2, is invalid; otherwise it is reserved.
ECX This field reports 0 if the sub-leaf index, 2, is invalid; otherwise it is reserved.

EDX This field reports 0 if the sub-leaf index, 2, is invalid.
Bit 00: PSFD. If 1, indicates bit 7 of the IA32_SPEC_CTRL MSR is supported. Bit 7 of this MSR disables Fast
Store Forwarding Predictor without disabling Speculative Store Bypass.
Bit 01: IPRED_CTRL. If 1, indicates bits 3 and 4 of the IA32_SPEC_CTRL MSR are supported. Bit 3 of this
MSR enables IPRED_DIS control for CPL3. Bit 4 of this MSR enables IPRED_DIS control for CPL0/1/2.
Bit 02: RRSBA_CTRL. If 1, indicates bits 5 and 6 of the IA32_SPEC_CTRL MSR are supported. Bit 5 of this
MSR disables RRSBA behavior for CPL3. Bit 6 of this MSR disables RRSBA behavior for CPL0/1/2.
Bit 03: DDPD_U. If 1, indicates bit 8 of the IA32_SPEC_CTRL MSR is supported. Bit 8 of this MSR disables
Data Dependent Prefetcher.
Bit 04: BHI_CTRL. If 1, indicates bit 10 of the IA32_SPEC_CTRL MSR is supported. Bit 10 of this MSR
enables BHI_DIS_S behavior.
Bit 05: MCDT_NO. Processors that enumerate this bit as 1 do not exhibit MXCSR Configuration Dependent
Timing (MCDT) behavior and do not need to be mitigated to avoid data-dependent behavior for certain
instructions.
Bit 06: If 1, supports the UC-lock disable feature and it causes #AC.
Bit 07: MONITOR_MITG_NO. If 1, indicates that the MONITOR/UMONITOR instructions are not affected by
performance or power issues due to MONITOR/UMONITOR instructions exceeding the capacity of an
internal monitor tracking table. If 0, then the product may be affected by this issue.
Bits 31-08: Reserved.
Direct Cache Access Information Leaf (Initial EAX Value = 09H)
09H EAX Value of bits [31:0] of IA32_PLATFORM_DCA_CAP MSR (address 1F8H).
EBX Reserved.
ECX Reserved.
EDX Reserved.
Architectural Performance Monitoring Leaf (Initial EAX Value = 0AH)
0AH EAX Bits 07-00: Version ID of architectural performance monitoring.
Bits 15-08: Number of general-purpose performance monitoring counter per logical processor.
Bits 23-16: Bit width of general-purpose, performance monitoring counter.
Bits 31-24: Length of EBX bit vector to enumerate architectural performance monitoring events.
Architectural event x is supported if EBX[x]=0 && EAX[31:24]>x.
EBX Bit 00: Core cycle event not available if 1 or if EAX[31:24]<1.
Bit 01: Instruction retired event not available if 1 or if EAX[31:24]<2.
Bit 02: Reference cycles event not available if 1 or if EAX[31:24]<3.
Bit 03: Last-level cache reference event not available if 1 or if EAX[31:24]<4.
Bit 04: Last-level cache misses event not available if 1 or if EAX[31:24]<5.
Bit 05: Branch instruction retired event not available if 1 or if EAX[31:24]<6.
Bit 06: Branch mispredict retired event not available if 1 or if EAX[31:24]<7.
Bit 07: Top-down slots event not available if 1 or if EAX[31:24]<8.
Bits 31-08: Reserved = 0.
ECX Bits 31-00: Supported fixed counters bit mask. Fixed-function performance counter 'i' is supported if bit ‘i’
is 1 (first counter index starts at zero). It is recommended to use the following logic to determine if a
Fixed Counter is supported: FxCtr[i]_is_supported := ECX[i] || (EDX[4:0] > i);
EDX Bits 04-00: Number of contiguous fixed-function performance counters starting from 0 (if Version ID > 1).
Bits 12-05: Bit width of fixed-function performance counters (if Version ID > 1).
Bits 14-13: Reserved = 0.
Bit 15: AnyThread deprecation.
Bits 31-16: Reserved = 0.

Extended Topology Enumeration Leaf (Initial EAX Value = 0BH, ECX ≥ 0)
0BH NOTES:
CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence
of Leaf 1FH before using leaf 0BH.
The sub-leaves of CPUID leaf 0BH describe an ordered hierarchy of logical processors starting from the
smallest-scoped domain of a Logical Processor (sub-leaf index 0) to the Core domain (sub-leaf index 1)
to the largest-scoped domain (the last valid sub-leaf index) that is implicitly subordinate to the
unenumerated highest-scoped domain of the processor package (socket).
The details of each valid domain are enumerated by a corresponding sub-leaf. Details for a domain include
its type and how all instances of that domain determine the number of logical processors and x2 APIC
ID partitioning at the next higher-scoped domain. The ordering of domains within the hierarchy is fixed
architecturally as shown below. For a given processor, not all domains may be relevant or enumerated;
however, the logical processor and core domains are always enumerated.
For two valid sub-leaves N and N+1, sub-leaf N+1 represents the next immediate higher-scoped
domain with respect to the domain of sub-leaf N for the given processor.
If sub-leaf index “N” returns an invalid domain type in ECX[15:08] (00H), then all sub-leaves with an
index greater than “N” shall also return an invalid domain type. A sub-leaf returning an invalid domain
always returns 0 in EAX and EBX.
EAX Bits 04-00: The number of bits that the x2APIC ID must be shifted to the right to address instances of the
next higher-scoped domain. When logical processor is not supported by the processor, the value of this
field at the Logical Processor domain sub-leaf may be returned as either 0 (no allocated bits in the x2APIC
ID) or 1 (one allocated bit in the x2APIC ID); software should plan accordingly.
Bits 31-05: Reserved.
EBX Bits 15-00: The number of logical processors across all instances of this domain within the next higher-
scoped domain. (For example, in a processor socket/package comprising “M” dies of “N” cores each, where
each core has “L” logical processors, the “die” domain sub-leaf value of this field would be M*N*L.) This
number reflects configuration as shipped by Intel. Note, software must not use this field to enumerate
processor topology*.
Bits 31-16: Reserved.
ECX Bits 07-00: The input ECX sub-leaf index.
Bits 15-08: Domain Type. This field provides an identification value which indicates the domain as shown
below. Although domains are ordered, their assigned identification values are not and software should
not depend on it.

Hierarchy Domain Domain Type Identification Value


Lowest Logical Processor 1
Highest Core 2
(Note that enumeration values of 0 and 3-255 are reserved.)

Bits 31-16: Reserved.


EDX Bits 31-00: x2APIC ID of the current logical processor.
NOTES:
* Software must not use the value of EBX[15:0] to enumerate processor topology of the system. The
value is only intended for display and diagnostic purposes. The actual number of logical processors
available to BIOS/OS/Applications may be different from the value of EBX[15:0], depending on software
and platform hardware configurations.
Processor Extended State Enumeration Main Leaf (Initial EAX Value = 0DH, ECX = 0)
0DH NOTES:
Leaf 0DH main leaf (ECX = 0).

EAX Bits 31-00: Reports the supported bits of the lower 32 bits of XCR0. XCR0[n] can be set to 1 only if
EAX[n] is 1.
Bit 00: x87 state.
Bit 01: SSE state.
Bit 02: AVX state.
Bits 04-03: MPX state.
Bits 07-05: AVX-512 state.
Bit 08: Used for IA32_XSS.
Bit 09: PKRU state.
Bits 16-10: Used for IA32_XSS.
Bit 17: TILECFG state.
Bit 18: TILEDATA state.
Bits 31-19: Reserved.
EBX Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by
enabled features in XCR0. May be different than ECX if some features at the end of the XSAVE save area
are not enabled.
ECX Bit 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) of the
XSAVE/XRSTOR save area required by all supported features in the processor, i.e., all the valid bit fields in
XCR0.
EDX Bit 31-00: Reports the supported bits of the upper 32 bits of XCR0. XCR0[n+32] can be set to 1 only if
EDX[n] is 1.
Bits 31-00: Reserved.
Processor Extended State Enumeration Sub-leaf (Initial EAX Value = 0DH, ECX = 1)
0DH EAX Bit 00: XSAVEOPT is available.
Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set.
Bit 02: Supports XGETBV with ECX = 1 if set.
Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set.
Bit 04: Supports extended feature disable (XFD) if set.
Bits 31-05: Reserved.
EBX Bits 31-00: The size in bytes of the XSAVE area containing all states enabled by XCR0 | IA32_XSS.
NOTES:
If EAX[3] is enumerated as 0 and EAX[1] is enumerated as 1, EBX enumerates the size of the XSAVE area
containing all states enabled by XCR0. If EAX[1] and EAX[3] are both enumerated as 0, EBX enumerates
zero.
ECX Bits 31-00: Reports the supported bits of the lower 32 bits of the IA32_XSS MSR. IA32_XSS[n] can be
set to 1 only if ECX[n] is 1.
Bits 07-00: Used for XCR0.
Bit 08: PT state.
Bit 09: Used for XCR0.
Bit 10: PASID state.
Bit 11: CET user state.
Bit 12: CET supervisor state.
Bit 13: HDC state.
Bit 14: UINTR state.
Bit 15: LBR state (only for the architectural LBR feature).
Bit 16: HWP state.
Bits 18-17: Used for XCR0.
Bits 31-19: Reserved.
EDX Bits 31-00: Reports the supported bits of the upper 32 bits of the IA32_XSS MSR. IA32_XSS[n+32] can
be set to 1 only if EDX[n] is 1.
Bits 31-00: Reserved.

Processor Extended State Enumeration Sub-leaves (Initial EAX Value = 0DH, ECX = n, n > 1)
0DH NOTES:
Leaf 0DH output depends on the initial value in ECX.
Each sub-leaf index (starting at position 2) is supported if it corresponds to a supported bit in either the
XCR0 register or the IA32_XSS MSR.
* If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf n (0 ≤ n ≤ 31) is invalid
if sub-leaf 0 returns 0 in EAX[n] and sub-leaf 1 returns 0 in ECX[n]. Sub-leaf n (32 ≤ n ≤ 63) is invalid if
sub-leaf 0 returns 0 in EDX[n-32] and sub-leaf 1 returns 0 in EDX[n-32].
EAX Bits 31-00: The size in bytes (from the offset specified in EBX) of the save area for an extended state
feature associated with a valid sub-leaf index, n.
EBX Bits 31-00: The offset in bytes of this extended state component’s save area from the beginning of the
XSAVE/XRSTOR area.
This field reports 0 if the sub-leaf index, n, does not map to a valid bit in the XCR0 register*.
ECX Bit 00 is set if the bit n (corresponding to the sub-leaf index) is supported in the IA32_XSS MSR; it is clear
if bit n is instead supported in XCR0.
Bit 01 is set if, when the compacted format of an XSAVE area is used, this extended state component is
located on the next 64-byte boundary following the preceding state component (otherwise, it is located
immediately following the preceding state component).
Bits 31-02 are reserved.
This field reports 0 if the sub-leaf index, n, is invalid*.
EDX This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved.
Intel® Resource Director Technology (Intel® RDT) Monitoring Enumeration Sub-leaf (Initial EAX Value = 0FH, ECX = 0)
0FH NOTES:
Leaf 0FH output depends on the initial value in ECX.
Sub-leaf index 0 reports valid resource type starting at bit position 1 of EDX.
EAX Reserved.
EBX Bits 31-00: Maximum range (zero-based) of RMID within this physical processor of all types.
ECX Reserved.
EDX Bit 00: Reserved.
Bit 01: Supports L3 Cache Intel RDT Monitoring if 1.
Bits 31-02: Reserved.
L3 Cache Intel® RDT Monitoring Capability Enumeration Sub-leaf (Initial EAX Value = 0FH, ECX = 1)
0FH NOTES:
Leaf 0FH output depends on the initial value in ECX.
EAX Bits 07-00: The counter width is encoded as an offset from 24b. A value of zero in this field indicates that
24-bit counters are supported. A value of 8 in this field indicates that 32-bit counters are supported.
Bit 08: If 1, indicates the presence of an overflow bit in the IA32_QM_CTR MSR (bit 61).
Bit 09: If 1, indicates the presence of non-CPU agent Intel RDT CMT support.
Bit 10: If 1, indicates the presence of non-CPU agent Intel RDT MBM support.
Bits 31-11: Reserved.
EBX Bits 31-00: Conversion factor from reported IA32_QM_CTR value to occupancy metric (bytes) and
Memory Bandwidth Monitoring (MBM) metrics.
ECX Maximum range (zero-based) of RMID of this resource type.

EDX Bit 00: Supports L3 occupancy monitoring if 1.
Bit 01: Supports L3 Total Bandwidth monitoring if 1.
Bit 02: Supports L3 Local Bandwidth monitoring if 1.
Bits 31-03: Reserved.
Intel® Resource Director Technology (Intel® RDT) Allocation Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = 0)
10H NOTES:
Leaf 10H output depends on the initial value in ECX.
Sub-leaf index 0 reports valid resource identification (ResID) starting at bit position 1 of EBX.
EAX Reserved.
EBX Bit 00: Reserved.
Bit 01: Supports L3 Cache Allocation Technology if 1.
Bit 02: Supports L2 Cache Allocation Technology if 1.
Bit 03: Supports Memory Bandwidth Allocation if 1.
Bits 31-04: Reserved.
ECX Reserved.
EDX Reserved.
L3 Cache Allocation Technology Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = ResID =1)
10H NOTES:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 04-00: Length of the capacity bit mask for the corresponding ResID. Add one to the return value to
get the result.
Bits 31-05: Reserved.
EBX Bits 31-00: Bit-granular map of isolation/contention of allocation units.
ECX Bit 00: Reserved.
Bit 01: If 1, indicates L3 CAT for non-CPU agents is supported.
Bit 02: If 1, indicates L3 Code and Data Prioritization Technology is supported.
Bit 03: If 1, indicates non-contiguous capacity bitmask is supported. The bits that are set in the various
IA32_L3_MASK_n registers do not have to be contiguous.
Bits 31-04: Reserved.
EDX Bits 15-00: Highest Class of Service (CLOS) number supported for this ResID.
Bits 31-16: Reserved.
L2 Cache Allocation Technology Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = ResID =2)
10H NOTES:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 04-00: Length of the capacity bit mask for the corresponding ResID. Add one to the return value to
get the result.
Bits 31-05: Reserved.
EBX Bits 31-00: Bit-granular map of isolation/contention of allocation units.
ECX Bits 01-00: Reserved.
Bit 02: CDP. If 1, indicates L2 Code and Data Prioritization Technology is supported.
Bit 03: If 1, indicates non-contiguous capacity bitmask is supported. The bits that are set in the various
IA32_L2_MASK_n registers do not have to be contiguous.
Bits 31-04: Reserved.
EDX Bits 15-00: Highest CLOS number supported for this ResID.
Bits 31-16: Reserved.

Memory Bandwidth Allocation Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = ResID =3)
10H NOTES:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 11-00: Reports the maximum MBA throttling value supported for the corresponding ResID. Add one
to the return value to get the result.
Bits 31-12: Reserved.
EBX Bits 31-00: Reserved.
ECX Bits 01-00: Reserved.
Bit 02: Reports whether the response of the delay values is linear.
Bits 31-03: Reserved.
EDX Bits 15-00: Highest CLOS number supported for this ResID.
Bits 31-16: Reserved.
Intel® SGX Capability Enumeration Leaf, Sub-leaf 0 (Initial EAX Value = 12H, ECX = 0)
12H NOTES:
Leaf 12H sub-leaf 0 (ECX = 0) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
EAX Bit 00: SGX1. If 1, indicates Intel SGX supports the collection of SGX1 leaf functions.
Bit 01: SGX2. If 1, indicates Intel SGX supports the collection of SGX2 leaf functions.
Bits 04-02: Reserved.
Bit 05: If 1, indicates Intel SGX supports ENCLV instruction leaves EINCVIRTCHILD, EDECVIRTCHILD, and
ESETCONTEXT.
Bit 06: If 1, indicates Intel SGX supports ENCLS instruction leaves ETRACKC, ERDINFO, ELDBC, and ELDUC.
Bit 07: If 1, indicates Intel SGX supports ENCLU instruction leaf EVERIFYREPORT2.
Bits 09-08: Reserved.
Bit 10: If 1, indicates Intel SGX supports ENCLS instruction leaf EUPDATESVN.
Bit 11: If 1, indicates Intel SGX supports ENCLU instruction leaf EDECCSSA.
Bits 31-12: Reserved.
EBX Bits 31-00: MISCSELECT. Bit vector of supported extended SGX features.
ECX Bits 31-00: Reserved.
EDX Bits 07-00: MaxEnclaveSize_Not64. The maximum supported enclave size in non-64-bit mode is
2^(EDX[7:0]).
Bits 15-08: MaxEnclaveSize_64. The maximum supported enclave size in 64-bit mode is 2^(EDX[15:8]).
Bits 31-16: Reserved.
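For example, the maximum enclave sizes follow directly from the EDX fields. A minimal sketch, assuming a GCC/Clang toolchain providing <cpuid.h> and that SGX support has already been confirmed via CPUID.(EAX=07H, ECX=0H):EBX[SGX]:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x12, 0, &eax, &ebx, &ecx, &edx))
            return 1;
        /* EDX[7:0] and EDX[15:8] are log2 of the maximum enclave sizes.
           Enumerated values are well below 64, so the shifts do not overflow. */
        unsigned long long not64 = 1ULL << (edx & 0xFF);
        unsigned long long is64  = 1ULL << ((edx >> 8) & 0xFF);
        printf("Max enclave size: %llu bytes (non-64-bit), %llu bytes (64-bit)\n",
               not64, is64);
        return 0;
    }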
Intel SGX Attributes Enumeration Leaf, Sub-leaf 1 (Initial EAX Value = 12H, ECX = 1)
12H NOTES:
Leaf 12H sub-leaf 1 (ECX = 1) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
EAX Bits 31-00: Reports the valid bits of SECS.ATTRIBUTES[31:0] that software can set with ECREATE.
EBX Bits 31-00: Reports the valid bits of SECS.ATTRIBUTES[63:32] that software can set with ECREATE.
ECX Bits 31-00: Reports the valid bits of SECS.ATTRIBUTES[95:64] that software can set with ECREATE.
EDX Bits 31-00: Reports the valid bits of SECS.ATTRIBUTES[127:96] that software can set with ECREATE.
Intel® SGX EPC Enumeration Leaf, Sub-leaves (Initial EAX Value = 12H, ECX = 2 or higher)
12H NOTES:
Leaf 12H sub-leaf 2 or higher (ECX >= 2) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
For sub-leaves (ECX = 2 or higher), definition of EDX,ECX,EBX,EAX[31:4] depends on the sub-leaf type
listed below.

EAX Bit 03-00: Sub-leaf Type
0000b: Indicates this sub-leaf is invalid.
0001b: This sub-leaf enumerates an EPC section. EBX:EAX and EDX:ECX provide information on the
Enclave Page Cache (EPC) section.
All other type encodings are reserved.
Type 0000b. This sub-leaf is invalid.
EDX:ECX:EBX:EAX return 0.
Type 0001b. This sub-leaf enumerates an EPC section with EDX:ECX and EBX:EAX defined as follows.
EAX[11:04]: Reserved (enumerate 0).
EAX[31:12]: Bits 31:12 of the physical address of the base of the EPC section.

EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section.
EBX[31:20]: Reserved.

ECX[03:00]: EPC section property encoding defined as follows:


If ECX[3:0] = 0000b, then all bits of the EDX:ECX pair are enumerated as 0.
If ECX[3:0] = 0001b, then this section has confidentiality, integrity, and replay protection.
If ECX[3:0] = 0010b, then this section has confidentiality protection only.
If ECX[3:0] = 0011b, then this section has confidentiality and integrity protection.
All other encodings are reserved.
ECX[11:04]: Reserved (enumerate 0).
ECX[31:12]: Bits 31:12 of the size of the corresponding EPC section within the Processor Reserved
Memory.

EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor Reserved
Memory.
EDX[31:20]: Reserved.
Intel® Processor Trace Enumeration Main Leaf (Initial EAX Value = 14H, ECX = 0)
14H NOTES:
Leaf 14H main leaf (ECX = 0).
EAX Bits 31-00: Reports the maximum sub-leaf supported in leaf 14H.
EBX Bit 00: If 1, indicates that IA32_RTIT_CTL.CR3Filter can be set to 1, and that IA32_RTIT_CR3_MATCH MSR
can be accessed.
Bit 01: If 1, indicates support of Configurable PSB and Cycle-Accurate Mode.
Bit 02: If 1, indicates support of IP Filtering, TraceStop filtering, and preservation of Intel PT MSRs across
warm reset.
Bit 03: If 1, indicates support of MTC timing packet and suppression of COFI-based packets.
Bit 04: If 1, indicates support of PTWRITE. Writes can set IA32_RTIT_CTL[12] (PTWEn) and
IA32_RTIT_CTL[5] (FUPonPTW), and PTWRITE can generate packets.
Bit 05: If 1, indicates support of Power Event Trace. Writes can set IA32_RTIT_CTL[4] (PwrEvtEn),
enabling Power Event Trace packet generation.
Bit 06: If 1, indicates support for PSB and PMI preservation. Writes can set IA32_RTIT_CTL[56]
(InjectPsbPmiOnEnable), enabling the processor to set IA32_RTIT_STATUS[7] (PendTopaPMI) and/or
IA32_RTIT_STATUS[6] (PendPSB) in order to preserve ToPA PMIs and/or PSBs otherwise lost due to Intel
PT disable. Writes can also set PendTopaPMI and PendPSB.
Bit 07: If 1, writes can set IA32_RTIT_CTL[31] (EventEn), enabling Event Trace packet generation.
Bit 08: If 1, writes can set IA32_RTIT_CTL[55] (DisTNT), disabling TNT packet generation.
Bit 31-09: Reserved.

ECX Bit 00: If 1, Tracing can be enabled with IA32_RTIT_CTL.ToPA = 1, hence utilizing the ToPA output
scheme; IA32_RTIT_OUTPUT_BASE and IA32_RTIT_OUTPUT_MASK_PTRS MSRs can be accessed.
Bit 01: If 1, ToPA tables can hold any number of output entries, up to the maximum allowed by the
MaskOrTableOffset field of IA32_RTIT_OUTPUT_MASK_PTRS.
Bit 02: If 1, indicates support of Single-Range Output scheme.
Bit 03: If 1, indicates support of output to Trace Transport subsystem.
Bit 30-04: Reserved.
Bit 31: If 1, generated packets which contain IP payloads have LIP values, which include the CS base
component.
EDX Bits 31-00: Reserved.
Intel® Processor Trace Enumeration Sub-leaf (Initial EAX Value = 14H, ECX = 1)
14H EAX Bits 02-00: Number of configurable Address Ranges for filtering.
Bits 15-03: Reserved.
Bits 31-16: Bitmap of supported MTC period encodings.
EBX Bits 15-00: Bitmap of supported Cycle Threshold value encodings.
Bit 31-16: Bitmap of supported Configurable PSB frequency encodings.
ECX Bits 31-00: Reserved.
EDX Bits 31-00: Reserved.
Time Stamp Counter and Nominal Core Crystal Clock Information Leaf (Initial EAX Value = 15H)
15H NOTES:
If EBX[31:0] is 0, the TSC/”core crystal clock” ratio is not enumerated.
EBX[31:0]/EAX[31:0] indicates the ratio of the TSC frequency and the core crystal clock frequency.
If ECX is 0, the nominal core crystal clock frequency is not enumerated.
“TSC frequency” = “core crystal clock frequency” * EBX/EAX.
The core crystal clock may differ from the reference clock, bus clock, or core clock frequencies.
EAX Bits 31-00: An unsigned integer which is the denominator of the TSC/”core crystal clock” ratio.
EBX Bits 31-00: An unsigned integer which is the numerator of the TSC/”core crystal clock” ratio.
ECX Bits 31-00: An unsigned integer which is the nominal frequency of the core crystal clock in Hz.
EDX Bits 31-00: Reserved = 0.
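The note above amounts to a simple computation. A minimal sketch, assuming a GCC/Clang toolchain providing <cpuid.h> (illustrative only):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx))
            return 1;
        if (eax == 0 || ebx == 0 || ecx == 0)   /* ratio or crystal frequency not enumerated */
            return 1;
        /* TSC frequency = core crystal clock frequency * EBX / EAX. */
        unsigned long long tsc_hz = (unsigned long long)ecx * ebx / eax;
        printf("Nominal TSC frequency: %llu Hz\n", tsc_hz);
        return 0;
    }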
Processor Frequency Information Leaf (Initial EAX Value = 16H)
16H EAX Bits 15-00: Processor Base Frequency (in MHz).
Bits 31-16: Reserved =0.
EBX Bits 15-00: Maximum Frequency (in MHz).
Bits 31-16: Reserved = 0.
ECX Bits 15-00: Bus (Reference) Frequency (in MHz).
Bits 31-16: Reserved = 0.
EDX Reserved.

NOTES:
* Data is returned from this interface in accordance with the processor's specification and does not reflect
actual values. Suitable use of this data includes the display of processor information in like manner to the
processor brand string and for determining the appropriate range to use when displaying processor
information, e.g., frequency history graphs. The returned information should not be used for any other
purpose as the returned information does not accurately correlate to information / counters returned by
other processor interfaces.

While a processor may support the Processor Frequency Information leaf, fields that return a value of zero
are not supported.
System-On-Chip Vendor Attribute Enumeration Main Leaf (Initial EAX Value = 17H, ECX = 0)
17H NOTES:
Leaf 17H main leaf (ECX = 0).
Leaf 17H output depends on the initial value in ECX.
Leaf 17H sub-leaves 1 through 3 report the SOC Vendor Brand String.
Leaf 17H is valid if MaxSOCID_Index >= 3.
Leaf 17H sub-leaves 4 and above are reserved.

EAX Bits 31-00: MaxSOCID_Index. Reports the maximum input value of supported sub-leaf in leaf 17H.
EBX Bits 15-00: SOC Vendor ID.
Bit 16: IsVendorScheme. If 1, the SOC Vendor ID field is assigned via an industry standard enumeration
scheme. Otherwise, the SOC Vendor ID field is assigned by Intel.
Bits 31-17: Reserved = 0.
ECX Bits 31-00: Project ID. A unique number an SOC vendor assigns to its SOC projects.
EDX Bits 31-00: Stepping ID. A unique number within an SOC project that an SOC vendor assigns.
System-On-Chip Vendor Attribute Enumeration Sub-leaf (Initial EAX Value = 17H, ECX = 1..3)
17H EAX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EBX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
ECX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EDX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
NOTES:
Leaf 17H output depends on the initial value in ECX.
SOC Vendor Brand String is a UTF-8 encoded string padded with trailing bytes of 00H.
The complete SOC Vendor Brand String is constructed by concatenating in ascending order of
EAX:EBX:ECX:EDX and from the sub-leaf 1 fragment towards sub-leaf 3.
System-On-Chip Vendor Attribute Enumeration Sub-leaves (Initial EAX Value = 17H, ECX > MaxSOCID_Index)
17H NOTES:
Leaf 17H output depends on the initial value in ECX.
EAX Bits 31-00: Reserved = 0.
EBX Bits 31-00: Reserved = 0.
ECX Bits 31-00: Reserved = 0.
EDX Bits 31-00: Reserved = 0.

Deterministic Address Translation Parameters Main Leaf (Initial EAX Value = 18H, ECX = 0)
18H NOTES:
Each sub-leaf enumerates a different address translation structure.
If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n
exceeds the value that sub-leaf 0 returns in EAX. A sub-leaf index is also invalid if EDX[4:0] returns 0.
Valid sub-leaves do not need to be contiguous or in any particular order. A valid sub-leaf may be in a
higher input ECX value than an invalid sub-leaf or than a valid sub-leaf of a higher or lower-level struc-
ture.
* Some unified TLBs will allow a single TLB entry to satisfy data read/write and instruction fetches.
Others will require separate entries (e.g., one loaded on data read/write and another loaded on an
instruction fetch). See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for details
of a particular product.
** Add one to the return value to get the result.

EAX Bits 31-00: Reports the maximum input value of supported sub-leaf in leaf 18H.
EBX Bit 00: 4K page size entries supported by this structure.
Bit 01: 2MB page size entries supported by this structure.
Bit 02: 4MB page size entries supported by this structure.
Bit 03: 1 GB page size entries supported by this structure.
Bits 07-04: Reserved.
Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).
Bits 15-11: Reserved.
Bits 31-16: W = Ways of associativity.
ECX Bits 31-00: S = Number of Sets.
EDX Bits 04-00: Translation cache type field.
00000b: Null (indicates this sub-leaf is not valid).
00001b: Data TLB.
00010b: Instruction TLB.
00011b: Unified TLB*.
00100b: Load Only TLB. Hit on loads; fills on both loads and stores.
00101b: Store Only TLB. Hit on stores; fill on stores.
All other encodings are reserved.
Bits 07-05: Translation cache level (starts at 1).
Bit 08: Fully associative structure.
Bits 13-09: Reserved.
Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation cache.**
Bits 31-26: Reserved.
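For a set-associative structure, the total entry count is the product of the ways (EBX[31:16]) and sets (ECX) enumerated above. A minimal enumeration sketch, assuming a GCC/Clang toolchain providing <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx, n, max_subleaf;
        if (!__get_cpuid_count(0x18, 0, &max_subleaf, &ebx, &ecx, &edx))
            return 1;                                /* leaf 18H not supported */
        for (n = 0; n <= max_subleaf; n++) {
            __cpuid_count(0x18, n, eax, ebx, ecx, edx);
            unsigned int type = edx & 0x1F;          /* EDX[4:0]: 0 means invalid sub-leaf */
            if (type == 0)
                continue;
            printf("sub-leaf %u: type %u, level %u, %u entries\n",
                   n, type, (edx >> 5) & 0x7,
                   (ebx >> 16) * ecx);               /* ways (EBX[31:16]) * sets (ECX) */
        }
        return 0;
    }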
Deterministic Address Translation Parameters Sub-leaf (Initial EAX Value = 18H, ECX ≥ 1)
18H NOTES:
Each sub-leaf enumerates a different address translation structure.
If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n
exceeds the value that sub-leaf 0 returns in EAX. A sub-leaf index is also invalid if EDX[4:0] returns 0.
Valid sub-leaves do not need to be contiguous or in any particular order. A valid sub-leaf may be in a
higher input ECX value than an invalid sub-leaf or than a valid sub-leaf of a higher or lower-level struc-
ture.
* Some unified TLBs will allow a single TLB entry to satisfy data read/write and instruction fetches.
Others will require separate entries (e.g., one loaded on data read/write and another loaded on an
instruction fetch). See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for details
of a particular product.
** Add one to the return value to get the result.

EAX Bits 31-00: Reserved.
EBX Bit 00: 4K page size entries supported by this structure.
Bit 01: 2MB page size entries supported by this structure.
Bit 02: 4MB page size entries supported by this structure.
Bit 03: 1 GB page size entries supported by this structure.
Bits 07-04: Reserved.
Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).
Bits 15-11: Reserved.
Bits 31-16: W = Ways of associativity.
ECX Bits 31-00: S = Number of Sets.
EDX Bits 04-00: Translation cache type field.
00000b: Null (indicates this sub-leaf is not valid).
00001b: Data TLB.
00010b: Instruction TLB.
00011b: Unified TLB*.
All other encodings are reserved.
Bits 07-05: Translation cache level (starts at 1).
Bit 08: Fully associative structure.
Bits 13-09: Reserved.
Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation cache.**
Bits 31-26: Reserved.
Key Locker Leaf (Initial EAX Value = 19H)
19H EAX Bit 00: Key Locker restriction of CPL0-only supported.
Bit 01: Key Locker restriction of no-encrypt supported.
Bit 02: Key Locker restriction of no-decrypt supported.
Bits 31-03: Reserved.
EBX Bit 00: AESKLE. If 1, the AES Key Locker instructions are fully enabled.
Bit 01: Reserved.
Bit 02: If 1, the AES wide Key Locker instructions are supported.
Bit 03: Reserved.
Bit 04: If 1, the platform supports the Key Locker MSRs (IA32_COPY_LOCAL_TO_PLATFORM,
IA32_COPY_PLATFORM_TO_LOCAL, IA32_COPY_STATUS, and IA32_IWKEYBACKUP_STATUS) and backing
up the internal wrapping key.
Bits 31-05: Reserved.
ECX Bit 00: If 1, the NoBackup parameter to LOADIWKEY is supported.
Bit 01: If 1, KeySource encoding of 1 (randomization of the internal wrapping key) is supported.
Bits 31-02: Reserved.
EDX Reserved.
Native Model ID Enumeration Leaf (Initial EAX Value = 1AH, ECX = 0)
1AH NOTES:
This leaf exists on all hybrid parts; however, it is not available only on hybrid parts. The following
algorithm is used to detect this leaf:
If CPUID.0.MAXLEAF ≥ 1AH and CPUID.1A.EAX ≠ 0, then the leaf exists.

EAX Enumerates the native model ID and core type.
Bits 31-24: Core type*
10H: Reserved
20H: Intel Atom®
30H: Reserved
40H: Intel® Core™
Bits 23-00: Native model ID of the core. The core type and native model ID can be used to uniquely
identify the microarchitecture of the core. The native model ID is not unique across core types, is not
related to the model ID reported in CPUID leaf 01H, and does not identify the SOC.
* The core type may be used only to identify the microarchitecture of this logical processor; its numeric
value has no other significance. This field neither implies nor expresses any other attribute of the logical
processor, and software should not assume any.
EBX Reserved.
ECX Reserved.
EDX Reserved.
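Applying the detection algorithm in the note above, a minimal sketch (assuming a GCC/Clang toolchain providing <cpuid.h>):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid_max(0, 0) < 0x1A)            /* CPUID.0.MAXLEAF >= 1AH? */
            return 1;
        __cpuid_count(0x1A, 0, eax, ebx, ecx, edx);
        if (eax == 0)                                /* CPUID.1A.EAX != 0? */
            return 1;
        printf("core type %02XH (20H = Intel Atom, 40H = Intel Core), native model ID %06XH\n",
               eax >> 24, eax & 0xFFFFFF);
        return 0;
    }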
PCONFIG Information Sub-leaf (Initial EAX Value = 1BH, ECX ≥ 0)
1BH For details on this sub-leaf, see “INPUT EAX = 1BH: Returns PCONFIG Information” on page 3-259.
NOTE:
Leaf 1BH is supported if CPUID.(EAX=07H, ECX=0H):EDX[18] = 1.
Last Branch Records Information Leaf (Initial EAX Value = 1CH)
1CH NOTE:
This leaf pertains to the architectural feature.
EAX Bits 07-00: Supported LBR Depth Values. For each bit n set in this field, the IA32_LBR_DEPTH.DEPTH
value 8*(n+1) is supported.
Bits 29-08: Reserved.
Bit 30: Deep C-state Reset. If set, indicates that LBRs may be cleared on an MWAIT that requests a C-state
numerically greater than C1.
Bit 31: IP Values Contain LIP. If set, LBR IP values contain LIP. If clear, IP values contain Effective IP.
EBX Bit 00: CPL Filtering Supported. If set, the processor supports setting IA32_LBR_CTL[2:1] to non-zero
value.
Bit 01: Branch Filtering Supported. If set, the processor supports setting IA32_LBR_CTL[22:16] to non-
zero value.
Bit 02: Call-stack Mode Supported. If set, the processor supports setting IA32_LBR_CTL[3] to 1.
Bits 31-03: Reserved.
ECX Bit 00: Mispredict Bit Supported. IA32_LBR_x_INFO[63] holds indication of branch misprediction
(MISPRED).
Bit 01: Timed LBRs Supported. IA32_LBR_x_INFO[15:0] holds CPU cycles since last LBR entry (CYC_CNT),
and IA32_LBR_x_INFO[60] holds an indication of whether the value held there is valid (CYC_CNT_VALID).
Bit 02: Branch Type Field Supported. IA32_LBR_INFO_x[59:56] holds indication of the recorded
operation's branch type (BR_TYPE).
Bits 15-03: Reserved.
Bits 19-16: Event Logging Supported bitmap.
Bits 31-20: Reserved.
EDX Bits 31-00: Reserved.
Tile Information Main Leaf (Initial EAX Value = 1DH, ECX = 0)
1DH NOTES:
Sub-leaves of leaf 1DH are indexed by the palette ID.
Leaf 1DH sub-leaves 2 and above are reserved.

EAX Bits 31-00: max_palette. Highest numbered palette sub-leaf. Value = 1.
EBX Bits 31-00: Reserved = 0.
ECX Bits 31-00: Reserved = 0.
EDX Bits 31-00: Reserved = 0.
Tile Palette 1 Sub-leaf (Initial EAX Value = 1DH, ECX = 1)
1DH EAX Bits 15-00: Palette 1 total_tile_bytes. Value = 8192.
Bits 31-16: Palette 1 bytes_per_tile. Value = 1024.
EBX Bits 15-00: Palette 1 bytes_per_row. Value = 64.
Bits 31-16: Palette 1 max_names (number of tile registers). Value = 8.
ECX Bits 15-00: Palette 1 max_rows. Value = 16.
Bits 31-16: Reserved = 0.
EDX Bits 31-00: Reserved = 0.
TMUL Information Main Leaf (Initial EAX Value = 1EH, ECX = 0)
1EH NOTE:
Leaf 1EH sub-leaves 1 and above are reserved.
EAX Bits 31-00: Reserved = 0.
EBX Bits 07-00: tmul_maxk (rows or columns). Value = 16.
Bits 23-08: tmul_maxn (column bytes). Value = 64.
Bits 31-24: Reserved = 0.
ECX Bits 31-00: Reserved = 0.
EDX Bits 31-00: Reserved = 0.
V2 Extended Topology Enumeration Leaf (Initial EAX Value = 1FH, ECX ≥ 0)
1FH NOTES:
CPUID leaf 1FH is a preferred superset of leaf 0BH. Intel recommends using leaf 1FH when it is available
rather than leaf 0BH, and ensuring that any leaf 0BH algorithms are updated to support leaf 1FH.
The sub-leaves of CPUID leaf 1FH describe an ordered hierarchy of logical processors starting from the
smallest-scoped domain of a Logical Processor (sub-leaf index 0) to the Core domain (sub-leaf index 1)
to the largest-scoped domain (the last valid sub-leaf index) that is implicitly subordinate to the
unenumerated highest-scoped domain of the processor package (socket).
The details of each valid domain are enumerated by a corresponding sub-leaf. Details for a domain include
its type and how all instances of that domain determine the number of logical processors and x2 APIC
ID partitioning at the next higher-scoped domain. The ordering of domains within the hierarchy is fixed
architecturally as shown below. For a given processor, not all domains may be relevant or enumerated;
however, the logical processor and core domains are always enumerated. As an example, a processor
may report an ordered hierarchy consisting only of “Logical Processor,” “Core,” and “Die.”
For two valid sub-leaves N and N+1, sub-leaf N+1 represents the next immediate higher-scoped
domain with respect to the domain of sub-leaf N for the given processor.
If sub-leaf index “N” returns an invalid domain type in ECX[15:08] (00H), then all sub-leaves with an
index greater than “N” shall also return an invalid domain type. A sub-leaf returning an invalid domain
always returns 0 in EAX and EBX.
EAX Bits 04-00: The number of bits that the x2APIC ID must be shifted to the right to address instances of the
next higher-scoped domain. When multiple logical processors per core are not supported, the value of
this field at the Logical Processor domain sub-leaf may be returned as either 0 (no allocated bits in the
x2APIC ID) or 1 (one allocated bit in the x2APIC ID); software should plan accordingly.
Bits 31-05: Reserved.

EBX Bits 15-00: The number of logical processors across all instances of this domain within the next higher-
scoped domain relative to this current logical processor. (For example, in a processor socket/package
comprising “M” dies of “N” cores each, where each core has “L” logical processors, the “die” domain sub-
leaf value of this field would be M*N*L. In an asymmetric topology this would be the summation of the
value across the lower domain level instances to create each upper domain level instance.) This number
reflects configuration as shipped by Intel. Note, software must not use this field to enumerate processor
topology*.
Bits 31-16: Reserved.
ECX Bits 07-00: The input ECX sub-leaf index.
Bits 15-08: Domain Type. This field provides an identification value which indicates the domain as shown
below. Although domains are ordered, as also shown below, their assigned identification values are not,
and software should not depend on them. (For example, if a new domain between core and module is
specified, it will have an identification value higher than 5.)

Hierarchy   Domain              Domain Type Identification Value
Lowest      Logical Processor   1
...         Core                2
...         Module              3
...         Tile                4
...         Die                 5
...         DieGrp              6
Highest     Package/Socket      (implied)
(Note that enumeration values of 0 and 7-255 are reserved.)

Bits 31-16: Reserved.


EDX Bits 31-00: x2APIC ID of the current logical processor. It is always valid and does not vary with the sub-
leaf index in ECX.
NOTES:
* Software must not use the value of EBX[15:0] to enumerate processor topology of the system. The
value is only intended for display and diagnostic purposes. The actual number of logical processors avail-
able to BIOS/OS/Applications may be different from the value of EBX[15:0], depending on software and
platform hardware configurations.
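The shift values enumerated in EAX[4:0] are intended to be applied directly to the x2APIC ID in EDX. A minimal sketch of walking the hierarchy and deriving the package ID, assuming a GCC/Clang toolchain providing <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx, n, shift = 0, x2apic = 0;
        if (__get_cpuid_max(0, 0) < 0x1F)            /* leaf 1FH not supported */
            return 1;
        for (n = 0; ; n++) {
            __cpuid_count(0x1F, n, eax, ebx, ecx, edx);
            unsigned int domain = (ecx >> 8) & 0xFF; /* ECX[15:8]: 0 means invalid */
            if (domain == 0)
                break;
            shift  = eax & 0x1F;                     /* bits to shift out this domain */
            x2apic = edx;
            printf("sub-leaf %u: domain type %u, shift %u\n", n, domain, shift);
        }
        /* Bits above the last valid shift identify the package/socket. */
        printf("x2APIC ID %u is in package %u\n", x2apic, x2apic >> shift);
        return 0;
    }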
Processor History Reset Sub-leaf (Initial EAX Value = 20H, ECX = 0)
20H EAX Reports the maximum number of sub-leaves that are supported in leaf 20H.
EBX Indicates which bits may be set in the IA32_HRESET_ENABLE MSR to enable reset of different compo-
nents of hardware-maintained history.
Bit 00: Indicates support for both HRESET’s EAX[0] parameter, and IA32_HRESET_ENABLE[0] set by the
OS to enable reset of Intel® Thread Director history.
Bits 31-01: Reserved = 0.
ECX Reserved.
EDX Reserved.
Architectural Performance Monitoring Extended Main Leaf (Initial EAX Value = 23H, ECX = 0)
23H NOTE:
Output depends on ECX input value.
EAX Bits 31-00: If bit n is set, sub-leaf n is supported. (For unsupported sub-leaves, 0 is returned in the
registers EAX, EBX, ECX, and EDX.)

EBX Bit 00: UnitMask2 supported. If set, the processor supports the UnitMask2 field in the
IA32_PERFEVTSELx MSRs.
Bit 01: EQ-bit supported. If set, the processor supports the equal flag in the IA32_PERFEVTSELx MSRs.
Bits 31-02: Reserved.
ECX Bits 07-00: Number of Top-down Microarchitecture Analysis (TMA) slots per cycle. This number can be
multiplied by the number of cycles (from CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.CORE or
IA32_FIXED_CTR1) to determine the total number of slots.
Bits 31-08: Reserved.
EDX Bits 31-00: Reserved.
Architectural Performance Monitoring Extended Sub-Leaf (Initial EAX Value = 23H, ECX = 1)
23H EAX Bits 31-00: General counters bitmap. For each bit n set in this field, the processor supports general-
purpose performance monitoring counter n.
EBX Bits 31-00: Fixed counters bitmap. For each bit m set in this field, the processor supports fixed-function
performance monitoring counter m.
ECX Bits 31-00: Reserved.
EDX Bits 31-00: Reserved.
Architectural Performance Monitoring Extended Sub-Leaf (Initial EAX Value = 23H, ECX = 2)
23H EAX Bits 31-00: Bitmap of Auto Counter Reload (ACR) general counters that can be reloaded. For each bit n
set in this field, the processor supports ACR for general-purpose performance monitoring counter n.
EBX Bits 31-00: Bitmap of Auto Counter Reload (ACR) fixed counters that can be reloaded. For each bit m set
in this field, the processor supports ACR for fixed-function performance monitoring counter m.
ECX Bits 31-00: Bitmap of Auto Counter Reload (ACR) general counters that can cause reloads. For each bit y
set in this field, the processor allows general-purpose performance monitoring counter y to reload all
existing general-purpose performance monitoring counters capable of being reloaded.
EDX Bits 31-00: Bitmap of Auto Counter Reload (ACR) fixed counters that can cause reloads. For each bit x set
in this field, the processor allows fixed-function performance monitoring counter x to reload all existing
fixed-function performance monitoring counters capable of being reloaded.
Architectural Performance Monitoring Extended Sub-Leaf (Initial EAX Value = 23H, ECX = 3)
23H NOTE:
Architectural Performance Monitoring Events Bitmap. For each bit n set in this field, the processor sup-
ports Architectural Performance Monitoring Event of index n.
EAX Bit 00: Core cycles.
Bit 01: Instructions retired.
Bit 02: Reference cycles.
Bit 03: Last level cache references.
Bit 04: Last level cache misses.
Bit 05: Branch instructions retired.
Bit 06: Branch mispredicts retired.
Bit 07: Topdown slots.
Bit 08: Topdown backend bound.
Bit 09: Topdown bad speculation.

Bit 10: Topdown frontend bound.
Bit 11: Topdown retiring.
Bit 12: LBR inserts.
Bits 31-13: Reserved.
EBX Bits 31-00: Reserved.
ECX Bits 31-00: Reserved.
EDX Bits 31-00: Reserved.
Converged Vector ISA Main Leaf (Initial EAX Value = 24H, ECX = 0)
24H NOTE:
Output depends on ECX input value.
EAX Bits 31-00: Reports the maximum number of sub-leaves that are supported in leaf 24H.
EBX Bits 07-00: Reports the Intel AVX10 Converged Vector ISA version.
Bits 15-08: Reserved.
Bit 16: If 1, indicates that 128-bit vector support is present.
Bit 17: If 1, indicates that 256-bit vector support is present.
Bit 18: If 1, indicates that 512-bit vector support is present.
Bits 31-19: Reserved.
ECX Bits 31-00: Reserved.
EDX Bits 31-00: Reserved.
Unimplemented CPUID Leaf Functions
21H Invalid. No existing or future CPU will return processor identification or feature information if the initial
EAX value is 21H. If the value returned by CPUID.0:EAX (the maximum input value for basic CPUID
information) is at least 21H, 0 is returned in the registers EAX, EBX, ECX, and EDX. Otherwise, the data
for the highest basic information leaf is returned.
40000000H − 4FFFFFFFH Invalid. No existing or future CPU will return processor identification or feature information if the
initial EAX value is in the range 40000000H to 4FFFFFFFH.
Extended Function CPUID Information
80000000H EAX Maximum Input Value for Extended Function CPUID Information.
EBX Reserved.
ECX Reserved.
EDX Reserved.
80000001H EAX Extended Processor Signature and Feature Bits.
EBX Reserved.
ECX Bit 00: LAHF/SAHF available in 64-bit mode.*
Bits 04-01: Reserved.
Bit 05: LZCNT.
Bits 07-06: Reserved.
Bit 08: PREFETCHW.
Bits 31-09: Reserved.

EDX Bits 10-00: Reserved.
Bit 11: SYSCALL/SYSRET.**
Bits 19-12: Reserved = 0.
Bit 20: Execute Disable Bit available.
Bits 25-21: Reserved = 0.
Bit 26: 1-GByte pages are available if 1.
Bit 27: RDTSCP and IA32_TSC_AUX are available if 1.
Bit 28: Reserved = 0.
Bit 29: Intel® 64 Architecture available if 1.
Bits 31-30: Reserved = 0.

NOTES:
* LAHF and SAHF are always available in other modes, regardless of the enumeration of this feature flag.
** Intel processors support SYSCALL and SYSRET only in 64-bit mode. This feature flag is always enumer-
ated as 0 outside 64-bit mode.
80000002H EAX Processor Brand String.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000003H EAX Processor Brand String Continued.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000004H EAX Processor Brand String Continued.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000005H EAX Reserved = 0.
EBX Reserved = 0.
ECX Reserved = 0.
EDX Reserved = 0.
80000006H EAX Reserved = 0.
EBX Reserved = 0.
ECX Bits 07-00: Cache Line size in bytes.
Bits 11-08: Reserved.
Bits 15-12: L2 Associativity field *.
Bits 31-16: Cache size in 1K units.
EDX Reserved = 0.
NOTES:
* L2 associativity field encodings:
00H - Disabled 08H - 16 ways
01H - 1 way (direct mapped) 09H - Reserved
02H - 2 ways 0AH - 32 ways
03H - Reserved 0BH - 48 ways
04H - 4 ways 0CH - 64 ways
05H - Reserved 0DH - 96 ways
06H - 8 ways 0EH - 128 ways
07H - See CPUID leaf 04H, sub-leaf 2** 0FH - Fully associative

** CPUID leaf 04H provides details of deterministic cache parameters, including the L2 cache in sub-leaf 2.

80000007H EAX Reserved = 0.
EBX Reserved = 0.
ECX Reserved = 0.
EDX Bits 07-00: Reserved = 0.
Bit 08: Invariant TSC available if 1.
Bits 31-09: Reserved = 0.
80000008H EAX Linear/Physical Address size.
Bits 07-00: #Physical Address Bits*.
Bits 15-08: #Linear Address Bits.
Bits 31-16: Reserved = 0.
EBX Bits 08-00: Reserved = 0.
Bit 09: WBNOINVD is available if 1.
Bits 31-10: Reserved = 0.
ECX Reserved = 0.
EDX Reserved = 0.
NOTES:
* If CPUID.80000008H:EAX[7:0] is supported, the maximum physical address number supported should
come from this field. If TME-MK is enabled, the number of bits that can be used to address physical
memory is CPUID.80000008H:EAX[7:0] - IA32_TME_ACTIVATE[35:32].

INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification String
When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID recognizes for
returning basic processor information. The value is returned in the EAX register and is processor specific.
A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “Genu-
ineIntel” and is expressed:
EBX := 756e6547h (* “Genu”, with G in the low eight bits of BL *)
EDX := 49656e69h (* “ineI”, with i in the low eight bits of DL *)
ECX := 6c65746eh (* “ntel”, with n in the low eight bits of CL *)
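Equivalently, in a high-level language the string can be assembled from the three registers. A minimal sketch using the GCC/Clang <cpuid.h> intrinsics (illustrative, not part of the definition above):

    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13];
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 1;
        /* The 12-byte string is laid out in EBX, EDX, ECX order. */
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';
        printf("max basic leaf %u, vendor \"%s\"\n", eax, vendor);  /* "GenuineIntel" */
        return 0;
    }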

INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information
When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recog-
nizes for returning extended processor information. The value is returned in the EAX register and is processor
specific.

IA32_BIOS_SIGN_ID Returns Microcode Update Signature


For processors that support the microcode update facility, the IA32_BIOS_SIGN_ID MSR is loaded with the update
signature whenever CPUID executes. The signature is returned in the upper DWORD. For details, see Chapter 11 in
the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

INPUT EAX = 01H: Returns Model, Family, Stepping Information


When CPUID executes with EAX set to 01H, version information is returned in EAX (see Figure 3-6). For example,
the model, family, and processor type for the Intel Xeon processor 5100 series are as follows:
• Model — 1111B
• Family — 0101B
• Processor Type — 00B
See Table 3-18 for available processor type values. Stepping IDs are provided as needed.



The version information in EAX is laid out as follows (Figure 3-6):
Bits 03-00: Stepping ID.
Bits 07-04: Model.
Bits 11-08: Family ID (0FH for the Pentium 4 processor family).
Bits 13-12: Processor Type.
Bits 15-14: Reserved.
Bits 19-16: Extended Model ID.
Bits 27-20: Extended Family ID.
Bits 31-28: Reserved.

Figure 3-6. Version Information Returned by CPUID in EAX

Table 3-18. Processor Type Field

Type                                                     Encoding
Original OEM Processor                                   00B
Intel® OverDrive Processor                               01B
Dual processor (not applicable to Intel486 processors)   10B
Intel reserved                                           11B

NOTE
See Chapter 20 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for
information on identifying earlier IA-32 processors.

The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display
using the following rule:
IF Family_ID ≠ 0FH
THEN DisplayFamily = Family_ID;
ELSE DisplayFamily = Extended_Family_ID + Family_ID;
FI;
(* Show DisplayFamily as HEX field. *)
The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a
display using the following rule:
IF (Family_ID = 06H or Family_ID = 0FH)
THEN DisplayModel = (Extended_Model_ID << 4) + Model_ID;
(* Right justify and zero-extend 4-bit field; display Model_ID as HEX field.*)
ELSE DisplayModel = Model_ID;
FI;
(* Show DisplayModel as HEX field. *)
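The two rules above combine into a short routine. A minimal C rendering, assuming a GCC/Clang toolchain providing <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;
        unsigned int stepping   = eax & 0xF;
        unsigned int model      = (eax >> 4)  & 0xF;
        unsigned int family     = (eax >> 8)  & 0xF;
        unsigned int ext_model  = (eax >> 16) & 0xF;
        unsigned int ext_family = (eax >> 20) & 0xFF;

        /* DisplayFamily and DisplayModel per the rules above. */
        unsigned int disp_family = (family != 0xF) ? family : family + ext_family;
        unsigned int disp_model  = (family == 0x6 || family == 0xF)
                                 ? (ext_model << 4) + model : model;
        printf("DisplayFamily_DisplayModel: %02X_%02XH, stepping %u\n",
               disp_family, disp_model, stepping);
        return 0;
    }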



INPUT EAX = 01H: Returns Additional Information in EBX
When CPUID executes with EAX set to 01H, additional information is returned to the EBX register:
• Brand index (low byte of EBX) — this number provides an entry into a brand string table that contains brand
strings for IA-32 processors. More information about this field is provided later in this section.
• CLFLUSH instruction cache line size (second byte of EBX) — this number indicates the size of the cache line
flushed by the CLFLUSH and CLFLUSHOPT instructions in 8-byte increments. This field was introduced in the
Pentium 4 processor.
• Local APIC ID (high byte of EBX) — this number is the 8-bit ID that is assigned to the local APIC on the
processor during power up. This field was introduced in the Pentium 4 processor.

INPUT EAX = 01H: Returns Feature Information in ECX and EDX


When CPUID executes with EAX set to 01H, feature information is returned in ECX and EDX.
• Figure 3-7 and Table 3-19 show encodings for ECX.
• Figure 3-8 and Table 3-20 show encodings for EDX.
For all feature flags, a 1 indicates that the feature is supported. Consult Intel documentation to interpret feature flags properly.

NOTE
Software must confirm that a processor feature is present using feature flags returned by CPUID
prior to using the feature. Software should not depend on future offerings retaining all features.
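For example, testing a few ECX flags from Table 3-19 before using the corresponding instructions (a sketch assuming GCC/Clang <cpuid.h>; the flags shown are chosen for illustration):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;
        /* Test individual flags before using the corresponding feature. */
        printf("SSE4.2: %s\n", (ecx & (1u << 20)) ? "yes" : "no");  /* ECX[20] */
        printf("AESNI:  %s\n", (ecx & (1u << 25)) ? "yes" : "no");  /* ECX[25] */
        printf("RDRAND: %s\n", (ecx & (1u << 30)) ? "yes" : "no");  /* ECX[30] */
        return 0;
    }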

Figure 3-7. Feature Information Returned in the ECX Register



Table 3-19. Feature Information Returned in the ECX Register
Bit # Mnemonic Description
0 SSE3 Streaming SIMD Extensions 3 (SSE3). A value of 1 indicates the processor supports this
technology.
1 PCLMULQDQ PCLMULQDQ. A value of 1 indicates the processor supports the PCLMULQDQ instruction.
2 DTES64 64-bit DS Area. A value of 1 indicates the processor supports DS area using 64-bit layout.
3 MONITOR MONITOR/MWAIT. A value of 1 indicates the processor supports this feature.
4 DS-CPL CPL Qualified Debug Store. A value of 1 indicates the processor supports the extensions to the
Debug Store feature to allow for branch message storage qualified by CPL.
5 VMX Virtual Machine Extensions. A value of 1 indicates that the processor supports this technology.
6 SMX Safer Mode Extensions. A value of 1 indicates that the processor supports this technology. See
Chapter 7, “Safer Mode Extensions Reference.”
7 EIST Enhanced Intel SpeedStep® technology. A value of 1 indicates that the processor supports this
technology.
8 TM2 Thermal Monitor 2. A value of 1 indicates that the processor supports this technology.
9 SSSE3 A value of 1 indicates the presence of the Supplemental Streaming SIMD Extensions 3 (SSSE3). A
value of 0 indicates the instruction extensions are not present in the processor.
10 CNXT-ID L1 Context ID. A value of 1 indicates the L1 data cache mode can be set to either adaptive mode
or shared mode. A value of 0 indicates this feature is not supported. See definition of the
IA32_MISC_ENABLE MSR Bit 24 (L1 Data Cache Context Mode) for details.
11 SDBG A value of 1 indicates the processor supports IA32_DEBUG_INTERFACE MSR for silicon debug.
12 FMA A value of 1 indicates the processor supports FMA extensions using YMM state.
13 CMPXCHG16B CMPXCHG16B Available. A value of 1 indicates that the feature is available. See the
“CMPXCHG8B/CMPXCHG16B—Compare and Exchange Bytes” section in this chapter for a
description.
14 xTPR Update Control xTPR Update Control. A value of 1 indicates that the processor supports changing
IA32_MISC_ENABLE[bit 23].
15 PDCM Perfmon and Debug Capability: A value of 1 indicates the processor supports the performance
and debug feature indication MSR IA32_PERF_CAPABILITIES.
16 Reserved Reserved
17 PCID Process-context identifiers. A value of 1 indicates that the processor supports PCIDs and that
software may set CR4.PCIDE to 1.
18 DCA A value of 1 indicates the processor supports the ability to prefetch data from a memory mapped
device.
19 SSE4_1 A value of 1 indicates that the processor supports SSE4.1.
20 SSE4_2 A value of 1 indicates that the processor supports SSE4.2.
21 x2APIC A value of 1 indicates that the processor supports x2APIC feature.
22 MOVBE A value of 1 indicates that the processor supports MOVBE instruction.
23 POPCNT A value of 1 indicates that the processor supports the POPCNT instruction.
24 TSC-Deadline A value of 1 indicates that the processor’s local APIC timer supports one-shot operation using a
TSC deadline value.
25 AESNI A value of 1 indicates that the processor supports the AESNI instruction extensions.
26 XSAVE A value of 1 indicates that the processor supports the XSAVE/XRSTOR processor extended states
feature, the XSETBV/XGETBV instructions, and XCR0.
27 OSXSAVE A value of 1 indicates that the OS has set CR4.OSXSAVE[bit 18] to enable XSETBV/XGETBV
instructions to access XCR0 and to support processor extended state management using
XSAVE/XRSTOR.
28 AVX A value of 1 indicates the processor supports the AVX instruction extensions.

29 F16C A value of 1 indicates that processor supports 16-bit floating-point conversion instructions.
30 RDRAND A value of 1 indicates that processor supports RDRAND instruction.
31 Not Used Always returns 0.

Figure 3-8. Feature Information Returned in the EDX Register



Table 3-20. More on Feature Information Returned in the EDX Register
Bit # Mnemonic Description
0 FPU Floating-Point Unit On-Chip. The processor contains an x87 FPU.
1 VME Virtual 8086 Mode Enhancements. Virtual 8086 mode enhancements, including CR4.VME for controlling the
feature, CR4.PVI for protected mode virtual interrupts, software interrupt indirection, expansion of the TSS
with the software indirection bitmap, and EFLAGS.VIF and EFLAGS.VIP flags.
2 DE Debugging Extensions. Support for I/O breakpoints, including CR4.DE for controlling the feature, and optional
trapping of accesses to DR4 and DR5.
3 PSE Page Size Extension. Large pages of size 4 MByte are supported, including CR4.PSE for controlling the
feature, the defined dirty bit in PDE (Page Directory Entries), optional reserved bit trapping in CR3, PDEs, and
PTEs.
4 TSC Time Stamp Counter. The RDTSC instruction is supported, including CR4.TSD for controlling privilege.
5 MSR Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR and WRMSR instructions are
supported. Some of the MSRs are implementation dependent.
6 PAE Physical Address Extension. Physical addresses greater than 32 bits are supported: extended page-table
entry formats are used, an extra level in the page translation tables is defined, and 2-MByte pages are
supported instead of 4-MByte pages if the PAE bit is 1.
7 MCE Machine Check Exception. Exception 18 is defined for Machine Checks, including CR4.MCE for controlling the
feature. This feature does not define the model-specific implementations of machine-check error logging,
reporting, and processor shutdowns. Machine Check exception handlers may have to depend on processor
version to do model specific processing of the exception, or test for the presence of the Machine Check feature.
8 CX8 CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly
locked and atomic).
9 APIC APIC On-Chip. The processor contains an Advanced Programmable Interrupt Controller (APIC), responding to
memory mapped commands in the physical address range FFFE0000H to FFFE0FFFH (by default - some
processors permit the APIC to be relocated).
10 Reserved Reserved
11 SEP SYSENTER and SYSEXIT Instructions. The SYSENTER and SYSEXIT and associated MSRs are supported.
12 MTRR Memory Type Range Registers. MTRRs are supported. The MTRRcap MSR contains feature bits that describe
what memory types are supported, how many variable MTRRs are supported, and whether fixed MTRRs are
supported.
13 PGE Page Global Bit. The global bit is supported in paging-structure entries that map a page, indicating TLB entries
that are common to different processes and need not be flushed. The CR4.PGE bit controls this feature.
14 MCA Machine Check Architecture. A value of 1 indicates the Machine Check Architecture of reporting machine
errors is supported. The MCG_CAP MSR contains feature bits describing how many banks of error reporting
MSRs are supported.
15 CMOV Conditional Move Instructions. The conditional move instruction CMOV is supported. In addition, if the x87
FPU is present as indicated by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions are supported.
16 PAT Page Attribute Table. Page Attribute Table is supported. This feature augments the Memory Type Range
Registers (MTRRs), allowing an operating system to specify attributes of memory accessed through a linear
address on a 4KB granularity.
17 PSE-36 36-Bit Page Size Extension. 4-MByte pages addressing physical memory beyond 4 GBytes are supported with
32-bit paging. This feature indicates that upper bits of the physical address of a 4-MByte page are encoded in
bits 20:13 of the page-directory entry. Such physical addresses are limited by MAXPHYADDR and may be up to
40 bits in size.
18 PSN Processor Serial Number. The processor supports the 96-bit processor identification number feature and the
feature is enabled.
19 CLFSH CLFLUSH Instruction. CLFLUSH Instruction is supported.
20 Reserved Reserved

21 DS Debug Store. The processor supports the ability to write debug information into a memory resident buffer.
This feature is used by the branch trace store (BTS) and processor event-based sampling (PEBS) facilities (see
Chapter 25, “Introduction to Virtual Machine Extensions,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C).
22 ACPI Thermal Monitor and Software Controlled Clock Facilities. The processor implements internal MSRs that
allow processor temperature to be monitored and processor performance to be modulated in predefined duty
cycles under software control.
23 MMX Intel MMX Technology. The processor supports the Intel MMX technology.
24 FXSR FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are supported for fast save and
restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available for an
operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.
25 SSE SSE. The processor supports the SSE extensions.
26 SSE2 SSE2. The processor supports the SSE2 extensions.
27 SS Self Snoop. The processor supports the management of conflicting memory types by performing a snoop of its
own cache structure for transactions issued to the bus.
28 HTT Max APIC IDs reserved field is Valid. A value of 0 for HTT indicates there is only a single logical processor in
the package and software should assume only a single APIC ID is reserved. A value of 1 for HTT indicates the
value in CPUID.1.EBX[23:16] (the Maximum number of addressable IDs for logical processors in this package) is
valid for the package.
29 TM Thermal Monitor. The processor implements the thermal monitor automatic thermal control circuitry (TCC).
30 Reserved Reserved
31 PBE Pending Break Enable. The processor supports the use of the FERR#/PBE# pin when the processor is in the
stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the
processor should return to normal operation to handle the interrupt.

INPUT EAX = 02H: TLB/Cache/Prefetch Information Returned in EAX, EBX, ECX, EDX
When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal TLBs,
cache, and prefetch hardware in the EAX, EBX, ECX, and EDX registers. The information is reported in encoded
form and falls into the following categories:
• The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value
and not interpret it as an informational descriptor.
• The most significant bit (bit 31) of each register indicates whether the register contains valid information (set
to 0) or is reserved (set to 1).
• If a register contains valid information, the information is contained in 1-byte descriptors. There are four types
of encoding values for the byte descriptor; the encoding type is noted in the second column of Table 3-21. Table
3-21 lists the encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX
registers is not defined; that is, specific bytes are not designated to contain descriptors for specific cache,
prefetch, or TLB types. The descriptors may appear in any order. Note also a processor may report a general
descriptor type (FFH) and not report any byte descriptor of “cache type” via CPUID leaf 2.



Table 3-21. Encoding of CPUID Leaf 2 Descriptors
Descriptor Value    Type    Cache or TLB Description
00H General Null descriptor, this byte contains no information.
01H TLB Instruction TLB: 4 KByte pages, 4-way set associative, 32 entries.
02H TLB Instruction TLB: 4 MByte pages, fully associative, 2 entries.
03H TLB Data TLB: 4 KByte pages, 4-way set associative, 64 entries.
04H TLB Data TLB: 4 MByte pages, 4-way set associative, 8 entries.
05H TLB Data TLB1: 4 MByte pages, 4-way set associative, 32 entries.
06H Cache 1st-level instruction cache: 8 KBytes, 4-way set associative, 32 byte line size.
08H Cache 1st-level instruction cache: 16 KBytes, 4-way set associative, 32 byte line size.
09H Cache 1st-level instruction cache: 32KBytes, 4-way set associative, 64 byte line size.
0AH Cache 1st-level data cache: 8 KBytes, 2-way set associative, 32 byte line size.
0BH TLB Instruction TLB: 4 MByte pages, 4-way set associative, 4 entries.
0CH Cache 1st-level data cache: 16 KBytes, 4-way set associative, 32 byte line size.
0DH Cache 1st-level data cache: 16 KBytes, 4-way set associative, 64 byte line size.
0EH Cache 1st-level data cache: 24 KBytes, 6-way set associative, 64 byte line size.
1DH Cache 2nd-level cache: 128 KBytes, 2-way set associative, 64 byte line size.
21H Cache 2nd-level cache: 256 KBytes, 8-way set associative, 64 byte line size.
22H Cache 3rd-level cache: 512 KBytes, 4-way set associative, 64 byte line size, 2 lines per sector.
23H Cache 3rd-level cache: 1 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector.
24H Cache 2nd-level cache: 1 MBytes, 16-way set associative, 64 byte line size.
25H Cache 3rd-level cache: 2 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector.
29H Cache 3rd-level cache: 4 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector.
2CH Cache 1st-level data cache: 32 KBytes, 8-way set associative, 64 byte line size.
30H Cache 1st-level instruction cache: 32 KBytes, 8-way set associative, 64 byte line size.
40H Cache No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache.
41H Cache 2nd-level cache: 128 KBytes, 4-way set associative, 32 byte line size.
42H Cache 2nd-level cache: 256 KBytes, 4-way set associative, 32 byte line size.
43H Cache 2nd-level cache: 512 KBytes, 4-way set associative, 32 byte line size.
44H Cache 2nd-level cache: 1 MByte, 4-way set associative, 32 byte line size.
45H Cache 2nd-level cache: 2 MByte, 4-way set associative, 32 byte line size.
46H Cache 3rd-level cache: 4 MByte, 4-way set associative, 64 byte line size.
47H Cache 3rd-level cache: 8 MByte, 8-way set associative, 64 byte line size.
48H Cache 2nd-level cache: 3MByte, 12-way set associative, 64 byte line size.
49H Cache 3rd-level cache: 4MB, 16-way set associative, 64-byte line size (Intel Xeon processor MP, Family 0FH,
Model 06H);
2nd-level cache: 4 MByte, 16-way set associative, 64 byte line size.
4AH Cache 3rd-level cache: 6MByte, 12-way set associative, 64 byte line size.
4BH Cache 3rd-level cache: 8MByte, 16-way set associative, 64 byte line size.
4CH Cache 3rd-level cache: 12MByte, 12-way set associative, 64 byte line size.
4DH Cache 3rd-level cache: 16MByte, 16-way set associative, 64 byte line size.
4EH Cache 2nd-level cache: 6MByte, 24-way set associative, 64 byte line size.
4FH TLB Instruction TLB: 4 KByte pages, 32 entries.

50H TLB Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries.
51H TLB Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 128 entries.
52H TLB Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 256 entries.
55H TLB Instruction TLB: 2-MByte or 4-MByte pages, fully associative, 7 entries.
56H TLB Data TLB0: 4 MByte pages, 4-way set associative, 16 entries.
57H TLB Data TLB0: 4 KByte pages, 4-way associative, 16 entries.
59H TLB Data TLB0: 4 KByte pages, fully associative, 16 entries.
5AH TLB Data TLB0: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries.
5BH TLB Data TLB: 4 KByte and 4 MByte pages, 64 entries.
5CH TLB Data TLB: 4 KByte and 4 MByte pages,128 entries.
5DH TLB Data TLB: 4 KByte and 4 MByte pages,256 entries.
60H Cache 1st-level data cache: 16 KByte, 8-way set associative, 64 byte line size.
61H TLB Instruction TLB: 4 KByte pages, fully associative, 48 entries.
63H TLB Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte
pages, 4-way set associative, 4 entries.
64H TLB Data TLB: 4 KByte pages, 4-way set associative, 512 entries.
66H Cache 1st-level data cache: 8 KByte, 4-way set associative, 64 byte line size.
67H Cache 1st-level data cache: 16 KByte, 4-way set associative, 64 byte line size.
68H Cache 1st-level data cache: 32 KByte, 4-way set associative, 64 byte line size.
6AH Cache uTLB: 4 KByte pages, 8-way set associative, 64 entries.
6BH Cache DTLB: 4 KByte pages, 8-way set associative, 256 entries.
6CH Cache DTLB: 2M/4M pages, 8-way set associative, 128 entries.
6DH Cache DTLB: 1 GByte pages, fully associative, 16 entries.
70H Cache Trace cache: 12 K-μop, 8-way set associative.
71H Cache Trace cache: 16 K-μop, 8-way set associative.
72H Cache Trace cache: 32 K-μop, 8-way set associative.
76H TLB Instruction TLB: 2M/4M pages, fully associative, 8 entries.
78H Cache 2nd-level cache: 1 MByte, 4-way set associative, 64byte line size.
79H Cache 2nd-level cache: 128 KByte, 8-way set associative, 64 byte line size, 2 lines per sector.
7AH Cache 2nd-level cache: 256 KByte, 8-way set associative, 64 byte line size, 2 lines per sector.
7BH Cache 2nd-level cache: 512 KByte, 8-way set associative, 64 byte line size, 2 lines per sector.
7CH Cache 2nd-level cache: 1 MByte, 8-way set associative, 64 byte line size, 2 lines per sector.
7DH Cache 2nd-level cache: 2 MByte, 8-way set associative, 64byte line size.
7FH Cache 2nd-level cache: 512 KByte, 2-way set associative, 64-byte line size.
80H Cache 2nd-level cache: 512 KByte, 8-way set associative, 64-byte line size.
82H Cache 2nd-level cache: 256 KByte, 8-way set associative, 32 byte line size.
83H Cache 2nd-level cache: 512 KByte, 8-way set associative, 32 byte line size.
84H Cache 2nd-level cache: 1 MByte, 8-way set associative, 32 byte line size.
85H Cache 2nd-level cache: 2 MByte, 8-way set associative, 32 byte line size.
86H Cache 2nd-level cache: 512 KByte, 4-way set associative, 64 byte line size.
87H Cache 2nd-level cache: 1 MByte, 8-way set associative, 64 byte line size.

A0H DTLB DTLB: 4 KByte pages, fully associative, 32 entries.
B0H TLB Instruction TLB: 4 KByte pages, 4-way set associative, 128 entries.
B1H TLB Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries.
B2H TLB Instruction TLB: 4 KByte pages, 4-way set associative, 64 entries.
B3H TLB Data TLB: 4 KByte pages, 4-way set associative, 128 entries.
B4H TLB Data TLB1: 4 KByte pages, 4-way associative, 256 entries.
B5H TLB Instruction TLB: 4 KByte pages, 8-way set associative, 64 entries.
B6H TLB Instruction TLB: 4 KByte pages, 8-way set associative, 128 entries.
BAH TLB Data TLB1: 4 KByte pages, 4-way associative, 64 entries.
C0H TLB Data TLB: 4 KByte and 4 MByte pages, 4-way associative, 8 entries.
C1H STLB Shared 2nd-Level TLB: 4 KByte/2 MByte pages, 8-way associative, 1024 entries.
C2H DTLB DTLB: 4 KByte/2 MByte pages, 4-way associative, 16 entries.
C3H STLB Shared 2nd-Level TLB: 4 KByte/2 MByte pages, 6-way associative, 1536 entries. Also 1 GByte pages, 4-way, 16 entries.
C4H DTLB DTLB: 2M/4M Byte pages, 4-way associative, 32 entries.
CAH STLB Shared 2nd-Level TLB: 4 KByte pages, 4-way associative, 512 entries.
D0H Cache 3rd-level cache: 512 KByte, 4-way set associative, 64-byte line size.
D1H Cache 3rd-level cache: 1 MByte, 4-way set associative, 64-byte line size.
D2H Cache 3rd-level cache: 2 MByte, 4-way set associative, 64-byte line size.
D6H Cache 3rd-level cache: 1 MByte, 8-way set associative, 64-byte line size.
D7H Cache 3rd-level cache: 2 MByte, 8-way set associative, 64-byte line size.
D8H Cache 3rd-level cache: 4 MByte, 8-way set associative, 64-byte line size.
DCH Cache 3rd-level cache: 1.5 MByte, 12-way set associative, 64-byte line size.
DDH Cache 3rd-level cache: 3 MByte, 12-way set associative, 64-byte line size.
DEH Cache 3rd-level cache: 6 MByte, 12-way set associative, 64-byte line size.
E2H Cache 3rd-level cache: 2 MByte, 16-way set associative, 64-byte line size.
E3H Cache 3rd-level cache: 4 MByte, 16-way set associative, 64-byte line size.
E4H Cache 3rd-level cache: 8 MByte, 16-way set associative, 64-byte line size.
EAH Cache 3rd-level cache: 12 MByte, 24-way set associative, 64-byte line size.
EBH Cache 3rd-level cache: 18 MByte, 24-way set associative, 64-byte line size.
ECH Cache 3rd-level cache: 24 MByte, 24-way set associative, 64-byte line size.
F0H Prefetch 64-byte prefetching.
F1H Prefetch 128-byte prefetching.
FEH General CPUID leaf 2 does not report TLB descriptor information; use CPUID leaf 18H to query TLB and other address translation parameters.
FFH General CPUID leaf 2 does not report cache descriptor information; use CPUID leaf 4 to query cache parameters.

Example 3-1. Example of Cache and TLB Interpretation
The first member of the family of Pentium 4 processors returns the following information about caches and TLBs
when CPUID executes with an input value of 2:
EAX 66 5B 50 01H
EBX 0H
ECX 0H
EDX 00 7A 70 00H
Which means:
• The least-significant byte (byte 0) of register EAX is set to 01H. This value should be ignored.
• The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register
contains valid 1-byte descriptors.
• Bytes 1, 2, and 3 of register EAX indicate that the processor has:
— 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
— 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
— 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
• The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
• Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
— 00H - NULL descriptor.
— 70H - Trace cache: 12 K-μop, 8-way set associative.
— 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
— 00H - NULL descriptor.
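
A minimal C sketch of this decoding procedure (an illustration, not part of the original example; it assumes a GCC or Clang toolchain, whose <cpuid.h> header provides the __get_cpuid() helper):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int regs[4]; /* EAX, EBX, ECX, EDX */
    if (!__get_cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]))
        return 1;                      /* CPUID leaf 2 not supported */
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)
            continue;                  /* MSB set: register has no valid 1-byte descriptors */
        for (int b = 0; b < 4; b++) {
            unsigned char d = (regs[r] >> (8 * b)) & 0xFF;
            if (r == 0 && b == 0)
                continue;              /* least-significant byte of EAX is ignored */
            if (d != 0)                /* 00H is a NULL descriptor */
                printf("descriptor %02XH\n", d);
        }
    }
    return 0;
}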

INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data
that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid
index values start from 0.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an
index value of 0, until the parameters report a cache type field value of 0. The architecturally defined fields
reported by deterministic cache parameters are documented in Table 3-17.
The cache size in bytes is computed as:
Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
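
A hedged C sketch that enumerates leaf 04H sub-leaves and applies this formula (it assumes a GCC or Clang toolchain; __get_cpuid_count() from <cpuid.h> returns 0 when the leaf is not supported):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    for (unsigned int i = 0; ; i++) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, i, &eax, &ebx, &ecx, &edx))
            break;
        unsigned int type = eax & 0x1F;          /* EAX[4:0]: cache type field */
        if (type == 0)
            break;                               /* 0 = no more caches */
        unsigned int level = (eax >> 5) & 0x7;   /* EAX[7:5]: cache level */
        unsigned long long ways  = ((ebx >> 22) & 0x3FF) + 1; /* EBX[31:22] + 1 */
        unsigned long long parts = ((ebx >> 12) & 0x3FF) + 1; /* EBX[21:12] + 1 */
        unsigned long long line  = (ebx & 0xFFF) + 1;         /* EBX[11:0] + 1  */
        unsigned long long sets  = (unsigned long long)ecx + 1;
        printf("L%u (type %u): %llu bytes\n", level, type,
               ways * parts * line * sets);
    }
    return 0;
}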

The CPUID leaf 04H also reports data that can be used to derive the topology of processor cores in a physical
package. This information is constant for all valid index values. Software can query the raw data reported by
executing CPUID with EAX=04H and ECX=0 and use it as part of the topology enumeration algorithm described in
Chapter 10, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.

INPUT EAX = 05H: Returns MONITOR and MWAIT Features


When CPUID executes with EAX set to 05H, the processor returns information about features available to the
MONITOR/MWAIT instructions. The MONITOR instruction is used for address-range monitoring in conjunction with
the MWAIT instruction. The MWAIT instruction optionally provides additional extensions for advanced power
management. See Table 3-17.
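
A small illustrative sketch in C (assuming, per Table 3-17, that leaf 05H reports the smallest monitor-line size in EAX[15:0] and the largest in EBX[15:0]; it uses the GCC/Clang <cpuid.h> helper):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx))
        return 1;                                /* leaf 05H not supported */
    printf("smallest monitor-line size: %u bytes\n", eax & 0xFFFF); /* EAX[15:0] */
    printf("largest monitor-line size: %u bytes\n", ebx & 0xFFFF);  /* EBX[15:0] */
    return 0;
}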

INPUT EAX = 06H: Returns Thermal and Power Management Features


When CPUID executes with EAX set to 06H, the processor returns information about thermal and power manage-
ment features. See Table 3-17.

INPUT EAX = 07H: Returns Structured Extended Feature Enumeration Information
When CPUID executes with EAX set to 07H and ECX = 0, the processor returns information about the maximum
input value for sub-leaves that contain extended feature flags. See Table 3-17.
When CPUID executes with EAX set to 07H and the input value of ECX is invalid (see the leaf 07H entry in Table 3-17),
the processor returns 0 in EAX/EBX/ECX/EDX. In sub-leaf 0, EAX returns the maximum input value of the highest
leaf 07H sub-leaf, and EBX, ECX, and EDX contain extended feature flag information.

INPUT EAX = 09H: Returns Direct Cache Access Information


When CPUID executes with EAX set to 09H, the processor returns information about Direct Cache Access capabili-
ties. See Table 3-17.

INPUT EAX = 0AH: Returns Architectural Performance Monitoring Features


When CPUID executes with EAX set to 0AH, the processor returns information about support for architectural
performance monitoring capabilities. Architectural performance monitoring is supported if the version ID (see
Table 3-17) is greater than 0. See Table 3-17.
For each version of architectural performance monitoring capability, software must enumerate this leaf to discover
the programming facilities and the architectural performance events available in the processor. The details are
described in the chapter “Performance Monitoring” of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3B.
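
A hedged C sketch of this version check (assuming, per Table 3-17, that leaf 0AH reports the version ID in EAX[7:0], the number of general-purpose counters in EAX[15:8], and the counter width in EAX[23:16]; GCC/Clang <cpuid.h>):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x0A, &eax, &ebx, &ecx, &edx))
        return 1;
    unsigned int version = eax & 0xFF;           /* EAX[7:0]: version ID */
    if (version == 0) {
        puts("architectural performance monitoring not supported");
        return 0;
    }
    printf("version %u, %u general-purpose counters, %u bits wide\n",
           version, (eax >> 8) & 0xFF, (eax >> 16) & 0xFF);
    return 0;
}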

INPUT EAX = 0BH: Returns Extended Topology Information


CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of Leaf 1FH
before using leaf 0BH.
When CPUID executes with EAX set to 0BH, the processor returns information about extended topology enumera-
tion data. Software must detect the presence of CPUID leaf 0BH by verifying (a) the highest leaf index supported
by CPUID is >= 0BH, and (b) CPUID.0BH:EBX[15:0] reports a non-zero value. See Table 3-17.
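
A C sketch of this detection sequence, extended to prefer leaf 1FH as recommended above (an illustration assuming the GCC/Clang <cpuid.h> helpers):

#include <cpuid.h>
#include <stdio.h>

/* Returns the preferred topology leaf (1FH if present, else 0BH), or 0 if neither. */
static unsigned int topology_leaf(void) {
    unsigned int eax, ebx, ecx, edx;
    unsigned int max_leaf = __get_cpuid_max(0, 0);   /* highest basic leaf index */
    if (max_leaf >= 0x1F) {
        __cpuid_count(0x1F, 0, eax, ebx, ecx, edx);
        if ((ebx & 0xFFFF) != 0)                     /* CPUID.1FH:EBX[15:0] non-zero */
            return 0x1F;
    }
    if (max_leaf >= 0x0B) {
        __cpuid_count(0x0B, 0, eax, ebx, ecx, edx);
        if ((ebx & 0xFFFF) != 0)                     /* CPUID.0BH:EBX[15:0] non-zero */
            return 0x0B;
    }
    return 0;
}

int main(void) {
    printf("topology leaf: %02XH\n", topology_leaf());
    return 0;
}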

INPUT EAX = 0DH: Returns Processor Extended States Enumeration Information


When CPUID executes with EAX set to 0DH and ECX = 0, the processor returns information about the bit-vector
representation of all processor state extensions that are supported in the processor and storage size requirements
of the XSAVE/XRSTOR area. See Table 3-17.
When CPUID executes with EAX set to 0DH and ECX = n (n > 1, and is a valid sub-leaf index), the processor returns
information about the size and offset of each processor extended state save area within the XSAVE/XRSTOR area.
See Table 3-17. Software can use the forward-extendable technique depicted below to query the valid sub-leaves
and obtain size and offset information for each processor extended state save area:
For i = 2 to 62 // sub-leaf 1 is reserved
IF (CPUID.(EAX=0DH, ECX=0H):VECTOR[i] = 1 ) // VECTOR is the 64-bit value of EDX:EAX
Execute CPUID.(EAX=0DH, ECX = i) to examine size and offset for sub-leaf i;
FI;
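
The same loop expressed as a C sketch (assuming, per Table 3-17, that sub-leaf i reports the save-area size in EAX and its offset in EBX; GCC/Clang <cpuid.h>):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x0D, 0, &eax, &ebx, &ecx, &edx))
        return 1;
    /* VECTOR is the 64-bit value of EDX:EAX from sub-leaf 0. */
    unsigned long long vector = ((unsigned long long)edx << 32) | eax;
    for (unsigned int i = 2; i <= 62; i++) {         /* sub-leaf 1 is reserved */
        if (vector & (1ULL << i)) {
            __get_cpuid_count(0x0D, i, &eax, &ebx, &ecx, &edx);
            printf("state %u: %u bytes at offset %u\n", i, eax, ebx);
        }
    }
    return 0;
}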

INPUT EAX = 0FH: Returns Intel Resource Director Technology (Intel RDT) Monitoring Enumeration Information
When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector
representation of QoS monitoring resource types that are supported in the processor and the maximum range of
RMID values the processor can use to monitor any supported resource type. Each bit, starting from bit 1, corresponds
to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that
software must use to query the QoS monitoring capability available for that type. See Table 3-17.
When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns
information software can use to program the IA32_PQR_ASSOC and IA32_QM_EVTSEL MSRs before reading QoS
data from the IA32_QM_CTR MSR.

INPUT EAX = 10H: Returns Intel Resource Director Technology (Intel RDT) Allocation Enumeration Information
When CPUID executes with EAX set to 10H and ECX = 0, the processor returns information about the bit-vector
representation of QoS enforcement resource types that are supported in the processor. Each bit, starting from bit
1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or
ResID) that software must use to query the QoS enforcement capability available for that type. See Table 3-17.
When CPUID executes with EAX set to 10H and ECX = n (n >= 1, and is a valid ResID), the processor returns
information about available classes of service and the range of QoS mask MSRs that software can use to configure
each class of service using capability bit masks in the QoS Mask registers, IA32_resourceType_Mask_n.

INPUT EAX = 12H: Returns Intel SGX Enumeration Information


When CPUID executes with EAX set to 12H and ECX = 0H, the processor returns information about Intel SGX capa-
bilities. See Table 3-17.
When CPUID executes with EAX set to 12H and ECX = 1H, the processor returns information about Intel SGX attri-
butes. See Table 3-17.
When CPUID executes with EAX set to 12H and ECX = n (n > 1), the processor returns information about Intel SGX
Enclave Page Cache. See Table 3-17.

INPUT EAX = 14H: Returns Intel Processor Trace Enumeration Information


When CPUID executes with EAX set to 14H and ECX = 0H, the processor returns information about Intel Processor
Trace extensions. See Table 3-17.
When CPUID executes with EAX set to 14H and ECX = n (n > 0 and less than the number of non-zero bits in
CPUID.(EAX=14H, ECX= 0H).EAX), the processor returns information about packet generation in Intel Processor
Trace. See Table 3-17.

INPUT EAX = 15H: Returns Time Stamp Counter and Nominal Core Crystal Clock Information
When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp
Counter and Core Crystal Clock. See Table 3-17.
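
An illustrative C sketch (assuming, per Table 3-17, that leaf 15H returns the TSC/crystal ratio denominator in EAX, the numerator in EBX, and the nominal core crystal clock frequency in Hz in ECX, where 0 means not enumerated; GCC/Clang <cpuid.h>):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int den, num, crystal_hz, edx;
    if (!__get_cpuid(0x15, &den, &num, &crystal_hz, &edx) || num == 0 || den == 0)
        return 1;                     /* leaf or TSC/crystal ratio not enumerated */
    if (crystal_hz == 0) {
        puts("core crystal clock frequency not enumerated");
        return 0;
    }
    /* TSC frequency = crystal clock frequency * (EBX / EAX). */
    unsigned long long tsc_hz = (unsigned long long)crystal_hz * num / den;
    printf("TSC frequency: %llu Hz\n", tsc_hz);
    return 0;
}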

INPUT EAX = 16H: Returns Processor Frequency Information


When CPUID executes with EAX set to 16H, the processor returns information about Processor Frequency Informa-
tion. See Table 3-17.

INPUT EAX = 17H: Returns System-On-Chip Information


When CPUID executes with EAX set to 17H, the processor returns information about the System-On-Chip Vendor
Attribute Enumeration. See Table 3-17.

INPUT EAX = 18H: Returns Deterministic Address Translation Parameters Information


When CPUID executes with EAX set to 18H, the processor returns information about the Deterministic Address
Translation Parameters. See Table 3-17.

INPUT EAX = 19H: Returns Key Locker Information


When CPUID executes with EAX set to 19H, the processor returns information about Key Locker. See Table 3-17.

INPUT EAX = 1AH: Returns Native Model ID Information


When CPUID executes with EAX set to 1AH, the processor returns information about Native Model Identification.
See Table 3-17.

INPUT EAX = 1BH: Returns PCONFIG Information


When CPUID executes with EAX set to 1BH, the processor returns information about PCONFIG capabilities. This
information is enumerated in sub-leaves selected by the value of ECX (starting with 0).

Each sub-leaf of CPUID function 1BH enumerates its sub-leaf type in EAX. If a sub-leaf type is 0, the sub-leaf is
invalid and zero is returned in EBX, ECX, and EDX. In this case, all subsequent sub-leaves (selected by larger input
values of ECX) are also invalid.
The only valid sub-leaf type currently defined is 1, indicating that the sub-leaf enumerates target identifiers for the
PCONFIG instruction. Any non-zero value returned in EBX, ECX, or EDX indicates a valid target identifier of the
PCONFIG instruction (any value of zero should be ignored). The only target identifier currently defined is 1, indi-
cating TME-MK. See the “PCONFIG—Platform Configuration” instruction in Chapter 4 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 2B, for more information.
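
A C sketch of this sub-leaf walk (not normative; it assumes the GCC/Clang <cpuid.h> helper):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    for (unsigned int i = 0; ; i++) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x1B, i, &eax, &ebx, &ecx, &edx))
            break;
        if (eax == 0)      /* sub-leaf type 0: invalid; later sub-leaves are too */
            break;
        if (eax == 1) {    /* sub-leaf type 1: target identifiers in EBX/ECX/EDX */
            unsigned int ids[3] = { ebx, ecx, edx };
            for (int r = 0; r < 3; r++)
                if (ids[r] != 0)       /* zero is ignored; 1 indicates TME-MK */
                    printf("PCONFIG target identifier: %u\n", ids[r]);
        }
    }
    return 0;
}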

INPUT EAX = 1CH: Returns Last Branch Record Information


When CPUID executes with EAX set to 1CH, the processor returns information about LBRs (the architectural
feature). See Table 3-17.

INPUT EAX = 1DH: Returns Tile Information


When CPUID executes with EAX set to 1DH and ECX = 0H, the processor returns information about tile
architecture. See Table 3-17.
When CPUID executes with EAX set to 1DH and ECX = 1H, the processor returns information about tile palette 1.
See Table 3-17.

INPUT EAX = 1EH: Returns TMUL Information


When CPUID executes with EAX set to 1EH and ECX = 0H, the processor returns information about TMUL
capabilities. See Table 3-17.

INPUT EAX = 1FH: Returns V2 Extended Topology Information


When CPUID executes with EAX set to 1FH, the processor returns information about extended topology enumera-
tion data. Software must detect the presence of CPUID leaf 1FH by verifying (a) the highest leaf index supported
by CPUID is >= 1FH, and (b) CPUID.1FH:EBX[15:0] reports a non-zero value. See Table 3-17.

INPUT EAX = 20H: Returns History Reset Information


When CPUID executes with EAX set to 20H, the processor returns information about History Reset. See Table 3-17.

INPUT EAX = 23H: Returns Architectural Performance Monitoring Extended Information


When CPUID executes with EAX set to 23H, the processor returns architectural performance monitoring extended
information. See Table 3-17.

INPUT EAX = 24H: Returns Intel AVX10 Converged Vector ISA Information
When CPUID executes with EAX set to 24H, the processor returns Intel AVX10 converged vector ISA information.
See Table 3-17.

METHODS FOR RETURNING BRANDING INFORMATION


Use the following techniques to access branding information:
1. Processor brand string method.
2. Processor brand index; this method uses a software-supplied brand string table.
These two methods are discussed in the following sections. For methods that are available in early processors, see
Section: “Identification of Earlier IA-32 Processors” in Chapter 20 of the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 1.

The Processor Brand String Method


Figure 3-9 describes the algorithm used for detection of the brand string. Processor brand identification software
should execute this algorithm on all Intel 64 and IA-32 processors.

This method (introduced with Pentium 4 processors) returns an ASCII brand identification string and the Processor
Base frequency of the processor to the EAX, EBX, ECX, and EDX registers.

The flowchart in Figure 3-9, reconstructed as text:
Execute CPUID with input EAX = 80000000H.
IF (EAX & 80000000H) = 0 (* bit 31 of the return value is clear *)
    THEN the processor brand string is not supported.
    ELSE (* extended CPUID functions are supported; the EAX return value is the maximum extended CPUID function index *)
        IF (EAX return value >= 80000004H)
            THEN the processor brand string is supported.
            ELSE the processor brand string is not supported.

Figure 3-9. Determination of Support for the Processor Brand String

How Brand Strings Work


To use the brand string method, execute CPUID with EAX input of 80000002H through 80000004H. For each input
value, CPUID returns 16 ASCII characters using EAX, EBX, ECX, and EDX. The returned string will be
NULL-terminated.
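
A C sketch of the brand string method (an illustration assuming the GCC/Clang <cpuid.h> helpers):

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int regs[4];
    char brand[49] = { 0 };                       /* 48 characters + NULL */
    if (__get_cpuid_max(0x80000000, 0) < 0x80000004)
        return 1;                                 /* brand string not supported */
    for (unsigned int i = 0; i < 3; i++) {
        __get_cpuid(0x80000002 + i, &regs[0], &regs[1], &regs[2], &regs[3]);
        memcpy(brand + 16 * i, regs, 16);         /* EAX..EDX: 16 ASCII chars */
    }
    printf("%s\n", brand);
    return 0;
}
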
Table 3-22 shows the brand string that is returned by the first processor in the Pentium 4 processor family.

Table 3-22. Processor Brand String Returned with Pentium 4 Processor


EAX Input Value Return Values ASCII Equivalent
80000002H EAX = 20202020H “ ”
EBX = 20202020H “ ”
ECX = 20202020H “ ”
EDX = 6E492020H “nI ”
80000003H EAX = 286C6574H “(let”
EBX = 50202952H “P )R”
ECX = 69746E65H “itne”
EDX = 52286D75H “R(mu”

80000004H EAX = 20342029H “ 4 )”
EBX = 20555043H “ UPC”
ECX = 30303531H “0051”
EDX = 007A484DH “\0zHM”

Extracting the Processor Frequency from Brand Strings


Figure 3-10 provides an algorithm which software can use to extract the Processor Base frequency from the
processor brand string.

The algorithm in Figure 3-10, reconstructed as text:
1. Scan the brand string in reverse byte order for the substring “zHM”, “zHG”, or “zHT” (that is, “MHz”, “GHz”, or “THz” read forward).
2. IF no substring matched, report an error.
3. Determine the multiplier: if “zHM”, Multiplier = 1 x 10^6; if “zHG”, Multiplier = 1 x 10^9; if “zHT”, Multiplier = 1 x 10^12.
4. Determine “Freq”: scan the digits in reverse order until a blank is encountered, then reverse the digits to obtain the decimal value (digits scanned as “ZY.X” yield “Freq” = X.YZ).
5. Processor Base Frequency = “Freq” x “Multiplier”.

Figure 3-10. Algorithm for Extracting Processor Frequency
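
A C sketch of the algorithm, scanning forward rather than in reverse byte order for clarity (the unit substrings and multipliers are those of Figure 3-10; only the standard C library is assumed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the Processor Base frequency in Hz parsed from a brand string,
   or 0 if no MHz/GHz/THz substring is found. */
static unsigned long long brand_frequency_hz(const char *brand) {
    static const struct { const char *unit; double mult; } units[] = {
        { "MHz", 1e6 }, { "GHz", 1e9 }, { "THz", 1e12 },
    };
    for (int u = 0; u < 3; u++) {
        const char *p = strstr(brand, units[u].unit);
        if (p == NULL)
            continue;
        /* Walk back over the "x.yz" digits immediately preceding the unit. */
        const char *start = p;
        while (start > brand &&
               (start[-1] == '.' || (start[-1] >= '0' && start[-1] <= '9')))
            start--;
        return (unsigned long long)(atof(start) * units[u].mult);
    }
    return 0;
}

int main(void) {
    /* Brand string returned by the first Pentium 4 processor (Table 3-22). */
    printf("%llu Hz\n", brand_frequency_hz("Intel(R) Pentium(R) 4 CPU 1500MHz"));
    return 0;
}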

The Processor Brand Index Method


The brand index method (introduced with Pentium® III Xeon® processors) provides an entry point into a brand
identification table that is maintained in memory by system software and is accessible from system- and user-level
code. In this table, each brand index is associated with an ASCII brand identification string that identifies the official
Intel family and model number of a processor.
When CPUID executes with EAX set to 1, the processor returns a brand index to the low byte in EBX. Software can
then use this index to locate the brand identification string for the processor in the brand identification table. The
first entry (brand index 0) in this table is reserved, allowing for backward compatibility with processors that do not
support the brand identification feature. Starting with processor signature family ID = 0FH, model = 03H, the
brand index method is no longer supported. Use the brand string method instead.
Table 3-23 shows brand indices that have identification strings associated with them.

Table 3-23. Mapping of Brand Indices and Intel 64 and IA-32 Processor Brand Strings
Brand Index Brand String
00H This processor does not support the brand identification feature
01H Intel(R) Celeron(R) processor1
02H Intel(R) Pentium(R) III processor1
03H Intel(R) Pentium(R) III Xeon(R) processor; If processor signature = 000006B1h, then Intel(R) Celeron(R)
processor
04H Intel(R) Pentium(R) III processor
06H Mobile Intel(R) Pentium(R) III processor-M
07H Mobile Intel(R) Celeron(R) processor1
08H Intel(R) Pentium(R) 4 processor
09H Intel(R) Pentium(R) 4 processor
0AH Intel(R) Celeron(R) processor1
0BH Intel(R) Xeon(R) processor; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor MP
0CH Intel(R) Xeon(R) processor MP
0EH Mobile Intel(R) Pentium(R) 4 processor-M; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor
0FH Mobile Intel(R) Celeron(R) processor1
11H Mobile Genuine Intel(R) processor
12H Intel(R) Celeron(R) M processor
13H Mobile Intel(R) Celeron(R) processor1
14H Intel(R) Celeron(R) processor
15H Mobile Genuine Intel(R) processor
16H Intel(R) Pentium(R) M processor
17H Mobile Intel(R) Celeron(R) processor1
18H – 0FFH RESERVED
NOTES:
1. Indicates versions of these processors that were introduced after the Pentium III.

IA-32 Architecture Compatibility


CPUID is not supported in early models of the Intel486 processor or in any IA-32 processor earlier than the
Intel486 processor.

Operation
IA32_BIOS_SIGN_ID MSR := Update with installed microcode revision number;
CASE (EAX) OF
EAX = 0:
EAX := Highest basic function input value understood by CPUID;
EBX := Vendor identification string;
EDX := Vendor identification string;
ECX := Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] := Stepping ID;
EAX[7:4] := Model;
EAX[11:8] := Family;
EAX[13:12] := Processor type;
EAX[15:14] := Reserved;
EAX[19:16] := Extended Model;
EAX[27:20] := Extended Family;
EAX[31:28] := Reserved;
EBX[7:0] := Brand Index; (* Reserved if the value is zero. *)
EBX[15:8] := CLFLUSH Line Size;
EBX[23:16] := Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
EBX[31:24] := Initial APIC ID;
ECX := Feature flags; (* See Figure 3-7. *)
EDX := Feature flags; (* See Figure 3-8. *)
BREAK;
EAX = 2H:
EAX := Cache and TLB information;
EBX := Cache and TLB information;
ECX := Cache and TLB information;
EDX := Cache and TLB information;
BREAK;
EAX = 3H:
EAX := Reserved;
EBX := Reserved;
ECX := ProcessorSerialNumber[31:0];
(* Pentium III processors only, otherwise reserved. *)
EDX := ProcessorSerialNumber[63:32];
(* Pentium III processors only, otherwise reserved. *)
BREAK;
EAX = 4H:
EAX := Deterministic Cache Parameters Leaf; (* See Table 3-17. *)
EBX := Deterministic Cache Parameters Leaf;
ECX := Deterministic Cache Parameters Leaf;
EDX := Deterministic Cache Parameters Leaf;
BREAK;
EAX = 5H:
EAX := MONITOR/MWAIT Leaf; (* See Table 3-17. *)
EBX := MONITOR/MWAIT Leaf;
ECX := MONITOR/MWAIT Leaf;
EDX := MONITOR/MWAIT Leaf;
BREAK;
EAX = 6H:
EAX := Thermal and Power Management Leaf; (* See Table 3-17. *)
EBX := Thermal and Power Management Leaf;
ECX := Thermal and Power Management Leaf;
EDX := Thermal and Power Management Leaf;
BREAK;
EAX = 7H:
EAX := Structured Extended Feature Flags Enumeration Leaf; (* See Table 3-17. *)
EBX := Structured Extended Feature Flags Enumeration Leaf;
ECX := Structured Extended Feature Flags Enumeration Leaf;
EDX := Structured Extended Feature Flags Enumeration Leaf;
BREAK;
EAX = 8H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 9H:
EAX := Direct Cache Access Information Leaf; (* See Table 3-17. *)
EBX := Direct Cache Access Information Leaf;
ECX := Direct Cache Access Information Leaf;
EDX := Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
EAX := Architectural Performance Monitoring Leaf; (* See Table 3-17. *)
EBX := Architectural Performance Monitoring Leaf;
ECX := Architectural Performance Monitoring Leaf;
EDX := Architectural Performance Monitoring Leaf;
BREAK;
EAX = BH:
EAX := Extended Topology Enumeration Leaf; (* See Table 3-17. *)
EBX := Extended Topology Enumeration Leaf;
ECX := Extended Topology Enumeration Leaf;
EDX := Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = DH:
EAX := Processor Extended State Enumeration Leaf; (* See Table 3-17. *)
EBX := Processor Extended State Enumeration Leaf;
ECX := Processor Extended State Enumeration Leaf;
EDX := Processor Extended State Enumeration Leaf;
BREAK;
EAX = EH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = FH:
EAX := Intel Resource Director Technology Monitoring Enumeration Leaf; (* See Table 3-17. *)
EBX := Intel Resource Director Technology Monitoring Enumeration Leaf;
ECX := Intel Resource Director Technology Monitoring Enumeration Leaf;
EDX := Intel Resource Director Technology Monitoring Enumeration Leaf;
BREAK;
EAX = 10H:
EAX := Intel Resource Director Technology Allocation Enumeration Leaf; (* See Table 3-17. *)
EBX := Intel Resource Director Technology Allocation Enumeration Leaf;
ECX := Intel Resource Director Technology Allocation Enumeration Leaf;
EDX := Intel Resource Director Technology Allocation Enumeration Leaf;
BREAK;
EAX = 12H:
EAX := Intel SGX Enumeration Leaf; (* See Table 3-17. *)
EBX := Intel SGX Enumeration Leaf;
ECX := Intel SGX Enumeration Leaf;
EDX := Intel SGX Enumeration Leaf;
BREAK;
EAX = 14H:
EAX := Intel Processor Trace Enumeration Leaf; (* See Table 3-17. *)
EBX := Intel Processor Trace Enumeration Leaf;
ECX := Intel Processor Trace Enumeration Leaf;
EDX := Intel Processor Trace Enumeration Leaf;
BREAK;
EAX = 15H:
EAX := Time Stamp Counter and Nominal Core Crystal Clock Information Leaf; (* See Table 3-17. *)
EBX := Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
ECX := Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
EDX := Time Stamp Counter and Nominal Core Crystal Clock Information Leaf;
BREAK;
EAX = 16H:
EAX := Processor Frequency Information Enumeration Leaf; (* See Table 3-17. *)
EBX := Processor Frequency Information Enumeration Leaf;
ECX := Processor Frequency Information Enumeration Leaf;
EDX := Processor Frequency Information Enumeration Leaf;
BREAK;
EAX = 17H:
EAX := System-On-Chip Vendor Attribute Enumeration Leaf; (* See Table 3-17. *)
EBX := System-On-Chip Vendor Attribute Enumeration Leaf;
ECX := System-On-Chip Vendor Attribute Enumeration Leaf;
EDX := System-On-Chip Vendor Attribute Enumeration Leaf;
BREAK;
EAX = 18H:
EAX := Deterministic Address Translation Parameters Enumeration Leaf; (* See Table 3-17. *)
EBX := Deterministic Address Translation Parameters Enumeration Leaf;
ECX := Deterministic Address Translation Parameters Enumeration Leaf;
EDX := Deterministic Address Translation Parameters Enumeration Leaf;
BREAK;
EAX = 19H:
EAX := Key Locker Enumeration Leaf; (* See Table 3-17. *)
EBX := Key Locker Enumeration Leaf;
ECX := Key Locker Enumeration Leaf;
EDX := Key Locker Enumeration Leaf;
BREAK;
EAX = 1AH:
EAX := Native Model ID Enumeration Leaf; (* See Table 3-17. *)
EBX := Native Model ID Enumeration Leaf;
ECX := Native Model ID Enumeration Leaf;
EDX := Native Model ID Enumeration Leaf;
BREAK;
EAX = 1BH:
EAX := PCONFIG Information Enumeration Leaf; (* See “INPUT EAX = 1BH: Returns PCONFIG Information” on page 3-259. *)
EBX := PCONFIG Information Enumeration Leaf;
ECX := PCONFIG Information Enumeration Leaf;
EDX := PCONFIG Information Enumeration Leaf;
BREAK;
EAX = 1CH:
EAX := Last Branch Record Information Enumeration Leaf; (* See Table 3-17. *)
EBX := Last Branch Record Information Enumeration Leaf;
ECX := Last Branch Record Information Enumeration Leaf;
EDX := Last Branch Record Information Enumeration Leaf;
BREAK;
EAX = 1DH:
EAX := Tile Information Enumeration Leaf; (* See Table 3-17. *)
EBX := Tile Information Enumeration Leaf;
ECX := Tile Information Enumeration Leaf;
EDX := Tile Information Enumeration Leaf;
BREAK;
EAX = 1EH:
EAX := TMUL Information Enumeration Leaf; (* See Table 3-17. *)
EBX := TMUL Information Enumeration Leaf;
ECX := TMUL Information Enumeration Leaf;
EDX := TMUL Information Enumeration Leaf;
BREAK;
EAX = 1FH:
EAX := V2 Extended Topology Enumeration Leaf; (* See Table 3-17. *)
EBX := V2 Extended Topology Enumeration Leaf;
ECX := V2 Extended Topology Enumeration Leaf;
EDX := V2 Extended Topology Enumeration Leaf;
BREAK;
EAX = 20H:
EAX := Processor History Reset Sub-leaf; (* See Table 3-17. *)
EBX := Processor History Reset Sub-leaf;
ECX := Processor History Reset Sub-leaf;
EDX := Processor History Reset Sub-leaf;
BREAK;
EAX = 23H:
EAX := Architectural Performance Monitoring Extended Leaf; (* See Table 3-17. *)
EBX := Architectural Performance Monitoring Extended Leaf;
ECX := Architectural Performance Monitoring Extended Leaf;
EDX := Architectural Performance Monitoring Extended Leaf;
BREAK;
EAX = 24H:
EAX := Intel AVX10 Converged Vector ISA Leaf; (* See Table 3-17. *)
EBX := Intel AVX10 Converged Vector ISA Leaf;
ECX := Intel AVX10 Converged Vector ISA Leaf;
EDX := Intel AVX10 Converged Vector ISA Leaf;
BREAK;
EAX = 80000000H:
EAX := Highest extended function input value understood by CPUID;
EBX := Reserved;
ECX := Reserved;
EDX := Reserved;
BREAK;
EAX = 80000001H:
EAX := Reserved;
EBX := Reserved;
ECX := Extended Feature Bits (* See Table 3-17.*);
EDX := Extended Feature Bits (* See Table 3-17. *);
BREAK;
EAX = 80000002H:
EAX := Processor Brand String;
EBX := Processor Brand String, continued;
ECX := Processor Brand String, continued;
EDX := Processor Brand String, continued;
BREAK;
EAX = 80000003H:
EAX := Processor Brand String, continued;
EBX := Processor Brand String, continued;
ECX := Processor Brand String, continued;
EDX := Processor Brand String, continued;
BREAK;
EAX = 80000004H:
EAX := Processor Brand String, continued;
EBX := Processor Brand String, continued;
ECX := Processor Brand String, continued;
EDX := Processor Brand String, continued;
BREAK;
EAX = 80000005H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 80000006H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Cache information;
EDX := Reserved = 0;
BREAK;
EAX = 80000007H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Misc Feature Flags;
BREAK;
EAX = 80000008H:
EAX := Address Size Information;
EBX := Misc Feature Flags;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX >= 40000000H and EAX <= 4FFFFFFFH:
DEFAULT: (* EAX = Value outside of recognized range for CPUID. *)
(* If the data of the highest basic information leaf depend on the ECX input value, ECX is honored. *)
EAX := Reserved; (* Information returned for highest basic information leaf. *)
EBX := Reserved; (* Information returned for highest basic information leaf. *)
ECX := Reserved; (* Information returned for highest basic information leaf. *)
EDX := Reserved; (* Information returned for highest basic information leaf. *)
BREAK;
ESAC;

Flags Affected
None.

Exceptions (All Operating Modes)
#UD If the LOCK prefix is used.
In earlier IA-32 processors that do not support the CPUID instruction, execution of the instruc-
tion results in an invalid opcode (#UD) exception being generated.

CVTDQ2PD—Convert Packed Doubleword Integers to Packed Double Precision Floating-Point
Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F E6 /r CVTDQ2PD xmm1, xmm2/m64 | A | V/V | SSE2 | Convert two packed signed doubleword integers from xmm2/mem to two packed double precision floating-point values in xmm1.
VEX.128.F3.0F.WIG E6 /r VCVTDQ2PD xmm1, xmm2/m64 | A | V/V | AVX | Convert two packed signed doubleword integers from xmm2/mem to two packed double precision floating-point values in xmm1.
VEX.256.F3.0F.WIG E6 /r VCVTDQ2PD ymm1, xmm2/m128 | A | V/V | AVX | Convert four packed signed doubleword integers from xmm2/mem to four packed double precision floating-point values in ymm1.
EVEX.128.F3.0F.W0 E6 /r VCVTDQ2PD xmm1 {k1}{z}, xmm2/m64/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert two packed signed doubleword integers from xmm2/m64/m32bcst to two packed double precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W0 E6 /r VCVTDQ2PD ymm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed signed doubleword integers from xmm2/m128/m32bcst to four packed double precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W0 E6 /r VCVTDQ2PD zmm1 {k1}{z}, ymm2/m256/m32bcst | B | V/V | AVX512F OR AVX10.1 (Note 1) | Convert eight packed signed doubleword integers from ymm2/m256/m32bcst to eight packed double precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts two, four or eight packed signed doubleword integers in the source operand (the second operand) to two,
four or eight packed double precision floating-point values in the destination operand (the first operand).
EVEX encoded versions: The source operand can be a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit
memory location, or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand
is a ZMM/YMM/XMM register conditionally updated with writemask k1. An attempt to encode this instruction with
EVEX embedded rounding is ignored.
VEX.256 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is a YMM register.
VEX.128 encoded version: The source operand is an XMM register or 64-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 64-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.

VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

[Figure 3-11 diagram: the four packed doubleword elements X3..X0 of the 128-bit source are converted to the four packed double precision elements X3..X0 of the 256-bit destination.]

Figure 3-11. CVTDQ2PD (VEX.256 encoded version)

Operation
VCVTDQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTDQ2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTDQ2PD (VEX.256 Encoded Version)


DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[191:128] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[95:64])
DEST[255:192] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[127:96])
DEST[MAXVL-1:256] := 0

VCVTDQ2PD (VEX.128 Encoded Version)


DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[MAXVL-1:128] := 0

CVTDQ2PD (128-bit Legacy SSE Version)


DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTDQ2PD __m512d _mm512_cvtepi32_pd( __m256i a);
VCVTDQ2PD __m512d _mm512_mask_cvtepi32_pd( __m512d s, __mmask8 k, __m256i a);
VCVTDQ2PD __m512d _mm512_maskz_cvtepi32_pd( __mmask8 k, __m256i a);
VCVTDQ2PD __m256d _mm256_cvtepi32_pd (__m128i src);
VCVTDQ2PD __m256d _mm256_mask_cvtepi32_pd( __m256d s, __mmask8 k, __m256i a);
VCVTDQ2PD __m256d _mm256_maskz_cvtepi32_pd( __mmask8 k, __m256i a);
VCVTDQ2PD __m128d _mm_mask_cvtepi32_pd( __m128d s, __mmask8 k, __m128i a);
VCVTDQ2PD __m128d _mm_maskz_cvtepi32_pd( __mmask8 k, __m128i a);
CVTDQ2PD __m128d _mm_cvtepi32_pd (__m128i src)
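
A brief usage sketch for the SSE2 intrinsic listed above (it assumes a compiler providing <immintrin.h>):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Convert the two low signed doubleword integers to double precision. */
    __m128i src = _mm_set_epi32(0, 0, -7, 42);   /* elements e1 = -7, e0 = 42 */
    __m128d dst = _mm_cvtepi32_pd(src);
    double out[2];
    _mm_storeu_pd(out, dst);
    printf("%f %f\n", out[0], out[1]);           /* prints 42.000000 -7.000000 */
    return 0;
}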

Other Exceptions
VEX-encoded instructions, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTDQ2PS—Convert Packed Doubleword Integers to Packed Single Precision Floating-Point
Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 5B /r CVTDQ2PS xmm1, xmm2/m128 | A | V/V | SSE2 | Convert four packed signed doubleword integers from xmm2/mem to four packed single precision floating-point values in xmm1.
VEX.128.0F.WIG 5B /r VCVTDQ2PS xmm1, xmm2/m128 | A | V/V | AVX | Convert four packed signed doubleword integers from xmm2/mem to four packed single precision floating-point values in xmm1.
VEX.256.0F.WIG 5B /r VCVTDQ2PS ymm1, ymm2/m256 | A | V/V | AVX | Convert eight packed signed doubleword integers from ymm2/mem to eight packed single precision floating-point values in ymm1.
EVEX.128.0F.W0 5B /r VCVTDQ2PS xmm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed signed doubleword integers from xmm2/m128/m32bcst to four packed single precision floating-point values in xmm1 with writemask k1.
EVEX.256.0F.W0 5B /r VCVTDQ2PS ymm1 {k1}{z}, ymm2/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert eight packed signed doubleword integers from ymm2/m256/m32bcst to eight packed single precision floating-point values in ymm1 with writemask k1.
EVEX.512.0F.W0 5B /r VCVTDQ2PS zmm1 {k1}{z}, zmm2/m512/m32bcst {er} | B | V/V | AVX512F OR AVX10.1 (Note 1) | Convert sixteen packed signed doubleword integers from zmm2/m512/m32bcst to sixteen packed single precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts four, eight or sixteen packed signed doubleword integers in the source operand to four, eight or sixteen
packed single precision floating-point values in the destination operand.
EVEX encoded versions: The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is a YMM register. Bits (MAXVL-1:256) of the corresponding register destination are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are
unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTDQ2PS (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTDQ2PS (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTDQ2PS (VEX.256 Encoded Version)
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[159:128] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[159:128])
DEST[191:160] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[191:160])
DEST[223:192] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[223:192])
DEST[255:224] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[255:224])
DEST[MAXVL-1:256] := 0

VCVTDQ2PS (VEX.128 Encoded Version)


DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[MAXVL-1:128] := 0

CVTDQ2PS (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTDQ2PS __m512 _mm512_cvtepi32_ps( __m512i a);
VCVTDQ2PS __m512 _mm512_mask_cvtepi32_ps( __m512 s, __mmask16 k, __m512i a);
VCVTDQ2PS __m512 _mm512_maskz_cvtepi32_ps( __mmask16 k, __m512i a);
VCVTDQ2PS __m512 _mm512_cvt_roundepi32_ps( __m512i a, int r);
VCVTDQ2PS __m512 _mm512_mask_cvt_roundepi32_ps( __m512 s, __mmask16 k, __m512i a, int r);
VCVTDQ2PS __m512 _mm512_maskz_cvt_roundepi32_ps( __mmask16 k, __m512i a, int r);
VCVTDQ2PS __m256 _mm256_mask_cvtepi32_ps( __m256 s, __mmask8 k, __m256i a);
VCVTDQ2PS __m256 _mm256_maskz_cvtepi32_ps( __mmask8 k, __m256i a);
VCVTDQ2PS __m128 _mm_mask_cvtepi32_ps( __m128 s, __mmask8 k, __m128i a);
VCVTDQ2PS __m128 _mm_maskz_cvtepi32_ps( __mmask8 k, __m128i a);
CVTDQ2PS __m256 _mm256_cvtepi32_ps (__m256i src)
CVTDQ2PS __m128 _mm_cvtepi32_ps (__m128i src)

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTPD2DQ—Convert Packed Double Precision Floating-Point Values to Packed Doubleword
Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F E6 /r CVTPD2DQ xmm1, xmm2/m128 | A | V/V | SSE2 | Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
VEX.128.F2.0F.WIG E6 /r VCVTPD2DQ xmm1, xmm2/m128 | A | V/V | AVX | Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
VEX.256.F2.0F.WIG E6 /r VCVTPD2DQ xmm1, ymm2/m256 | A | V/V | AVX | Convert four packed double precision floating-point values in ymm2/mem to four signed doubleword integers in xmm1.
EVEX.128.F2.0F.W1 E6 /r VCVTPD2DQ xmm1 {k1}{z}, xmm2/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two signed doubleword integers in xmm1 subject to writemask k1.
EVEX.256.F2.0F.W1 E6 /r VCVTPD2DQ xmm1 {k1}{z}, ymm2/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four signed doubleword integers in xmm1 subject to writemask k1.
EVEX.512.F2.0F.W1 E6 /r VCVTPD2DQ ymm1 {k1}{z}, zmm2/m512/m64bcst {er} | B | V/V | AVX512F OR AVX10.1 (Note 1) | Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight signed doubleword integers in ymm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed double precision floating-point values in the source operand (second operand) to packed signed
doubleword integers in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location,
or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits
(MAXVL-1:256/128/64) of the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits
(MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

[Figure 3-12 diagram: the four packed double precision elements X3..X0 of the 256-bit source are converted to the four packed doubleword elements X3..X0 in the lower half of the destination; the upper half of the destination is zeroed.]

Figure 3-12. VCVTPD2DQ (VEX.256 encoded version)

Operation
VCVTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPD2DQ (VEX.256 Encoded Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
DEST[95:64] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[191:128])
DEST[127:96] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[255:192])
DEST[MAXVL-1:128] := 0

VCVTPD2DQ (VEX.128 Encoded Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
DEST[MAXVL-1:64] := 0

CVTPD2DQ (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
DEST[127:64] := 0
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPD2DQ __m256i _mm512_cvtpd_epi32( __m512d a);
VCVTPD2DQ __m256i _mm512_mask_cvtpd_epi32( __m256i s, __mmask8 k, __m512d a);
VCVTPD2DQ __m256i _mm512_maskz_cvtpd_epi32( __mmask8 k, __m512d a);
VCVTPD2DQ __m256i _mm512_cvt_roundpd_epi32( __m512d a, int r);
VCVTPD2DQ __m256i _mm512_mask_cvt_roundpd_epi32( __m256i s, __mmask8 k, __m512d a, int r);
VCVTPD2DQ __m256i _mm512_maskz_cvt_roundpd_epi32( __mmask8 k, __m512d a, int r);
VCVTPD2DQ __m128i _mm256_mask_cvtpd_epi32( __m128i s, __mmask8 k, __m256d a);
VCVTPD2DQ __m128i _mm256_maskz_cvtpd_epi32( __mmask8 k, __m256d a);
VCVTPD2DQ __m128i _mm_mask_cvtpd_epi32( __m128i s, __mmask8 k, __m128d a);
VCVTPD2DQ __m128i _mm_maskz_cvtpd_epi32( __mmask8 k, __m128d a);
VCVTPD2DQ __m128i _mm256_cvtpd_epi32 (__m256d src)
CVTPD2DQ __m128i _mm_cvtpd_epi32 (__m128d src)

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTPD2PS—Convert Packed Double Precision Floating-Point Values to Packed Single Precision
Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5A /r CVTPD2PS xmm1, xmm2/m128 | A | V/V | SSE2 | Convert two packed double precision floating-point values in xmm2/mem to two single precision floating-point values in xmm1.
VEX.128.66.0F.WIG 5A /r VCVTPD2PS xmm1, xmm2/m128 | A | V/V | AVX | Convert two packed double precision floating-point values in xmm2/mem to two single precision floating-point values in xmm1.
VEX.256.66.0F.WIG 5A /r VCVTPD2PS xmm1, ymm2/m256 | A | V/V | AVX | Convert four packed double precision floating-point values in ymm2/mem to four single precision floating-point values in xmm1.
EVEX.128.66.0F.W1 5A /r VCVTPD2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two single precision floating-point values in xmm1 with writemask k1.
EVEX.256.66.0F.W1 5A /r VCVTPD2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four single precision floating-point values in xmm1 with writemask k1.
EVEX.512.66.0F.W1 5A /r VCVTPD2PS ymm1 {k1}{z}, zmm2/m512/m64bcst {er} | B | V/V | AVX512F OR AVX10.1 (Note 1) | Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight single precision floating-point values in ymm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts two, four or eight packed double precision floating-point values in the source operand (second operand)
to two, four or eight packed single precision floating-point values in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
YMM/XMM/XMM (low 64-bits) register conditionally updated with writemask k1. The upper bits (MAXVL-
1:256/128/64) of the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits
(MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

[Figure 3-13 diagram: the four packed double precision elements X3..X0 of the 256-bit source are converted to the four packed single precision elements X3..X0 in the lower half of the destination; the upper half of the destination is zeroed.]

Figure 3-13. VCVTPD2PS (VEX.256 encoded version)

Operation
VCVTPD2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPD2PS (EVEX Encoded Version) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+31:i] := Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPD2PS (VEX.256 Encoded Version)


DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
DEST[95:64] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[191:128])
DEST[127:96] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[255:192])
DEST[MAXVL-1:128] := 0

VCVTPD2PS (VEX.128 Encoded Version)


DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
DEST[MAXVL-1:64] := 0

CVTPD2PS (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
DEST[127:64] := 0
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPD2PS __m256 _mm512_cvtpd_ps( __m512d a);
VCVTPD2PS __m256 _mm512_mask_cvtpd_ps( __m256 s, __mmask8 k, __m512d a);
VCVTPD2PS __m256 _mm512_maskz_cvtpd_ps( __mmask8 k, __m512d a);
VCVTPD2PS __m256 _mm512_cvt_roundpd_ps( __m512d a, int r);
VCVTPD2PS __m256 _mm512_mask_cvt_roundpd_ps( __m256 s, __mmask8 k, __m512d a, int r);
VCVTPD2PS __m256 _mm512_maskz_cvt_roundpd_ps( __mmask8 k, __m512d a, int r);
VCVTPD2PS __m128 _mm256_mask_cvtpd_ps( __m128 s, __mmask8 k, __m256d a);
VCVTPD2PS __m128 _mm256_maskz_cvtpd_ps( __mmask8 k, __m256d a);
VCVTPD2PS __m128 _mm_mask_cvtpd_ps( __m128 s, __mmask8 k, __m128d a);
VCVTPD2PS __m128 _mm_maskz_cvtpd_ps( __mmask8 k, __m128d a);
VCVTPD2PS __m128 _mm256_cvtpd_ps (__m256d a)
CVTPD2PS __m128 _mm_cvtpd_ps (__m128d a)

SIMD Floating-Point Exceptions


Invalid, Precision, Underflow, Overflow, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTPS2DQ—Convert Packed Single Precision Floating-Point Values to Packed Signed
Doubleword Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5B /r CVTPS2DQ xmm1, xmm2/m128 | A | V/V | SSE2 | Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
VEX.128.66.0F.WIG 5B /r VCVTPS2DQ xmm1, xmm2/m128 | A | V/V | AVX | Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
VEX.256.66.0F.WIG 5B /r VCVTPS2DQ ymm1, ymm2/m256 | A | V/V | AVX | Convert eight packed single precision floating-point values from ymm2/mem to eight packed signed doubleword values in ymm1.
EVEX.128.66.0F.W0 5B /r VCVTPS2DQ xmm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed doubleword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 5B /r VCVTPS2DQ ymm1 {k1}{z}, ymm2/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed doubleword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 5B /r VCVTPS2DQ zmm1 {k1}{z}, zmm2/m512/m32bcst {er} | B | V/V | AVX512F OR AVX10.1¹ | Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to sixteen packed signed doubleword values in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts four, eight or sixteen packed single precision floating-point values in the source operand to four, eight or
sixteen signed doubleword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a ZMM register, a 512-bit memory location or a 512-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM register conditionally updated with
writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
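
A minimal C sketch (not part of the SDM; assumes <immintrin.h>, SSE2, and default MXCSR settings) illustrating the rounding and indefinite-value behavior described above, using the _mm_cvtps_epi32 intrinsic listed later in this section:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 2.5 and -2.5 round to even (2 and -2) under the default
       round-to-nearest-even mode; 3.5e9 exceeds the int32 range, so the
       masked invalid exception yields the indefinite value 80000000H. */
    __m128 src = _mm_setr_ps(2.5f, -2.5f, 3.5e9f, 1.0f);
    __m128i dst = _mm_cvtps_epi32(src);
    int out[4];
    _mm_storeu_si128((__m128i *)out, dst);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 2 -2 -2147483648 1 */
    return 0;
}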

Operation
VCVTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0

FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2DQ (VEX.256 Encoded Version)


DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[159:128] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[159:128])
DEST[191:160] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[191:160])
DEST[223:192] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[223:192])
DEST[255:224] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[255:224])

VCVTPS2DQ (VEX.128 Encoded Version)


DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[MAXVL-1:128] := 0

CVTPS2DQ (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPS2DQ __m512i _mm512_cvtps_epi32( __m512 a);
VCVTPS2DQ __m512i _mm512_mask_cvtps_epi32( __m512i s, __mmask16 k, __m512 a);
VCVTPS2DQ __m512i _mm512_maskz_cvtps_epi32( __mmask16 k, __m512 a);
VCVTPS2DQ __m512i _mm512_cvt_roundps_epi32( __m512 a, int r);
VCVTPS2DQ __m512i _mm512_mask_cvt_roundps_epi32( __m512i s, __mmask16 k, __m512 a, int r);
VCVTPS2DQ __m512i _mm512_maskz_cvt_roundps_epi32( __mmask16 k, __m512 a, int r);
VCVTPS2DQ __m256i _mm256_mask_cvtps_epi32( __m256i s, __mmask8 k, __m256 a);
VCVTPS2DQ __m256i _mm256_maskz_cvtps_epi32( __mmask8 k, __m256 a);
VCVTPS2DQ __m128i _mm_mask_cvtps_epi32( __m128i s, __mmask8 k, __m128 a);
VCVTPS2DQ __m128i _mm_maskz_cvtps_epi32( __mmask8 k, __m128 a);
VCVTPS2DQ __m256i _mm256_cvtps_epi32 (__m256 a)
CVTPS2DQ __m128i _mm_cvtps_epi32 (__m128 a)

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTPS2PD—Convert Packed Single Precision Floating-Point Values to Packed Double Precision
Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 5A /r CVTPS2PD xmm1, xmm2/m64 | A | V/V | SSE2 | Convert two packed single precision floating-point values in xmm2/m64 to two packed double precision floating-point values in xmm1.
VEX.128.0F.WIG 5A /r VCVTPS2PD xmm1, xmm2/m64 | A | V/V | AVX | Convert two packed single precision floating-point values in xmm2/m64 to two packed double precision floating-point values in xmm1.
VEX.256.0F.WIG 5A /r VCVTPS2PD ymm1, xmm2/m128 | A | V/V | AVX | Convert four packed single precision floating-point values in xmm2/m128 to four packed double precision floating-point values in ymm1.
EVEX.128.0F.W0 5A /r VCVTPS2PD xmm1 {k1}{z}, xmm2/m64/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert two packed single precision floating-point values in xmm2/m64/m32bcst to packed double precision floating-point values in xmm1 with writemask k1.
EVEX.256.0F.W0 5A /r VCVTPS2PD ymm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert four packed single precision floating-point values in xmm2/m128/m32bcst to packed double precision floating-point values in ymm1 with writemask k1.
EVEX.512.0F.W0 5A /r VCVTPS2PD zmm1 {k1}{z}, ymm2/m256/m32bcst {sae} | B | V/V | AVX512F OR AVX10.1¹ | Convert eight packed single precision floating-point values in ymm2/m256/m32bcst to eight packed double precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts two, four or eight packed single precision floating-point values in the source operand (second operand) to
two, four or eight packed double precision floating-point values in the destination operand (first operand).
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64-bits) register, a 256/128/64-bit memory
location or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 64-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 64-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Figure 3-14. CVTPS2PD (VEX.256 encoded version): the four single precision elements X3..X0 of SRC are each converted to double precision in the corresponding element of DEST.
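
A short illustrative C sketch (not from the SDM; assumes <immintrin.h> and SSE2): because every single precision value is exactly representable in double precision, the widening never rounds and never sets the precision flag.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* The two low floats of src widen exactly to two doubles (CVTPS2PD);
       the two high floats are ignored. */
    __m128 src = _mm_setr_ps(0.1f, -3.5f, 7.0f, 8.0f);
    __m128d dst = _mm_cvtps_pd(src);
    double out[2];
    _mm_storeu_pd(out, dst);
    /* out[0] is the exact double value of the float nearest 0.1. */
    printf("%.17g %.17g\n", out[0], out[1]);
    return 0;
}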

Operation
VCVTPS2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2PD (VEX.256 Encoded Version)


DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[191:128] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[95:64])
DEST[255:192] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[127:96])
DEST[MAXVL-1:256] := 0

VCVTPS2PD (VEX.128 Encoded Version)


DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[MAXVL-1:128] := 0

CVTPS2PD (128-bit Legacy SSE Version)


DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPS2PD __m512d _mm512_cvtps_pd( __m256 a);
VCVTPS2PD __m512d _mm512_mask_cvtps_pd( __m512d s, __mmask8 k, __m256 a);
VCVTPS2PD __m512d _mm512_maskz_cvtps_pd( __mmask8 k, __m256 a);
VCVTPS2PD __m512d _mm512_cvt_roundps_pd( __m256 a, int sae);
VCVTPS2PD __m512d _mm512_mask_cvt_roundps_pd( __m512d s, __mmask8 k, __m256 a, int sae);
VCVTPS2PD __m512d _mm512_maskz_cvt_roundps_pd( __mmask8 k, __m256 a, int sae);
VCVTPS2PD __m256d _mm256_mask_cvtps_pd( __m256d s, __mmask8 k, __m128 a);
VCVTPS2PD __m256d _mm256_maskz_cvtps_pd( __mmask8 k, __m128 a);
VCVTPS2PD __m128d _mm_mask_cvtps_pd( __m128d s, __mmask8 k, __m128 a);
VCVTPS2PD __m128d _mm_maskz_cvtps_pd( __mmask8 k, __m128 a);
VCVTPS2PD __m256d _mm256_cvtps_pd (__m128 a)
CVTPS2PD __m128d _mm_cvtps_pd (__m128 a)

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTSD2SI—Convert Scalar Double Precision Floating-Point Value to Doubleword Integer
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 2D /r CVTSD2SI r32, xmm1/m64 | A | V/V | SSE2 | Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
F2 REX.W 0F 2D /r CVTSD2SI r64, xmm1/m64 | A | V/N.E. | SSE2 | Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
VEX.LIG.F2.0F.W0 2D /r¹ VCVTSD2SI r32, xmm1/m64 | A | V/V | AVX | Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
VEX.LIG.F2.0F.W1 2D /r¹ VCVTSD2SI r64, xmm1/m64 | A | V/N.E.² | AVX | Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
EVEX.LLIG.F2.0F.W0 2D /r VCVTSD2SI r32, xmm1/m64{er} | B | V/V | AVX512F OR AVX10.1³ | Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
EVEX.LLIG.F2.0F.W1 2D /r VCVTSD2SI r64, xmm1/m64{er} | B | V/N.E.² | AVX512F OR AVX10.1³ | Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.

NOTES:
1. Software should ensure VCVTSD2SI is encoded with VEX.L=0. Encoding VCVTSD2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.
2. VEX.W1/EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a double precision floating-point value in the source operand (the second operand) to a signed double-
word integer in the destination operand (first operand). The source operand can be an XMM register or a 64-bit
memory location. The destination operand is a general-purpose register. When the source operand is an XMM
register, the double precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode
with REX.W/VEX.W/EVEX.W=0), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000H) is returned.
If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and
REX.W/VEX.W/EVEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000_00000000H) is returned.
Legacy SSE instruction: Use of the REX.W prefix promotes the instruction to produce 64-bit data in 64-bit mode.
See the summary chart at the beginning of this section for encoding data and limits.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCVTSD2SI is encoded with VEX.L=0. Encoding VCVTSD2SI with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
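
The following C sketch (illustrative only, not SDM text; assumes <immintrin.h>, SSE2, and default MXCSR settings) demonstrates the rounding and out-of-range behavior described above via the _mm_cvtsd_si32 intrinsic:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Halfway cases round to even under the default rounding mode. */
    printf("%d\n", _mm_cvtsd_si32(_mm_set_sd(2.5)));    /* 2 */
    printf("%d\n", _mm_cvtsd_si32(_mm_set_sd(3.5)));    /* 4 */
    /* 3.0e10 exceeds the signed doubleword range: the masked invalid
       exception returns the indefinite value 80000000H. */
    printf("%d\n", _mm_cvtsd_si32(_mm_set_sd(3.0e10))); /* -2147483648 */
    return 0;
}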

Operation
VCVTSD2SI (EVEX Encoded Version)
IF SRC *is register* AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
FI

(V)CVTSD2SI
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSD2SI int _mm_cvtsd_i32(__m128d);
VCVTSD2SI int _mm_cvt_roundsd_i32(__m128d, int r);
VCVTSD2SI __int64 _mm_cvtsd_i64(__m128d);
VCVTSD2SI __int64 _mm_cvt_roundsd_i64(__m128d, int r);
CVTSD2SI __int64 _mm_cvtsd_si64(__m128d);
CVTSD2SI int _mm_cvtsd_si32(__m128d a)

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTSD2SS—Convert Scalar Double Precision Floating-Point Value to Scalar Single Precision
Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 5A /r CVTSD2SS xmm1, xmm2/m64 | A | V/V | SSE2 | Convert one double precision floating-point value in xmm2/m64 to one single precision floating-point value in xmm1.
VEX.LIG.F2.0F.WIG 5A /r VCVTSD2SS xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Convert one double precision floating-point value in xmm3/m64 to one single precision floating-point value and merge with high bits in xmm2.
EVEX.LLIG.F2.0F.W1 5A /r VCVTSD2SS xmm1 {k1}{z}, xmm2, xmm3/m64{er} | C | V/V | AVX512F OR AVX10.1¹ | Convert one double precision floating-point value in xmm3/m64 to one single precision floating-point value and merge with high bits in xmm2 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts a double precision floating-point value in the “convert-from” source operand (the second operand in SSE2
version, otherwise the third operand) to a single precision floating-point value in the destination operand.
When the “convert-from” operand is an XMM register, the double precision floating-point value is contained in the
low quadword of the register. The result is stored in the low doubleword of the destination operand. When the
conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.
128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or
memory location. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged. The destination
operand is an XMM register.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM
register or a 64-bit memory location. The first source and destination operands are XMM registers. Bits (127:32) of
the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
EVEX encoded version: the converted result is written to the low doubleword element of the destination under the
writemask.
Software should ensure VCVTSD2SS is encoded with VEX.L=0. Encoding VCVTSD2SS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
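
A minimal C sketch (not part of the SDM; assumes <immintrin.h> and SSE2) of the merge behavior described above, using the legacy-form _mm_cvtsd_ss intrinsic listed below:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* The low double of b is narrowed into the low dword of the result;
       bits 127:32 come from a. 0.1 is inexact in single precision, so
       the result is rounded per MXCSR.RC. */
    __m128 a = _mm_setr_ps(9.0f, 8.0f, 7.0f, 6.0f);
    __m128d b = _mm_set_sd(0.1);
    __m128 r = _mm_cvtsd_ss(a, b);
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%.9g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 0.100000001 8 7 6 */
    return 0;
}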

Operation
VCVTSD2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VCVTSD2SS (VEX.128 Encoded Version)


DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

CVTSD2SS (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0]);
(* DEST[MAXVL-1:32] Unmodified *)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSD2SS __m128 _mm_mask_cvtsd_ss(__m128 s, __mmask8 k, __m128 a, __m128d b);
VCVTSD2SS __m128 _mm_maskz_cvtsd_ss( __mmask8 k, __m128 a,__m128d b);
VCVTSD2SS __m128 _mm_cvt_roundsd_ss(__m128 a, __m128d b, int r);
VCVTSD2SS __m128 _mm_mask_cvt_roundsd_ss(__m128 s, __mmask8 k, __m128 a, __m128d b, int r);
VCVTSD2SS __m128 _mm_maskz_cvt_roundsd_ss( __mmask8 k, __m128 a,__m128d b, int r);
CVTSD2SS __m128 _mm_cvtsd_ss(__m128 a, __m128d b)

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

CVTSI2SD—Convert Doubleword Integer to Scalar Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 2A /r CVTSI2SD xmm1, r32/m32 | A | V/V | SSE2 | Convert one signed doubleword integer from r32/m32 to one double precision floating-point value in xmm1.
F2 REX.W 0F 2A /r CVTSI2SD xmm1, r/m64 | A | V/N.E. | SSE2 | Convert one signed quadword integer from r/m64 to one double precision floating-point value in xmm1.
VEX.LIG.F2.0F.W0 2A /r VCVTSI2SD xmm1, xmm2, r/m32 | B | V/V | AVX | Convert one signed doubleword integer from r/m32 to one double precision floating-point value in xmm1.
VEX.LIG.F2.0F.W1 2A /r VCVTSI2SD xmm1, xmm2, r/m64 | B | V/N.E.¹ | AVX | Convert one signed quadword integer from r/m64 to one double precision floating-point value in xmm1.
EVEX.LLIG.F2.0F.W0 2A /r VCVTSI2SD xmm1, xmm2, r/m32 | C | V/V | AVX512F OR AVX10.1² | Convert one signed doubleword integer from r/m32 to one double precision floating-point value in xmm1.
EVEX.LLIG.F2.0F.W1 2A /r VCVTSI2SD xmm1, xmm2, r/m64{er} | C | V/N.E.¹ | AVX512F OR AVX10.1² | Convert one signed quadword integer from r/m64 to one double precision floating-point value in xmm1.

NOTES:
1. VEX.W1/EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the “convert-from”
source operand to a double precision floating-point value in the destination operand. The result is stored in the low
quadword of the destination operand, and the high quadword left unchanged. When conversion is inexact, the
value returned is rounded according to the rounding control bits in the MXCSR register.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers.
128-bit Legacy SSE version: Use of the REX.W prefix promotes the instruction to 64-bit operands. The “convert-
from” source operand (the second operand) is a general-purpose register or memory location. The destination is
an XMM register. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be a general-
purpose register or a memory location. The first source and destination operands are XMM registers. Bits (127:64)
of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
EVEX.W0 version: attempt to encode this instruction with EVEX embedded rounding is ignored.
VEX.W1 and EVEX.W1 versions: promotes the instruction to use 64-bit input value in 64-bit mode.

Software should ensure VCVTSI2SD is encoded with VEX.L=0. Encoding VCVTSI2SD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
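
An illustrative C sketch (not SDM text; assumes <immintrin.h> and SSE2) of the low-quadword/high-quadword behavior described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* (double)7 replaces the low quadword; the high quadword of a is kept.
       Every 32-bit integer is exactly representable as a double, so the
       r/m32 form never rounds; only the r/m64 form can be inexact. */
    __m128d a = _mm_setr_pd(1.25, 2.5);
    __m128d r = _mm_cvtsi32_sd(a, 7);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%g %g\n", out[0], out[1]); /* 7 2.5 */
    return 0;
}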

Operation
VCVTSI2SD (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VCVTSI2SD (VEX.128 Encoded Version)


IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

CVTSI2SD
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:0]);
ELSE
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSI2SD __m128d _mm_cvti32_sd(__m128d s, int a);
VCVTSI2SD __m128d _mm_cvti64_sd(__m128d s, __int64 a);
VCVTSI2SD __m128d _mm_cvt_roundi64_sd(__m128d s, __int64 a, int r);
CVTSI2SD __m128d _mm_cvtsi64_sd(__m128d s, __int64 a);
CVTSI2SD __m128d _mm_cvtsi32_sd(__m128d a, int b)

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” if W1; else see Table 2-22, “Type
5 Class Exception Conditions.”

EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions,” if W1; else see Table 2-61,
“Type E10NF Class Exception Conditions.”

CVTSI2SS—Convert Doubleword Integer to Scalar Single Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 2A /r CVTSI2SS xmm1, r/m32 | A | V/V | SSE | Convert one signed doubleword integer from r/m32 to one single precision floating-point value in xmm1.
F3 REX.W 0F 2A /r CVTSI2SS xmm1, r/m64 | A | V/N.E. | SSE | Convert one signed quadword integer from r/m64 to one single precision floating-point value in xmm1.
VEX.LIG.F3.0F.W0 2A /r VCVTSI2SS xmm1, xmm2, r/m32 | B | V/V | AVX | Convert one signed doubleword integer from r/m32 to one single precision floating-point value in xmm1.
VEX.LIG.F3.0F.W1 2A /r VCVTSI2SS xmm1, xmm2, r/m64 | B | V/N.E.¹ | AVX | Convert one signed quadword integer from r/m64 to one single precision floating-point value in xmm1.
EVEX.LLIG.F3.0F.W0 2A /r VCVTSI2SS xmm1, xmm2, r/m32{er} | C | V/V | AVX512F OR AVX10.1² | Convert one signed doubleword integer from r/m32 to one single precision floating-point value in xmm1.
EVEX.LLIG.F3.0F.W1 2A /r VCVTSI2SS xmm1, xmm2, r/m64{er} | C | V/N.E.¹ | AVX512F OR AVX10.1² | Convert one signed quadword integer from r/m64 to one single precision floating-point value in xmm1.

NOTES:
1. VEX.W1/EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the “convert-from”
source operand to a single precision floating-point value in the destination operand (first operand). The “convert-
from” source operand can be a general-purpose register or a memory location. The destination operand is an XMM
register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords
are left unchanged. When a conversion is inexact, the value returned is rounded according to the rounding control
bits in the MXCSR register or the embedded rounding control bits.
128-bit Legacy SSE version: In 64-bit mode, use of the REX.W prefix promotes the instruction to use a 64-bit input
value. The “convert-from” source operand (the second operand) is a general-purpose register or memory location.
Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be a general-
purpose register or a memory location. The first source and destination operands are XMM registers. Bits (127:32)
of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128)
of the destination register are zeroed.
EVEX encoded version: the converted result is written to the low doubleword element of the destination under the
writemask.
Software should ensure VCVTSI2SS is encoded with VEX.L=0. Encoding VCVTSI2SS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
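
A minimal C sketch (not SDM text; assumes <immintrin.h>, a 64-bit target for the quadword form, and default MXCSR settings) showing an inexact quadword conversion as described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 2^24 + 1 is not representable in single precision, so the
       conversion is inexact and rounds to even (16777216). */
    long long v = (1LL << 24) + 1;
    __m128 r = _mm_cvtsi64_ss(_mm_setzero_ps(), v); /* CVTSI2SS xmm, r64 */
    printf("%.1f\n", _mm_cvtss_f32(r)); /* 16777216.0 */
    return 0;
}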

Operation
VCVTSI2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VCVTSI2SS (VEX.128 Encoded Version)


IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

CVTSI2SS (128-bit Legacy SSE Version)


IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSI2SS __m128 _mm_cvti32_ss(__m128 s, int a);
VCVTSI2SS __m128 _mm_cvt_roundi32_ss(__m128 s, int a, int r);
VCVTSI2SS __m128 _mm_cvti64_ss(__m128 s, __int64 a);
VCVTSI2SS __m128 _mm_cvt_roundi64_ss(__m128 s, __int64 a, int r);
CVTSI2SS __m128 _mm_cvtsi64_ss(__m128 s, __int64 a);
CVTSI2SS __m128 _mm_cvtsi32_ss(__m128 a, int b);

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

CVTSS2SD—Convert Scalar Single Precision Floating-Point Value to Scalar Double Precision
Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 5A /r CVTSS2SD xmm1, xmm2/m32 | A | V/V | SSE2 | Convert one single precision floating-point value in xmm2/m32 to one double precision floating-point value in xmm1.
VEX.LIG.F3.0F.WIG 5A /r VCVTSS2SD xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Convert one single precision floating-point value in xmm3/m32 to one double precision floating-point value and merge with high bits of xmm2.
EVEX.LLIG.F3.0F.W0 5A /r VCVTSS2SD xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | C | V/V | AVX512F OR AVX10.1¹ | Convert one single precision floating-point value in xmm3/m32 to one double precision floating-point value and merge with high bits of xmm2 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts a single precision floating-point value in the “convert-from” source operand to a double precision floating-
point value in the destination operand. When the “convert-from” source operand is an XMM register, the single
precision floating-point value is contained in the low doubleword of the register. The result is stored in the low
quadword of the destination operand.
128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or
memory location. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged. The destination
operand is an XMM register.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM
register or a 32-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of
the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
Software should ensure VCVTSS2SD is encoded with VEX.L=0. Encoding VCVTSS2SD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
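
A short illustrative C sketch (not from the SDM; assumes <immintrin.h> and SSE2) of the merge behavior described above; the widening itself is always exact:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* The low float of b widens into the low quadword of the result;
       bits 127:64 come from a. */
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128 b = _mm_set_ss(0.1f);
    __m128d r = _mm_cvtss_sd(a, b);
    double out[2];
    _mm_storeu_pd(out, r);
    /* out[0] is the exact double value of the float nearest 0.1. */
    printf("%.17g %g\n", out[0], out[1]);
    return 0;
}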

Operation
VCVTSS2SD (EVEX Encoded Version)
IF k1[0] or *no writemask*
THEN DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC2[31:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VCVTSS2SD (VEX.128 Encoded Version)


DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC2[31:0])
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

CVTSS2SD (128-bit Legacy SSE Version)


DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0]);
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSS2SD __m128d _mm_cvt_roundss_sd(__m128d a, __m128 b, int r);
VCVTSS2SD __m128d _mm_mask_cvt_roundss_sd(__m128d s, __mmask8 m, __m128d a,__m128 b, int r);
VCVTSS2SD __m128d _mm_maskz_cvt_roundss_sd(__mmask8 k, __m128d a, __m128 b, int r);
VCVTSS2SD __m128d _mm_mask_cvtss_sd(__m128d s, __mmask8 m, __m128d a,__m128 b);
VCVTSS2SD __m128d _mm_maskz_cvtss_sd(__mmask8 m, __m128d a,__m128 b);
CVTSS2SD __m128d _mm_cvtss_sd(__m128d a, __m128 b);

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

CVTSS2SI—Convert Scalar Single Precision Floating-Point Value to Doubleword Integer
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 2D /r CVTSS2SI r32, xmm1/m32 | A | V/V | SSE | Convert one single precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.
F3 REX.W 0F 2D /r CVTSS2SI r64, xmm1/m32 | A | V/N.E. | SSE | Convert one single precision floating-point value from xmm1/m32 to one signed quadword integer in r64.
VEX.LIG.F3.0F.W0 2D /r¹ VCVTSS2SI r32, xmm1/m32 | A | V/V | AVX | Convert one single precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.
VEX.LIG.F3.0F.W1 2D /r¹ VCVTSS2SI r64, xmm1/m32 | A | V/N.E.² | AVX | Convert one single precision floating-point value from xmm1/m32 to one signed quadword integer in r64.
EVEX.LLIG.F3.0F.W0 2D /r VCVTSS2SI r32, xmm1/m32{er} | B | V/V | AVX512F OR AVX10.1³ | Convert one single precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.
EVEX.LLIG.F3.0F.W1 2D /r VCVTSS2SI r64, xmm1/m32{er} | B | V/N.E.² | AVX512F OR AVX10.1³ | Convert one single precision floating-point value from xmm1/m32 to one signed quadword integer in r64.

NOTES:
1. Software should ensure VCVTSS2SI is encoded with VEX.L=0. Encoding VCVTSS2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.
2. VEX.W1/EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a single precision floating-point value in the source operand (the second operand) to a signed doubleword
integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand). The
source operand can be an XMM register or a memory location. The destination operand is a general-purpose
register. When the source operand is an XMM register, the single precision floating-point value is contained in the
low doubleword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to produce 64-bit data.
See the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTSS2SI is encoded with VEX.L=0. Encoding VCVTSS2SI with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
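
The following C sketch (illustrative only; assumes <immintrin.h>, SSE, and default MXCSR settings) demonstrates the rounding and indefinite-value behavior described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Halfway cases round to even under the default rounding mode. */
    printf("%d\n", _mm_cvtss_si32(_mm_set_ss(0.5f))); /* 0 */
    printf("%d\n", _mm_cvtss_si32(_mm_set_ss(1.5f))); /* 2 */
    /* 2^31 is one past the int32 maximum: the masked invalid exception
       returns the indefinite value 80000000H. */
    printf("%d\n", _mm_cvtss_si32(_mm_set_ss(2147483648.0f)));
    return 0;
}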

Operation
VCVTSS2SI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
FI;

(V)CVTSS2SI (Legacy and VEX.128 Encoded Version)


IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSS2SI int _mm_cvtss_i32( __m128 a);
VCVTSS2SI int _mm_cvt_roundss_i32( __m128 a, int r);
VCVTSS2SI __int64 _mm_cvtss_i64( __m128 a);
VCVTSS2SI __int64 _mm_cvt_roundss_i64( __m128 a, int r);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

CVTTPD2DQ—Convert with Truncation Packed Double Precision Floating-Point Values to
Packed Doubleword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F E6 /r CVTTPD2DQ xmm1, xmm2/m128 | A | V/V | SSE2 | Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation.
VEX.128.66.0F.WIG E6 /r VCVTTPD2DQ xmm1, xmm2/m128 | A | V/V | AVX | Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation.
VEX.256.66.0F.WIG E6 /r VCVTTPD2DQ xmm1, ymm2/m256 | A | V/V | AVX | Convert four packed double precision floating-point values in ymm2/mem to four signed doubleword integers in xmm1 using truncation.
EVEX.128.66.0F.W1 E6 /r VCVTTPD2DQ xmm1 {k1}{z}, xmm2/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two signed doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.256.66.0F.W1 E6 /r VCVTTPD2DQ xmm1 {k1}{z}, ymm2/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four signed doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.512.66.0F.W1 E6 /r VCVTTPD2DQ ymm1 {k1}{z}, zmm2/m512/m64bcst {sae} | B | V/V | AVX512F OR AVX10.1¹ | Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight signed doubleword integers in ymm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts two, four or eight packed double precision floating-point values in the source operand (second operand)
to two, four or eight packed signed doubleword integers in the destination operand (first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value (80000000H) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of
the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Figure 3-15. VCVTTPD2DQ (VEX.256 encoded version): the four double precision elements X3..X0 of SRC are truncated to doubleword integers packed into the low half of DEST; the upper half of DEST is zeroed.
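
A minimal C sketch (not SDM text; assumes <immintrin.h> and SSE2) of the truncating conversion described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Truncation rounds toward zero regardless of MXCSR.RC. */
    __m128d src = _mm_setr_pd(-1.7, 2.9);
    __m128i dst = _mm_cvttpd_epi32(src); /* CVTTPD2DQ */
    int out[4];
    _mm_storeu_si128((__m128i *)out, dst);
    /* -1 2 0 0: two truncated values, then the zeroed upper dwords. */
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}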

Operation
VCVTTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTTPD2DQ (VEX.256 Encoded Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[95:64] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[191:128])
DEST[127:96] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[255:192])
DEST[MAXVL-1:128] := 0

VCVTTPD2DQ (VEX.128 Encoded Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[MAXVL-1:64] := 0

CVTTPD2DQ (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[127:64] := 0
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPD2DQ __m256i _mm512_cvttpd_epi32( __m512d a);
VCVTTPD2DQ __m256i _mm512_mask_cvttpd_epi32( __m256i s, __mmask8 k, __m512d a);
VCVTTPD2DQ __m256i _mm512_maskz_cvttpd_epi32( __mmask8 k, __m512d a);
VCVTTPD2DQ __m256i _mm512_cvtt_roundpd_epi32( __m512d a, int sae);
VCVTTPD2DQ __m256i _mm512_mask_cvtt_roundpd_epi32( __m256i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2DQ __m256i _mm512_maskz_cvtt_roundpd_epi32( __mmask8 k, __m512d a, int sae);
VCVTTPD2DQ __m128i _mm256_mask_cvttpd_epi32( __m128i s, __mmask8 k, __m256d a);
VCVTTPD2DQ __m128i _mm256_maskz_cvttpd_epi32( __mmask8 k, __m256d a);
VCVTTPD2DQ __m128i _mm_mask_cvttpd_epi32( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2DQ __m128i _mm_maskz_cvttpd_epi32( __mmask8 k, __m128d a);
VCVTTPD2DQ __m128i _mm256_cvttpd_epi32 (__m256d src);
CVTTPD2DQ __m128i _mm_cvttpd_epi32 (__m128d src);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTTPS2DQ—Convert With Truncation Packed Single Precision Floating-Point Values to Packed
Signed Doubleword Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 5B /r CVTTPS2DQ xmm1, xmm2/m128 | A | V/V | SSE2 | Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation.
VEX.128.F3.0F.WIG 5B /r VCVTTPS2DQ xmm1, xmm2/m128 | A | V/V | AVX | Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation.
VEX.256.F3.0F.WIG 5B /r VCVTTPS2DQ ymm1, ymm2/m256 | A | V/V | AVX | Convert eight packed single precision floating-point values from ymm2/mem to eight packed signed doubleword values in ymm1 using truncation.
EVEX.128.F3.0F.W0 5B /r VCVTTPS2DQ xmm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | AVX512VL AND AVX512F | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed doubleword values in xmm1 using truncation subject to writemask k1.
EVEX.256.F3.0F.W0 5B /r VCVTTPS2DQ ymm1 {k1}{z}, ymm2/m256/m32bcst | B | V/V | AVX512VL AND AVX512F | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed doubleword values in ymm1 using truncation subject to writemask k1.
EVEX.512.F3.0F.W0 5B /r VCVTTPS2DQ zmm1 {k1}{z}, zmm2/m512/m32bcst {sae} | B | V/V | AVX512F | Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to sixteen packed signed doubleword values in zmm1 using truncation subject to writemask k1.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts four, eight or sixteen packed single precision floating-point values in the source operand to four, eight or
sixteen signed doubleword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value (80000000H) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or
a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
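
An illustrative C sketch (not part of the SDM; assumes <immintrin.h> and SSE2) contrasting this truncating form with CVTPS2DQ, which rounds per MXCSR.RC:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 src = _mm_setr_ps(2.5f, -2.5f, 0.9f, -0.9f);
    int out[4];
    /* CVTTPS2DQ always chops toward zero. */
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(src));
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 2 -2 0 0 */
    return 0;
}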

Operation
VCVTTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPS2DQ (VEX.256 Encoded Version)


DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[159:128] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[159:128])
DEST[191:160] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[191:160])
DEST[223:192] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[223:192])
DEST[255:224] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[255:224])

VCVTTPS2DQ (VEX.128 Encoded Version)
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[MAXVL-1:128] := 0

CVTTPS2DQ (128-bit Legacy SSE Version)


DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[MAXVL-1:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPS2DQ __m512i _mm512_cvttps_epi32( __m512 a);
VCVTTPS2DQ __m512i _mm512_mask_cvttps_epi32( __m512i s, __mmask16 k, __m512 a);
VCVTTPS2DQ __m512i _mm512_maskz_cvttps_epi32( __mmask16 k, __m512 a);
VCVTTPS2DQ __m512i _mm512_cvtt_roundps_epi32( __m512 a, int sae);
VCVTTPS2DQ __m512i _mm512_mask_cvtt_roundps_epi32( __m512i s, __mmask16 k, __m512 a, int sae);
VCVTTPS2DQ __m512i _mm512_maskz_cvtt_roundps_epi32( __mmask16 k, __m512 a, int sae);
VCVTTPS2DQ __m256i _mm256_mask_cvttps_epi32( __m256i s, __mmask8 k, __m256 a);
VCVTTPS2DQ __m256i _mm256_maskz_cvttps_epi32( __mmask8 k, __m256 a);
VCVTTPS2DQ __m128i _mm_mask_cvttps_epi32( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2DQ __m128i _mm_maskz_cvttps_epi32( __mmask8 k, __m128 a);
VCVTTPS2DQ __m256i _mm256_cvttps_epi32 (__m256 a)
CVTTPS2DQ __m128i _mm_cvttps_epi32 (__m128 a)
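
As an informative sketch (not part of the architectural specification), the following C fragment, assuming a compiler providing <immintrin.h> and SSE2 support, shows the truncating conversion and the indefinite integer result for an out-of-range element:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 src = _mm_setr_ps(1.9f, -2.9f, 0.5f, 3.0e10f); /* element 3 exceeds the int32 range */
    __m128i dst = _mm_cvttps_epi32(src);                  /* CVTTPS2DQ: round toward zero */
    int out[4];
    _mm_storeu_si128((__m128i *)out, dst);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* prints: 1 -2 0 -2147483648 */
    return 0;
}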

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

CVTTSD2SI—Convert With Truncation Scalar Double Precision Floating-Point Value to Signed
Integer
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 2C /r A V/V SSE2 Convert one double precision floating-point value
CVTTSD2SI r32, xmm1/m64 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
F2 REX.W 0F 2C /r A V/N.E. SSE2 Convert one double precision floating-point value
CVTTSD2SI r64, xmm1/m64 from xmm1/m64 to one signed quadword integer in
r64 using truncation.
VEX.LIG.F2.0F.W0 2C /r 1 A V/V AVX Convert one double precision floating-point value
VCVTTSD2SI r32, xmm1/m64 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
VEX.LIG.F2.0F.W1 2C /r 1 B V/N.E.2 AVX Convert one double precision floating-point value
VCVTTSD2SI r64, xmm1/m64 from xmm1/m64 to one signed quadword integer in
r64 using truncation.
EVEX.LLIG.F2.0F.W0 2C /r B V/V AVX512F Convert one double precision floating-point value
VCVTTSD2SI r32, xmm1/m64{sae} OR AVX10.13 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
EVEX.LLIG.F2.0F.W1 2C /r B V/N.E.2 AVX512F Convert one double precision floating-point value
VCVTTSD2SI r64, xmm1/m64{sae} OR AVX10.13 from xmm1/m64 to one signed quadword integer in
r64 using truncation.

NOTES:
1. Software should ensure VCVTTSD2SI is encoded with VEX.L=0. Encoding VCVTTSD2SI with VEX.L=1 may encounter unpredictable
behavior across different processor generations.
2. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a double precision floating-point value in the source operand (the second operand) to a signed double-
word integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand).
The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general
purpose register. When the source operand is an XMM register, the double precision floating-point value is
contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode
with REX.W/VEX.W/EVEX.W=0), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000H) is returned.
If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and
REX.W/VEX.W/EVEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000_00000000H) is returned.

Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See
the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promote the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTTSD2SI is encoded with VEX.L=0. Encoding VCVTTSD2SI with VEX.L=1 may
encounter unpredictable behavior across different processor generations.

Operation
(V)CVTTSD2SI (All Versions)
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSD2SI int _mm_cvttsd_i32( __m128d a);
VCVTTSD2SI int _mm_cvtt_roundsd_i32( __m128d a, int sae);
VCVTTSD2SI __int64 _mm_cvttsd_i64( __m128d a);
VCVTTSD2SI __int64 _mm_cvtt_roundsd_i64( __m128d a, int sae);
CVTTSD2SI int _mm_cvttsd_si32( __m128d a);
CVTTSD2SI __int64 _mm_cvttsd_si64( __m128d a);
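
A minimal sketch in C contrasting the truncating form with the MXCSR-rounded form (informative; assumes <immintrin.h> and SSE2 support):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_sd(-3.7);
    int t = _mm_cvttsd_si32(a); /* CVTTSD2SI: truncation, yields -3 */
    int r = _mm_cvtsd_si32(a);  /* CVTSD2SI: MXCSR rounding (round-to-nearest by default), yields -4 */
    printf("%d %d\n", t, r);
    return 0;
}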

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

CVTTSS2SI—Convert With Truncation Scalar Single Precision Floating-Point Value to Signed
Integer
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 2C /r A V/V SSE Convert one single precision floating-point value from
CVTTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32
using truncation.
F3 REX.W 0F 2C /r A V/N.E. SSE Convert one single precision floating-point value from
CVTTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64
using truncation.
VEX.LIG.F3.0F.W0 2C /r 1 A V/V AVX Convert one single precision floating-point value from
VCVTTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32
using truncation.
VEX.LIG.F3.0F.W1 2C /r 1 A V/N.E.2 AVX Convert one single precision floating-point value from
VCVTTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64
using truncation.
EVEX.LLIG.F3.0F.W0 2C /r B V/V AVX512F Convert one single precision floating-point value from
VCVTTSS2SI r32, xmm1/m32{sae} OR AVX10.13 xmm1/m32 to one signed doubleword integer in r32
using truncation.
EVEX.LLIG.F3.0F.W1 2C /r B V/N.E.2 AVX512F Convert one single precision floating-point value from
VCVTTSS2SI r64, xmm1/m32{sae} OR AVX10.13 xmm1/m32 to one signed quadword integer in r64
using truncation.

NOTES:
1. Software should ensure VCVTTSS2SI is encoded with VEX.L=0. Encoding VCVTTSS2SI with VEX.L=1 may encounter unpredictable
behavior across different processor generations.
2. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a single precision floating-point value in the source operand (the second operand) to a signed doubleword
integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand). The
source operand can be an XMM register or a 32-bit memory location. The destination operand is a general purpose
register. When the source operand is an XMM register, the single precision floating-point value is contained in the
low doubleword of the register.
When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result is larger
than the maximum signed doubleword integer, the floating-point invalid exception is raised. If this exception is
masked, the indefinite integer value (80000000H or 80000000_00000000H if operand size is 64 bits) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See
the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promote the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCVTTSS2SI is encoded with VEX.L=0. Encoding VCVTTSS2SI with VEX.L=1 may
encounter unpredictable behavior across different processor generations.

Operation
(V)CVTTSS2SI (All Versions)
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSS2SI int _mm_cvttss_i32( __m128 a);
VCVTTSS2SI int _mm_cvtt_roundss_i32( __m128 a, int sae);
VCVTTSS2SI __int64 _mm_cvttss_i64( __m128 a);
VCVTTSS2SI __int64 _mm_cvtt_roundss_i64( __m128 a, int sae);
CVTTSS2SI int _mm_cvttss_si32( __m128 a);
CVTTSS2SI __int64 _mm_cvttss_si64( __m128 a);
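
A minimal sketch of the 32-bit versus 64-bit destination forms (informative; assumes 64-bit mode and <immintrin.h>; _mm_cvttss_si64 is available only when compiling for x86-64):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(4.0e9f);    /* exactly representable; fits in int64 but not int32 */
    long long q = _mm_cvttss_si64(a); /* W1 form: 64-bit result, 4000000000 */
    int d = _mm_cvttss_si32(a);       /* 32-bit form: out of range, returns 80000000H */
    printf("%lld %d\n", q, d);        /* prints: 4000000000 -2147483648 */
    return 0;
}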

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
See Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

DIVPD—Divide Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 5E /r A V/V SSE2 Divide packed double precision floating-point
DIVPD xmm1, xmm2/m128 values in xmm1 by packed double precision
floating-point values in xmm2/mem.
VEX.128.66.0F.WIG 5E /r B V/V AVX Divide packed double precision floating-point
VDIVPD xmm1, xmm2, xmm3/m128 values in xmm2 by packed double precision
floating-point values in xmm3/mem.
VEX.256.66.0F.WIG 5E /r B V/V AVX Divide packed double precision floating-point
VDIVPD ymm1, ymm2, ymm3/m256 values in ymm2 by packed double precision
floating-point values in ymm3/mem.
EVEX.128.66.0F.W1 5E /r C V/V (AVX512VL AND Divide packed double precision floating-point
VDIVPD xmm1 {k1}{z}, xmm2, AVX512F) OR values in xmm2 by packed double precision
xmm3/m128/m64bcst AVX10.11 floating-point values in xmm3/m128/m64bcst and
write results to xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 5E /r C V/V (AVX512VL AND Divide packed double precision floating-point
VDIVPD ymm1 {k1}{z}, ymm2, AVX512F) OR values in ymm2 by packed double precision
ymm3/m256/m64bcst AVX10.11 floating-point values in ymm3/m256/m64bcst and
write results to ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 5E /r C V/V AVX512F Divide packed double precision floating-point
VDIVPD zmm1 {k1}{z}, zmm2, OR AVX10.11 values in zmm2 by packed double precision
zmm3/m512/m64bcst{er} floating-point values in zmm3/m512/m64bcst and
write results to zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD divide of the double precision floating-point values in the first source operand by the floating-
point values in the second source operand (the third operand). Results are written to the destination operand (the
first operand).
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand (the second operand) is a YMM register. The second source
operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper
bits (MAXVL-1:256) of the corresponding destination are zeroed.
VEX.128 encoded version: The first source operand (the second operand) is a XMM register. The second source
operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper
bits (MAXVL-1:128) of the corresponding destination are zeroed.



128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
destination is the same as the first source operand. The upper bits (MAXVL-1:128) of the corresponding destination
are unmodified.

Operation
VDIVPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] / SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] / SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VDIVPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] / SRC2[63:0]
DEST[127:64] := SRC1[127:64] / SRC2[127:64]
DEST[191:128] := SRC1[191:128] / SRC2[191:128]
DEST[255:192] := SRC1[255:192] / SRC2[255:192]
DEST[MAXVL-1:256] := 0;

VDIVPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] / SRC2[63:0]
DEST[127:64] := SRC1[127:64] / SRC2[127:64]
DEST[MAXVL-1:128] := 0;

DIVPD (128-bit Legacy SSE Version)


DEST[63:0] := SRC1[63:0] / SRC2[63:0]
DEST[127:64] := SRC1[127:64] / SRC2[127:64]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent
VDIVPD __m512d _mm512_div_pd( __m512d a, __m512d b);
VDIVPD __m512d _mm512_mask_div_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VDIVPD __m512d _mm512_maskz_div_pd( __mmask8 k, __m512d a, __m512d b);
VDIVPD __m256d _mm256_mask_div_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VDIVPD __m256d _mm256_maskz_div_pd( __mmask8 k, __m256d a, __m256d b);
VDIVPD __m128d _mm_mask_div_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VDIVPD __m128d _mm_maskz_div_pd( __mmask8 k, __m128d a, __m128d b);
VDIVPD __m512d _mm512_div_round_pd( __m512d a, __m512d b, int);
VDIVPD __m512d _mm512_mask_div_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VDIVPD __m512d _mm512_maskz_div_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VDIVPD __m256d _mm256_div_pd (__m256d a, __m256d b);
DIVPD __m128d _mm_div_pd (__m128d a, __m128d b);
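
A minimal merging-masking sketch (informative; assumes AVX-512F support at compile time and run time; the function name is illustrative):

#include <immintrin.h>

__m512d masked_div(__m512d s, __m512d a, __m512d b)
{
    /* VDIVPD zmm1 {k1}, zmm2, zmm3: elements whose k-bit is 0 are copied from s */
    __mmask8 k = 0x0F; /* divide only the low four double precision elements */
    return _mm512_mask_div_pd(s, k, a, b);
}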

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



DIVPS—Divide Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 5E /r A V/V SSE Divide packed single precision floating-point values
DIVPS xmm1, xmm2/m128 in xmm1 by packed single precision floating-point
values in xmm2/mem.
VEX.128.0F.WIG 5E /r B V/V AVX Divide packed single precision floating-point values
VDIVPS xmm1, xmm2, xmm3/m128 in xmm2 by packed single precision floating-point
values in xmm3/mem.
VEX.256.0F.WIG 5E /r B V/V AVX Divide packed single precision floating-point values
VDIVPS ymm1, ymm2, ymm3/m256 in ymm2 by packed single precision floating-point
values in ymm3/mem.
EVEX.128.0F.W0 5E /r C V/V (AVX512VL AND Divide packed single precision floating-point values
VDIVPS xmm1 {k1}{z}, xmm2, AVX512F) OR in xmm2 by packed single precision floating-point
xmm3/m128/m32bcst AVX10.11 values in xmm3/m128/m32bcst and write results
to xmm1 subject to writemask k1.
EVEX.256.0F.W0 5E /r C V/V (AVX512VL AND Divide packed single precision floating-point values
VDIVPS ymm1 {k1}{z}, ymm2, AVX512F) OR in ymm2 by packed single precision floating-point
ymm3/m256/m32bcst AVX10.11 values in ymm3/m256/m32bcst and write results to
ymm1 subject to writemask k1.
EVEX.512.0F.W0 5E /r C V/V AVX512F Divide packed single precision floating-point values
VDIVPS zmm1 {k1}{z}, zmm2, OR AVX10.11 in zmm2 by packed single precision floating-point
zmm3/m512/m32bcst{er} values in zmm3/m512/m32bcst and write results to
zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD divide of the four, eight or sixteen packed single precision floating-point values in the first source
operand (the second operand) by the four, eight or sixteen packed single precision floating-point values in the
second source operand (the third operand). Results are written to the destination operand (the first operand).
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.



128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
VDIVPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] / SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] / SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VDIVPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] / SRC2[31:0]
DEST[63:32] := SRC1[63:32] / SRC2[63:32]
DEST[95:64] := SRC1[95:64] / SRC2[95:64]
DEST[127:96] := SRC1[127:96] / SRC2[127:96]
DEST[159:128] := SRC1[159:128] / SRC2[159:128]
DEST[191:160] := SRC1[191:160] / SRC2[191:160]
DEST[223:192] := SRC1[223:192] / SRC2[223:192]
DEST[255:224] := SRC1[255:224] / SRC2[255:224]
DEST[MAXVL-1:256] := 0;

VDIVPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] / SRC2[31:0]
DEST[63:32] := SRC1[63:32] / SRC2[63:32]
DEST[95:64] := SRC1[95:64] / SRC2[95:64]
DEST[127:96] := SRC1[127:96] / SRC2[127:96]
DEST[MAXVL-1:128] := 0



DIVPS (128-bit Legacy SSE Version)
DEST[31:0] := SRC1[31:0] / SRC2[31:0]
DEST[63:32] := SRC1[63:32] / SRC2[63:32]
DEST[95:64] := SRC1[95:64] / SRC2[95:64]
DEST[127:96] := SRC1[127:96] / SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VDIVPS __m512 _mm512_div_ps( __m512 a, __m512 b);
VDIVPS __m512 _mm512_mask_div_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VDIVPS __m512 _mm512_maskz_div_ps(__mmask16 k, __m512 a, __m512 b);
VDIVPS __m256 _mm256_mask_div_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VDIVPS __m256 _mm256_maskz_div_ps( __mmask8 k, __m256 a, __m256 b);
VDIVPS __m128 _mm_mask_div_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VDIVPS __m128 _mm_maskz_div_ps( __mmask8 k, __m128 a, __m128 b);
VDIVPS __m512 _mm512_div_round_ps( __m512 a, __m512 b, int);
VDIVPS __m512 _mm512_mask_div_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VDIVPS __m512 _mm512_maskz_div_round_ps(__mmask16 k, __m512 a, __m512 b, int);
VDIVPS __m256 _mm256_div_ps (__m256 a, __m256 b);
DIVPS __m128 _mm_div_ps (__m128 a, __m128 b);
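
A minimal sketch of the EVEX.512 embedded-rounding ({er}) form, which overrides MXCSR.RC for this instruction only (informative; assumes AVX-512F; the function name is illustrative):

#include <immintrin.h>

__m512 div_round_to_zero(__m512 a, __m512 b)
{
    /* static round-toward-zero with all floating-point exceptions suppressed */
    return _mm512_div_round_ps(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}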

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



DIVSD—Divide Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 5E /r A V/V SSE2 Divide low double precision floating-point value in
DIVSD xmm1, xmm2/m64 xmm1 by low double precision floating-point value
in xmm2/m64.
VEX.LIG.F2.0F.WIG 5E /r B V/V AVX Divide low double precision floating-point value in
VDIVSD xmm1, xmm2, xmm3/m64 xmm2 by low double precision floating-point value
in xmm3/m64.
EVEX.LLIG.F2.0F.W1 5E /r C V/V AVX512F Divide low double precision floating-point value in
VDIVSD xmm1 {k1}{z}, xmm2, OR AVX10.11 xmm2 by low double precision floating-point value
xmm3/m64{er} in xmm3/m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Divides the low double precision floating-point value in the first source operand by the low double precision
floating-point value in the second source operand, and stores the double precision floating-point result in the desti-
nation operand. The second source operand can be an XMM register or a 64-bit memory location. The first source
and destination are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:64) of the corresponding ZMM destination register remain unchanged.
VEX.128 encoded version: The first source operand is an xmm register encoded by VEX.vvvv. The quadword at bits
127:64 of the destination operand is copied from the corresponding quadword of the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX.128 encoded version: The first source operand is an xmm register encoded by EVEX.vvvv. The quadword
element of the destination operand at bits 127:64 are copied from the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low quadword element of the destination is updated according to the writemask.
Software should ensure VDIVSD is encoded with VEX.L=0. Encoding VDIVSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
VDIVSD (EVEX Encoded Version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC1[63:0] / SRC2[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VDIVSD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] / SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

DIVSD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] / SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VDIVSD __m128d _mm_mask_div_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VDIVSD __m128d _mm_maskz_div_sd( __mmask8 k, __m128d a, __m128d b);
VDIVSD __m128d _mm_div_round_sd( __m128d a, __m128d b, int);
VDIVSD __m128d _mm_mask_div_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VDIVSD __m128d _mm_maskz_div_round_sd( __mmask8 k, __m128d a, __m128d b, int);
DIVSD __m128d _mm_div_sd (__m128d a, __m128d b);
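
A minimal sketch showing that only the low element is divided while the upper element passes through from the first operand (informative; assumes <immintrin.h> and SSE2 support):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_setr_pd(10.0, 99.0);
    __m128d b = _mm_setr_pd(4.0, 7.0);
    __m128d c = _mm_div_sd(a, b); /* DIVSD: c[0] = 10.0/4.0; c[1] copied from a */
    double out[2];
    _mm_storeu_pd(out, c);
    printf("%f %f\n", out[0], out[1]); /* prints: 2.500000 99.000000 */
    return 0;
}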

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



DIVSS—Divide Scalar Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 5E /r A V/V SSE Divide low single precision floating-point value in
DIVSS xmm1, xmm2/m32 xmm1 by low single precision floating-point value in
xmm2/m32.
VEX.LIG.F3.0F.WIG 5E /r B V/V AVX Divide low single precision floating-point value in
VDIVSS xmm1, xmm2, xmm3/m32 xmm2 by low single precision floating-point value in
xmm3/m32.
EVEX.LLIG.F3.0F.W0 5E /r C V/V AVX512F Divide low single precision floating-point value in
VDIVSS xmm1 {k1}{z}, xmm2, OR AVX10.11 xmm2 by low single precision floating-point value in
xmm3/m32{er} xmm3/m32.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Divides the low single precision floating-point value in the first source operand by the low single precision floating-
point value in the second source operand, and stores the single precision floating-point result in the destination
operand. The second source operand can be an XMM register or a 32-bit memory location.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:32) of the corresponding ZMM destination register remain unchanged.
VEX.128 encoded version: The first source operand is an xmm register encoded by VEX.vvvv. The three high-order
doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128) of the desti-
nation register are zeroed.
EVEX.128 encoded version: The first source operand is an xmm register encoded by EVEX.vvvv. The doubleword
elements of the destination operand at bits 127:32 are copied from the first source operand. Bits (MAXVL-1:128)
of the destination register are zeroed.
EVEX version: The low doubleword element of the destination is updated according to the writemask.
Software should ensure VDIVSS is encoded with VEX.L=0. Encoding VDIVSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
VDIVSS (EVEX Encoded Version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC1[31:0] / SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VDIVSS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] / SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

DIVSS (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] / SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VDIVSS __m128 _mm_mask_div_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VDIVSS __m128 _mm_maskz_div_ss( __mmask8 k, __m128 a, __m128 b);
VDIVSS __m128 _mm_div_round_ss( __m128 a, __m128 b, int);
VDIVSS __m128 _mm_mask_div_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VDIVSS __m128 _mm_maskz_div_round_ss( __mmask8 k, __m128 a, __m128 b, int);
DIVSS __m128 _mm_div_ss(__m128 a, __m128 b);
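
A minimal sketch of the EVEX masked scalar form, in which the low element is updated under writemask and bits 127:32 come from the first source (informative; assumes AVX-512F; the function name is illustrative):

#include <immintrin.h>

__m128 masked_divss(__m128 s, __mmask8 k, __m128 a, __m128 b)
{
    /* VDIVSS xmm1 {k1}, xmm2, xmm3/m32: if k[0] = 0, the low element merges from s */
    return _mm_mask_div_ss(s, k, a, b);
}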

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



EXTRACTPS—Extract Packed Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 3A 17 /r ib A V/V SSE4_1 Extract one single precision floating-point value
EXTRACTPS reg/m32, xmm1, imm8 from xmm1 at the offset specified by imm8 and
store the result in reg or m32. Zero extend the
results in 64-bit register if applicable.
VEX.128.66.0F3A.WIG 17 /r ib A V/V AVX Extract one single precision floating-point value
VEXTRACTPS reg/m32, xmm1, imm8 from xmm1 at the offset specified by imm8 and
store the result in reg or m32. Zero extend the
results in 64-bit register if applicable.
EVEX.128.66.0F3A.WIG 17 /r ib B V/V AVX512F Extract one single precision floating-point value
VEXTRACTPS reg/m32, xmm1, imm8 OR AVX10.11 from xmm1 at the offset specified by imm8 and
store the result in reg or m32. Zero extend the
results in 64-bit register if applicable.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
B Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
Extracts a single precision floating-point value from the source operand (second operand) at the 32-bit offset
specified by imm8. Immediate bits higher than the most significant offset for the vector length are ignored.
The extracted single precision floating-point value is stored in the low 32 bits of the destination operand.
In 64-bit mode, the destination register operand has a default operand size of 64 bits. The upper 32 bits of the
register are filled with zero. REX.W is ignored.
VEX.128 and EVEX encoded version: When VEX.W1 or EVEX.W1 form is used in 64-bit mode with a general
purpose register (GPR) as a destination operand, the packed single quantity is zero extended to 64 bits.
VEX.vvvv/EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
128-bit Legacy SSE version: When a REX.W prefix is used in 64-bit mode with a general purpose register (GPR) as
a destination operand, the packed single quantity is zero extended to 64 bits.
The source register is an XMM register. Imm8[1:0] determine the starting DWORD offset from which to extract the
32-bit floating-point value.
An attempt to execute VEXTRACTPS encoded with VEX.L = 1 will cause an #UD exception.



Operation
VEXTRACTPS (EVEX and VEX.128 Encoded Version)
SRC_OFFSET := IMM8[1:0]
IF (64-Bit Mode and DEST is register)
DEST[31:0] := (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
DEST[63:32] := 0
ELSE
DEST[31:0] := (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
FI

EXTRACTPS (128-bit Legacy SSE Version)


SRC_OFFSET := IMM8[1:0]
IF (64-Bit Mode and DEST is register)
DEST[31:0] := (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
DEST[63:32] := 0
ELSE
DEST[31:0] := (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
FI

Intel C/C++ Compiler Intrinsic Equivalent


EXTRACTPS int _mm_extract_ps (__m128 a, const int nidx);
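
A minimal sketch in C; note that the intrinsic returns the raw bit pattern, so recovering the float value requires a bit-wise copy (informative; assumes SSE4.1 support):

#include <immintrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    __m128 v = _mm_setr_ps(0.5f, 1.5f, 2.5f, 3.5f);
    int bits = _mm_extract_ps(v, 2); /* EXTRACTPS: raw bits of element 2 */
    float f;
    memcpy(&f, &bits, sizeof f);     /* reinterpret the bit pattern as a float */
    printf("%f\n", f);               /* prints: 2.500000 */
    return 0;
}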

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instructions, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



GF2P8AFFINEINVQB—Galois Field Affine Transformation Inverse
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F3A CF /r /ib A V/V GFNI Computes inverse affine transformation in the
GF2P8AFFINEINVQB xmm1, finite field GF(2^8).
xmm2/m128, imm8
VEX.128.66.0F3A.W1 CF /r /ib B V/V AVX Computes inverse affine transformation in the
VGF2P8AFFINEINVQB xmm1, xmm2, GFNI finite field GF(2^8).
xmm3/m128, imm8
VEX.256.66.0F3A.W1 CF /r /ib B V/V AVX Computes inverse affine transformation in the
VGF2P8AFFINEINVQB ymm1, ymm2, GFNI finite field GF(2^8).
ymm3/m256, imm8
EVEX.128.66.0F3A.W1 CF /r /ib C V/V (AVX512VL Computes inverse affine transformation in the
VGF2P8AFFINEINVQB xmm1{k1}{z}, OR AVX10.11) finite field GF(2^8).
xmm2, xmm3/m128/m64bcst, imm8 GFNI
EVEX.256.66.0F3A.W1 CF /r /ib C V/V (AVX512VL Computes inverse affine transformation in the
VGF2P8AFFINEINVQB ymm1{k1}{z}, OR AVX10.11) finite field GF(2^8).
ymm2, ymm3/m256/m64bcst, imm8 GFNI
EVEX.512.66.0F3A.W1 CF /r /ib C V/V (AVX512F Computes inverse affine transformation in the
VGF2P8AFFINEINVQB zmm1{k1}{z}, OR AVX10.11) finite field GF(2^8).
zmm2, zmm3/m512/m64bcst, imm8 GFNI

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 (r) N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
The AFFINEINVB instruction computes an affine transformation in the Galois field GF(2^8). For this instruction, an affine
transformation is defined by A * inv(x) + b where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. The
inverse of the bytes in x is defined with respect to the reduction polynomial x^8 + x^4 + x^3 + x + 1.
One SIMD register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register
or memory operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x”
values in the first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.
The inverse of each byte is given by the following table. The upper nibble is on the vertical axis and the lower nibble
is on the horizontal axis. For example, the inverse of 0x95 is 0x8A.



Table 3-59. Inverse Byte Listings
- 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 0 1 8D F6 CB 52 7B D1 E8 4F 29 C0 B0 E1 E5 C7
1 74 B4 AA 4B 99 2B 60 5F 58 3F FD CC FF 40 EE B2
2 3A 6E 5A F1 55 4D A8 C9 C1 A 98 15 30 44 A2 C2
3 2C 45 92 6C F3 39 66 42 F2 35 20 6F 77 BB 59 19
4 1D FE 37 67 2D 31 F5 69 A7 64 AB 13 54 25 E9 9
5 ED 5C 5 CA 4C 24 87 BF 18 3E 22 F0 51 EC 61 17
6 16 5E AF D3 49 A6 36 43 F4 47 91 DF 33 93 21 3B
7 79 B7 97 85 10 B5 BA 3C B6 70 D0 6 A1 FA 81 82
8 83 7E 7F 80 96 73 BE 56 9B 9E 95 D9 F7 2 B9 A4
9 DE 6A 32 6D D8 8A 84 72 2A 14 9F 88 F9 DC 89 9A
A FB 7C 2E C3 8F B8 65 48 26 C8 12 4A CE E7 D2 62
B C E0 1F EF 11 75 78 71 A5 8E 76 3D BD BC 86 57
C B 28 2F A3 DA D4 E4 F A9 27 53 4 1B FC AC E6
D 7A 7 AE 63 C5 DB E2 EA 94 8B C4 D5 9D F8 90 6B
E B1 D D6 EB C6 E CF AD 8 4E D7 E3 5D 50 1E B3
F 5B 23 38 34 68 46 3 8C DD 9C 7D A0 CD 1A 41 1C

Operation
define affine_inverse_byte(tsrc2qw, src1byte, imm):
FOR i := 0 to 7:
* parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
* inverse(x) is defined in the table above *
retbyte.bit[i] := parity(tsrc2qw.byte[7-i] AND inverse(src1byte)) XOR imm8.bit[i]
return retbyte

VGF2P8AFFINEINVQB dest, src1, src2, imm8 (EVEX Encoded Version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC2 is memory and EVEX.b==1:
tsrc2 := SRC2.qword[0]
ELSE:
tsrc2 := SRC2.qword[j]

FOR b := 0 to 7:
IF k1[j*8+b] OR *no writemask*:
DEST.qword[j].byte[b] := affine_inverse_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] := 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] := 0



VGF2P8AFFINEINVQB dest, src1, src2, imm8 (128b and 256b VEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256)
FOR j := 0 TO KL-1:
FOR b := 0 to 7:
DEST.qword[j].byte[b] := affine_inverse_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] := 0

GF2P8AFFINEINVQB srcdest, src1, imm8 (128b SSE Encoded Version)


FOR j := 0 TO 1:
FOR b := 0 to 7:
SRCDEST.qword[j].byte[b] := affine_inverse_byte(SRC1.qword[j], SRCDEST.qword[j].byte[b], imm8)

Intel C/C++ Compiler Intrinsic Equivalent


(V)GF2P8AFFINEINVQB __m128i _mm_gf2p8affineinv_epi64_epi8(__m128i, __m128i, int);
(V)GF2P8AFFINEINVQB __m128i _mm_mask_gf2p8affineinv_epi64_epi8(__m128i, __mmask16, __m128i, __m128i, int);
(V)GF2P8AFFINEINVQB __m128i _mm_maskz_gf2p8affineinv_epi64_epi8(__mmask16, __m128i, __m128i, int);
VGF2P8AFFINEINVQB __m256i _mm256_gf2p8affineinv_epi64_epi8(__m256i, __m256i, int);
VGF2P8AFFINEINVQB __m256i _mm256_mask_gf2p8affineinv_epi64_epi8(__m256i, __mmask32, __m256i, __m256i, int);
VGF2P8AFFINEINVQB __m256i _mm256_maskz_gf2p8affineinv_epi64_epi8(__mmask32, __m256i, __m256i, int);
VGF2P8AFFINEINVQB __m512i _mm512_gf2p8affineinv_epi64_epi8(__m512i, __m512i, int);
VGF2P8AFFINEINVQB __m512i _mm512_mask_gf2p8affineinv_epi64_epi8(__m512i, __mmask64, __m512i, __m512i, int);
VGF2P8AFFINEINVQB __m512i _mm512_maskz_gf2p8affineinv_epi64_epi8(__mmask64, __m512i, __m512i, int);
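
As an informative sketch, choosing “A” as the bit-identity matrix (derived from the byte ordering in the Operation section above, byte[7-i] = 1 << i, i.e., qword 0x0102040810204080) and b = 0 makes the instruction return the field inverse of each byte (assumes GFNI and SSE4.1 support):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i A = _mm_set1_epi64x(0x0102040810204080LL); /* bit-identity matrix */
    __m128i x = _mm_set1_epi8((char)0x95);
    __m128i inv = _mm_gf2p8affineinv_epi64_epi8(x, A, 0);
    printf("%02X\n", (unsigned char)_mm_extract_epi8(inv, 0)); /* prints: 8A, per Table 3-59 */
    return 0;
}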

SIMD Floating-Point Exceptions


None.

Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



GF2P8AFFINEQB—Galois Field Affine Transformation
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F3A CE /r /ib A V/V GFNI Computes affine transformation in the finite
GF2P8AFFINEQB xmm1, field GF(2^8).
xmm2/m128, imm8
VEX.128.66.0F3A.W1 CE /r /ib B V/V AVX Computes affine transformation in the finite
VGF2P8AFFINEQB xmm1, xmm2, GFNI field GF(2^8).
xmm3/m128, imm8
VEX.256.66.0F3A.W1 CE /r /ib B V/V AVX Computes affine transformation in the finite
VGF2P8AFFINEQB ymm1, ymm2, GFNI field GF(2^8).
ymm3/m256, imm8
EVEX.128.66.0F3A.W1 CE /r /ib C V/V (AVX512VL Computes affine transformation in the finite
VGF2P8AFFINEQB xmm1{k1}{z}, OR AVX10.11) field GF(2^8).
xmm2, xmm3/m128/m64bcst, imm8 GFNI
EVEX.256.66.0F3A.W1 CE /r /ib C V/V (AVX512VL Computes affine transformation in the finite
VGF2P8AFFINEQB ymm1{k1}{z}, OR AVX10.11) field GF(2^8).
ymm2, ymm3/m256/m64bcst, imm8 GFNI
EVEX.512.66.0F3A.W1 CE /r /ib C V/V (AVX512F Computes affine transformation in the finite
VGF2P8AFFINEQB zmm1{k1}{z}, OR AVX10.11) field GF(2^8).
zmm2, zmm3/m512/m64bcst, imm8 GFNI

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 (r) N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
The AFFINEB instruction computes an affine transformation in the Galois field GF(2^8). For this instruction, an affine
transformation is defined by A * x + b where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. One SIMD
register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register or memory
operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x” values in the
first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.

Operation
define parity(x):
t := 0 // single bit
FOR i := 0 to 7:
t := t XOR x.bit[i]
return t

define affine_byte(tsrc2qw, src1byte, imm):



FOR i := 0 to 7:
* parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
retbyte.bit[i] := parity(tsrc2qw.byte[7-i] AND src1byte) XOR imm8.bit[i]
return retbyte

VGF2P8AFFINEQB dest, src1, src2, imm8 (EVEX Encoded Version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC2 is memory and EVEX.b==1:
tsrc2 := SRC2.qword[0]
ELSE:
tsrc2 := SRC2.qword[j]

FOR b := 0 to 7:
IF k1[j*8+b] OR *no writemask*:
DEST.qword[j].byte[b] := affine_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] := 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VGF2P8AFFINEQB dest, src1, src2, imm8 (128b and 256b VEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256)
FOR j := 0 TO KL-1:
FOR b := 0 to 7:
DEST.qword[j].byte[b] := affine_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] := 0

GF2P8AFFINEQB srcdest, src1, imm8 (128b SSE Encoded Version)


FOR j := 0 TO 1:
FOR b := 0 to 7:
SRCDEST.qword[j].byte[b] := affine_byte(SRC1.qword[j], SRCDEST.qword[j].byte[b], imm8)

Intel C/C++ Compiler Intrinsic Equivalent


(V)GF2P8AFFINEQB __m128i _mm_gf2p8affine_epi64_epi8(__m128i, __m128i, int);
(V)GF2P8AFFINEQB __m128i _mm_mask_gf2p8affine_epi64_epi8(__m128i, __mmask16, __m128i, __m128i, int);
(V)GF2P8AFFINEQB __m128i _mm_maskz_gf2p8affine_epi64_epi8(__mmask16, __m128i, __m128i, int);
VGF2P8AFFINEQB __m256i _mm256_gf2p8affine_epi64_epi8(__m256i, __m256i, int);
VGF2P8AFFINEQB __m256i _mm256_mask_gf2p8affine_epi64_epi8(__m256i, __mmask32, __m256i, __m256i, int);
VGF2P8AFFINEQB __m256i _mm256_maskz_gf2p8affine_epi64_epi8(__mmask32, __m256i, __m256i, int);
VGF2P8AFFINEQB __m512i _mm512_gf2p8affine_epi64_epi8(__m512i, __m512i, int);
VGF2P8AFFINEQB __m512i _mm512_mask_gf2p8affine_epi64_epi8(__m512i, __mmask64, __m512i, __m512i, int);
VGF2P8AFFINEQB __m512i _mm512_maskz_gf2p8affine_epi64_epi8(__mmask64, __m512i, __m512i, int);
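
As an informative sketch, choosing “A” with byte[j] = 1 << j (qword 0x8040201008040201) and b = 0 reverses the bit order within every byte, a common non-cryptographic use of this instruction (assumes GFNI and SSE4.1 support):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i A = _mm_set1_epi64x((long long)0x8040201008040201ULL); /* bit-reversal matrix */
    __m128i x = _mm_set1_epi8(0x01);
    __m128i r = _mm_gf2p8affine_epi64_epi8(x, A, 0);
    printf("%02X\n", (unsigned char)_mm_extract_epi8(r, 0)); /* prints: 80 */
    return 0;
}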

SIMD Floating-Point Exceptions


None.

Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”



GF2P8MULB—Galois Field Multiply Bytes
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F38 CF /r A V/V GFNI Multiplies elements in the finite field GF(2^8).
GF2P8MULB xmm1, xmm2/m128
VEX.128.66.0F38.W0 CF /r B V/V AVX Multiplies elements in the finite field GF(2^8).
VGF2P8MULB xmm1, xmm2, GFNI
xmm3/m128
VEX.256.66.0F38.W0 CF /r B V/V AVX Multiplies elements in the finite field GF(2^8).
VGF2P8MULB ymm1, ymm2, GFNI
ymm3/m256
EVEX.128.66.0F38.W0 CF /r C V/V (AVX512VL Multiplies elements in the finite field GF(2^8).
VGF2P8MULB xmm1{k1}{z}, xmm2, OR AVX10.11)
xmm3/m128 GFNI
EVEX.256.66.0F38.W0 CF /r C V/V (AVX512VL Multiplies elements in the finite field GF(2^8).
VGF2P8MULB ymm1{k1}{z}, ymm2, OR AVX10.11)
ymm3/m256 GFNI
EVEX.512.66.0F38.W0 CF /r C V/V (AVX512F Multiplies elements in the finite field GF(2^8).
VGF2P8MULB zmm1{k1}{z}, zmm2, OR AVX10.11)
zmm3/m512 GFNI

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
The instruction multiplies elements in the finite field GF(2^8), operating on a byte (field element) in the first source
operand and the corresponding byte in a second source operand. The field GF(2^8) is represented in polynomial
representation with the reduction polynomial x^8 + x^4 + x^3 + x + 1.
This instruction does not support broadcasting.
The EVEX encoded form of this instruction supports memory fault suppression. The SSE encoded forms of the
instruction require 16B alignment on their memory operations.



Operation
define gf2p8mul_byte(src1byte, src2byte):
tword := 0
FOR i := 0 to 7:
IF src2byte.bit[i]:
tword := tword XOR (src1byte << i)
* carry out polynomial reduction by the characteristic polynomial p*
FOR i := 14 downto 8:
p := 0x11B << (i-8) *0x11B = 0000_0001_0001_1011 in binary*
IF tword.bit[i]:
tword := tword XOR p
return tword.byte[0]
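
The same computation expressed as a scalar C model (informative; a direct transcription of gf2p8mul_byte above):

#include <stdint.h>

static uint8_t gf2p8mul_byte(uint8_t src1byte, uint8_t src2byte)
{
    uint16_t tword = 0;
    for (int i = 0; i < 8; i++)          /* carry-less (XOR) multiply */
        if (src2byte & (1u << i))
            tword ^= (uint16_t)(src1byte << i);
    for (int i = 14; i >= 8; i--)        /* reduce by x^8 + x^4 + x^3 + x + 1 (0x11B) */
        if (tword & (1u << i))
            tword ^= (uint16_t)(0x11B << (i - 8));
    return (uint8_t)tword;               /* tword.byte[0] */
}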

VGF2P8MULB dest, src1, src2 (EVEX Encoded Version)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[j] := gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
ELSE IF *zeroing*:
DEST.byte[j] := 0
* ELSE DEST.byte[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VGF2P8MULB dest, src1, src2 (128b and 256b VEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256)
FOR j := 0 TO KL-1:
DEST.byte[j] := gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
DEST[MAX_VL-1:VL] := 0

GF2P8MULB srcdest, src1 (128b SSE Encoded Version)


FOR j := 0 TO 15:
SRCDEST.byte[j] := gf2p8mul_byte(SRCDEST.byte[j], SRC1.byte[j])

Intel C/C++ Compiler Intrinsic Equivalent


(V)GF2P8MULB __m128i _mm_gf2p8mul_epi8(__m128i, __m128i);
(V)GF2P8MULB __m128i _mm_mask_gf2p8mul_epi8(__m128i, __mmask16, __m128i, __m128i);
(V)GF2P8MULB __m128i _mm_maskz_gf2p8mul_epi8(__mmask16, __m128i, __m128i);
VGF2P8MULB __m256i _mm256_gf2p8mul_epi8(__m256i, __m256i);
VGF2P8MULB __m256i _mm256_mask_gf2p8mul_epi8(__m256i, __mmask32, __m256i, __m256i);
VGF2P8MULB __m256i _mm256_maskz_gf2p8mul_epi8(__mmask32, __m256i, __m256i);
VGF2P8MULB __m512i _mm512_gf2p8mul_epi8(__m512i, __m512i);
VGF2P8MULB __m512i _mm512_mask_gf2p8mul_epi8(__m512i, __mmask64, __m512i, __m512i);
VGF2P8MULB __m512i _mm512_maskz_gf2p8mul_epi8(__mmask64, __m512i, __m512i);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-51, “Type E4 Class Exception Conditions.”



INSERTPS—Insert Scalar Single Precision Floating-Point Value
Opcode/ Op / En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
66 0F 3A 21 /r ib A V/V SSE4_1 Insert a single precision floating-point value selected
INSERTPS xmm1, xmm2/m32, imm8 by imm8 from xmm2/m32 into xmm1 at the specified
destination element specified by imm8 and zero out
destination elements in xmm1 as indicated in imm8.
VEX.128.66.0F3A.WIG 21 /r ib B V/V AVX Insert a single precision floating-point value selected
VINSERTPS xmm1, xmm2, xmm3/m32, by imm8 from xmm3/m32 and merge with values in
imm8 xmm2 at the specified destination element specified
by imm8 and write out the result and zero out
destination elements in xmm1 as indicated in imm8.
EVEX.128.66.0F3A.W0 21 /r ib C V/V AVX512F Insert a single precision floating-point value selected
VINSERTPS xmm1, xmm2, xmm3/m32, OR AVX10.11 by imm8 from xmm3/m32 and merge with values in
imm8 xmm2 at the specified destination element specified
by imm8 and write out the result and zero out
destination elements in xmm1 as indicated in imm8.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
(register source form)
Copy a single precision scalar floating-point element into a 128-bit vector register. The immediate operand has
three fields, where the ZMask bits specify which elements of the destination will be set to zero, the Count_D bits
specify which element of the destination will be overwritten with the scalar value, and for vector register sources
the Count_S bits specify which element of the source will be copied. When the scalar source is a memory operand
the Count_S bits are ignored.
(memory source form)
Load a floating-point element from a 32-bit memory location and insert it into the first source operand at the
location indicated by the Count_D bits of the immediate operand. Store the result in the destination and zero out
destination elements based on the ZMask bits of the immediate operand.
128-bit Legacy SSE version: The first source register is an XMM register. The second source operand is either an
XMM register or a 32-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
VEX.128 and EVEX encoded version: The destination and first source register is an XMM register. The second
source operand is either an XMM register or a 32-bit memory location. The upper bits (MAXVL-1:128) of the corre-
sponding register destination are zeroed.
An attempt to execute VINSERTPS encoded with VEX.L = 1 will cause an #UD exception.



Operation
VINSERTPS (VEX.128 and EVEX Encoded Version)
IF (SRC = REG) THEN COUNT_S := imm8[7:6]
ELSE COUNT_S := 0
COUNT_D := imm8[5:4]
ZMASK := imm8[3:0]
CASE (COUNT_S) OF
0: TMP := SRC2[31:0]
1: TMP := SRC2[63:32]
2: TMP := SRC2[95:64]
3: TMP := SRC2[127:96]
ESAC;
CASE (COUNT_D) OF
0: TMP2[31:0] := TMP
TMP2[127:32] := SRC1[127:32]
1: TMP2[63:32] := TMP
TMP2[31:0] := SRC1[31:0]
TMP2[127:64] := SRC1[127:64]
2: TMP2[95:64] := TMP
TMP2[63:0] := SRC1[63:0]
TMP2[127:96] := SRC1[127:96]
3: TMP2[127:96] := TMP
TMP2[95:0] := SRC1[95:0]
ESAC;

IF (ZMASK[0] = 1) THEN DEST[31:0] := 00000000H
ELSE DEST[31:0] := TMP2[31:0]
IF (ZMASK[1] = 1) THEN DEST[63:32] := 00000000H
ELSE DEST[63:32] := TMP2[63:32]
IF (ZMASK[2] = 1) THEN DEST[95:64] := 00000000H
ELSE DEST[95:64] := TMP2[95:64]
IF (ZMASK[3] = 1) THEN DEST[127:96] := 00000000H
ELSE DEST[127:96] := TMP2[127:96]
DEST[MAXVL-1:128] := 0

INSERTPS (128-bit Legacy SSE Version)


IF (SRC = REG) THEN COUNT_S := imm8[7:6]
ELSE COUNT_S := 0
COUNT_D := imm8[5:4]
ZMASK := imm8[3:0]
CASE (COUNT_S) OF
0: TMP := SRC[31:0]
1: TMP := SRC[63:32]
2: TMP := SRC[95:64]
3: TMP := SRC[127:96]
ESAC;

CASE (COUNT_D) OF
0: TMP2[31:0] := TMP
TMP2[127:32] := DEST[127:32]
1: TMP2[63:32] := TMP
TMP2[31:0] := DEST[31:0]
TMP2[127:64] := DEST[127:64]
2: TMP2[95:64] := TMP
TMP2[63:0] := DEST[63:0]
TMP2[127:96] := DEST[127:96]
3: TMP2[127:96] := TMP
TMP2[95:0] := DEST[95:0]
ESAC;

IF (ZMASK[0] = 1) THEN DEST[31:0] := 00000000H


ELSE DEST[31:0] := TMP2[31:0]
IF (ZMASK[1] = 1) THEN DEST[63:32] := 00000000H
ELSE DEST[63:32] := TMP2[63:32]
IF (ZMASK[2] = 1) THEN DEST[95:64] := 00000000H
ELSE DEST[95:64] := TMP2[95:64]
IF (ZMASK[3] = 1) THEN DEST[127:96] := 00000000H
ELSE DEST[127:96] := TMP2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VINSERTPS __m128 _mm_insert_ps(__m128 dst, __m128 src, const int nidx);
INSERTPS __m128 _mm_insert_ps(__m128 dst, __m128 src, const int nidx);
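
The imm8 fields can be composed directly in source code. A minimal sketch using the SSE4.1 intrinsic (the macro and
function names are illustrative, not part of the manual): select element 2 of the source, write it to element 1 of the
destination, and zero element 3.

#include <smmintrin.h> /* SSE4.1: _mm_insert_ps */

/* imm8 layout: Count_S in bits 7:6, Count_D in bits 5:4, ZMask in bits 3:0. */
#define INSERTPS_IMM(src_elem, dst_elem, zmask) \
    (((src_elem) << 6) | ((dst_elem) << 4) | (zmask))

__m128 demo_insertps(__m128 dst, __m128 src)
{
    /* Take src[2], place it in dst[1], zero dst[3]: imm8 = 0x98. */
    return _mm_insert_ps(dst, src, INSERTPS_IMM(2, 1, 0x8));
}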

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”

KADDW/KADDB/KADDQ/KADDD—ADD Two Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 4A /r RVR V/V AVX512DQ Add 16-bit masks in k2 and k3 and place result in k1.
KADDW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 4A /r RVR V/V AVX512DQ Add 8-bit masks in k2 and k3 and place result in k1.
KADDB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 4A /r RVR V/V AVX512BW Add 64-bit masks in k2 and k3 and place result in k1.
KADDQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 4A /r RVR V/V AVX512BW Add 32-bit masks in k2 and k3 and place result in k1.
KADDD k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Adds the vector mask k2 and the vector mask k3, and writes the result into vector mask k1.

Operation
KADDW
DEST[15:0] := SRC1[15:0] + SRC2[15:0]
DEST[MAX_KL-1:16] := 0

KADDB
DEST[7:0] := SRC1[7:0] + SRC2[7:0]
DEST[MAX_KL-1:8] := 0

KADDQ
DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[MAX_KL-1:64] := 0

KADDD
DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KADDW __mmask16 _kadd_mask16 (__mmask16 a, __mmask16 b);
KADDB __mmask8 _kadd_mask8 (__mmask8 a, __mmask8 b);
KADDQ __mmask64 _kadd_mask64 (__mmask64 a, __mmask64 b);
KADDD __mmask32 _kadd_mask32 (__mmask32 a, __mmask32 b);
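
As a minimal usage sketch (assuming <immintrin.h> and AVX512DQ support; the function name is illustrative), note
that KADD is an integer addition of the masks, not a lane-wise OR:

#include <immintrin.h> /* AVX512DQ: _kadd_mask16 */

__mmask16 demo_kadd(__mmask16 a, __mmask16 b)
{
    /* e.g., a = 0x00FF, b = 0x0001 -> 0x0100 (the carry propagates). */
    return _kadd_mask16(a, b);
}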

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KANDNW/KANDNB/KANDNQ/KANDND—Bitwise Logical AND NOT Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 42 /r RVR V/V AVX512F Bitwise AND NOT 16-bit masks k2 and k3 and place result in k1.
KANDNW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 42 /r RVR V/V AVX512DQ Bitwise AND NOT 8-bit masks k2 and k3 and place result in k1.
KANDNB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 42 /r RVR V/V AVX512BW Bitwise AND NOT 64-bit masks k2 and k3 and place result in k1.
KANDNQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 42 /r RVR V/V AVX512BW Bitwise AND NOT 32-bit masks k2 and k3 and place result in k1.
KANDND k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise AND NOT between the vector mask k2 and the vector mask k3, and writes the result into vector
mask k1.

Operation
KANDNW
DEST[15:0] := (BITWISE NOT SRC1[15:0]) BITWISE AND SRC2[15:0]
DEST[MAX_KL-1:16] := 0

KANDNB
DEST[7:0] := (BITWISE NOT SRC1[7:0]) BITWISE AND SRC2[7:0]
DEST[MAX_KL-1:8] := 0

KANDNQ
DEST[63:0] := (BITWISE NOT SRC1[63:0]) BITWISE AND SRC2[63:0]
DEST[MAX_KL-1:64] := 0

KANDND
DEST[31:0] := (BITWISE NOT SRC1[31:0]) BITWISE AND SRC2[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KANDNW __mmask16 _mm512_kandn(__mmask16 a, __mmask16 b);
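
A typical use of the intrinsic (a sketch; the function name is illustrative) is selecting the lanes of one predicate
that are not set in another; note that the first argument is the inverted operand:

#include <immintrin.h> /* AVX512F: _mm512_kandn */

/* Lanes still pending: (NOT done) AND all. */
__mmask16 remaining_lanes(__mmask16 done, __mmask16 all)
{
    return _mm512_kandn(done, all);
}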

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KANDW/KANDB/KANDQ/KANDD—Bitwise Logical AND Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 41 /r RVR V/V AVX512F Bitwise AND 16-bit masks k2 and k3 and place result in k1.
KANDW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 41 /r RVR V/V AVX512DQ Bitwise AND 8-bit masks k2 and k3 and place result in k1.
KANDB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 41 /r RVR V/V AVX512BW Bitwise AND 64-bit masks k2 and k3 and place result in k1.
KANDQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 41 /r RVR V/V AVX512BW Bitwise AND 32-bit masks k2 and k3 and place result in k1.
KANDD k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise AND between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1.

Operation
KANDW
DEST[15:0] := SRC1[15:0] BITWISE AND SRC2[15:0]
DEST[MAX_KL-1:16] := 0

KANDB
DEST[7:0] := SRC1[7:0] BITWISE AND SRC2[7:0]
DEST[MAX_KL-1:8] := 0

KANDQ
DEST[63:0] := SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[MAX_KL-1:64] := 0

KANDD
DEST[31:0] := SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KANDW __mmask16 _mm512_kand(__mmask16 a, __mmask16 b);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KMOVW/KMOVB/KMOVQ/KMOVD—Move From and to Mask Registers
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.0F.W0 90 /r RM V/V AVX512F Move 16-bit mask from k2/m16 and store the result in k1.
KMOVW k1, k2/m16 OR AVX10.1
VEX.L0.66.0F.W0 90 /r RM V/V AVX512DQ Move 8-bit mask from k2/m8 and store the result in k1.
KMOVB k1, k2/m8 OR AVX10.1
VEX.L0.0F.W1 90 /r RM V/V AVX512BW Move 64-bit mask from k2/m64 and store the result in k1.
KMOVQ k1, k2/m64 OR AVX10.1
VEX.L0.66.0F.W1 90 /r RM V/V AVX512BW Move 32-bit mask from k2/m32 and store the result in k1.
KMOVD k1, k2/m32 OR AVX10.1
VEX.L0.0F.W0 91 /r MR V/V AVX512F Move 16-bit mask from k1 and store the result in m16.
KMOVW m16, k1 OR AVX10.1
VEX.L0.66.0F.W0 91 /r MR V/V AVX512DQ Move 8-bit mask from k1 and store the result in m8.
KMOVB m8, k1 OR AVX10.1
VEX.L0.0F.W1 91 /r MR V/V AVX512BW Move 64-bit mask from k1 and store the result in m64.
KMOVQ m64, k1 OR AVX10.1
VEX.L0.66.0F.W1 91 /r MR V/V AVX512BW Move 32-bit mask from k1 and store the result in m32.
KMOVD m32, k1 OR AVX10.1
VEX.L0.0F.W0 92 /r RR V/V AVX512F Move 16-bit mask from r32 to k1.
KMOVW k1, r32 OR AVX10.1
VEX.L0.66.0F.W0 92 /r RR V/V AVX512DQ Move 8-bit mask from r32 to k1.
KMOVB k1, r32 OR AVX10.1
VEX.L0.F2.0F.W1 92 /r RR V/I AVX512BW Move 64-bit mask from r64 to k1.
KMOVQ k1, r64 OR AVX10.1
VEX.L0.F2.0F.W0 92 /r RR V/V AVX512BW Move 32-bit mask from r32 to k1.
KMOVD k1, r32 OR AVX10.1
VEX.L0.0F.W0 93 /r RR V/V AVX512F Move 16-bit mask from k1 to r32.
KMOVW r32, k1 OR AVX10.1
VEX.L0.66.0F.W0 93 /r RR V/V AVX512DQ Move 8-bit mask from k1 to r32.
KMOVB r32, k1 OR AVX10.1
VEX.L0.F2.0F.W1 93 /r RR V/I AVX512BW Move 64-bit mask from k1 to r64.
KMOVQ r64, k1 OR AVX10.1
VEX.L0.F2.0F.W0 93 /r RR V/V AVX512BW Move 32-bit mask from k1 to r32.
KMOVD r32, k1 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2
RM ModRM:reg (w) ModRM:r/m (r)
MR ModRM:r/m (w, ModRM:[7:6] must not be 11b) ModRM:reg (r)
RR ModRM:reg (w) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Copies values from the source operand (second operand) to the destination operand (first operand). The source
and destination operands can be mask registers, a memory location, or a general-purpose register. The instruction
cannot be used to transfer data between general-purpose registers or between memory locations.

When moving to a mask register, the result is zero-extended to MAX_KL size (i.e., 64 bits currently). When moving
to a general-purpose register (GPR), the result is zero-extended to the size of the destination. In 32-bit mode, the
default GPR destination’s size is 32 bits. In 64-bit mode, the default GPR destination’s size is 64 bits. Note that
VEX.W can only be used to modify the size of the GPR operand in 64-bit mode.

Operation
KMOVW
IF *destination is a memory location*
DEST[15:0] := SRC[15:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[15:0])

KMOVB
IF *destination is a memory location*
DEST[7:0] := SRC[7:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[7:0])

KMOVQ
IF *destination is a memory location or a GPR*
DEST[63:0] := SRC[63:0]
IF *destination is a mask register*
DEST := ZeroExtension(SRC[63:0])

KMOVD
IF *destination is a memory location*
DEST[31:0] := SRC[31:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[31:0])

Intel C/C++ Compiler Intrinsic Equivalent


KMOVW __mmask16 _mm512_kmov(__mmask16 a);
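
A mask value can round-trip through a GPR for arbitrary bit manipulation. A sketch (illustrative name; assuming
the _cvtmask16_u32/_cvtu32_mask16 intrinsics from <immintrin.h>, which map to KMOVW):

#include <immintrin.h> /* AVX512F: _cvtmask16_u32, _cvtu32_mask16 */

__mmask16 toggle_lane0(__mmask16 k)
{
    unsigned int bits = _cvtmask16_u32(k); /* KMOVW r32, k */
    bits ^= 1u;                            /* GPR bit work */
    return _cvtu32_mask16(bits);           /* KMOVW k, r32 */
}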

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Instructions with RR operand encoding, see Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask
Instructions w/o Memory Arg).”
Instructions with RM or MR operand encoding, see Table 2-66, “TYPE K21 Exception Definition (VEX-Encoded
OpMask Instructions Addressing Memory).”

KNOTW/KNOTB/KNOTQ/KNOTD—NOT Mask Register
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.0F.W0 44 /r RR V/V AVX512F Bitwise NOT of 16-bit mask k2.
KNOTW k1, k2 OR AVX10.1
VEX.L0.66.0F.W0 44 /r RR V/V AVX512DQ Bitwise NOT of 8-bit mask k2.
KNOTB k1, k2 OR AVX10.1
VEX.L0.0F.W1 44 /r RR V/V AVX512BW Bitwise NOT of 64-bit mask k2.
KNOTQ k1, k2 OR AVX10.1
VEX.L0.66.0F.W1 44 /r RR V/V AVX512BW Bitwise NOT of 32-bit mask k2.
KNOTD k1, k2 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2
RR ModRM:reg (w) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise NOT of vector mask k2 and writes the result into vector mask k1.

Operation
KNOTW
DEST[15:0] := BITWISE NOT SRC[15:0]
DEST[MAX_KL-1:16] := 0

KNOTB
DEST[7:0] := BITWISE NOT SRC[7:0]
DEST[MAX_KL-1:8] := 0

KNOTQ
DEST[63:0] := BITWISE NOT SRC[63:0]
DEST[MAX_KL-1:64] := 0

KNOTD
DEST[31:0] := BITWISE NOT SRC[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KNOTW __mmask16 _mm512_knot(__mmask16 a);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KORTESTW/KORTESTB/KORTESTQ/KORTESTD—OR Masks and Set Flags
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.0F.W0 98 /r RR V/V AVX512F Bitwise OR 16-bit masks k1 and k2 and update ZF and CF accordingly.
KORTESTW k1, k2 OR AVX10.1
VEX.L0.66.0F.W0 98 /r RR V/V AVX512DQ Bitwise OR 8-bit masks k1 and k2 and update ZF and CF accordingly.
KORTESTB k1, k2 OR AVX10.1
VEX.L0.0F.W1 98 /r RR V/V AVX512BW Bitwise OR 64-bit masks k1 and k2 and update ZF and CF accordingly.
KORTESTQ k1, k2 OR AVX10.1
VEX.L0.66.0F.W1 98 /r RR V/V AVX512BW Bitwise OR 32-bit masks k1 and k2 and update ZF and CF accordingly.
KORTESTD k1, k2 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2
RR ModRM:reg (w) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise OR between the vector mask registers k1 and k2, and sets CF and ZF based on the result of the
operation. The ZF flag is set if the result of the OR is all 0s; the CF flag is set if the result of the OR is all 1s.

Operation
KORTESTW
TMP[15:0] := DEST[15:0] BITWISE OR SRC[15:0]
IF(TMP[15:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[15:0]=FFFFh)
THEN CF := 1
ELSE CF := 0
FI;

KORTESTB
TMP[7:0] := DEST[7:0] BITWISE OR SRC[7:0]
IF(TMP[7:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[7:0]=FFh)
THEN CF := 1
ELSE CF := 0
FI;

KORTESTQ
TMP[63:0] := DEST[63:0] BITWISE OR SRC[63:0]
IF(TMP[63:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[63:0]=FFFFFFFF_FFFFFFFFh)
THEN CF := 1
ELSE CF := 0
FI;

KORTESTD
TMP[31:0] := DEST[31:0] BITWISE OR SRC[31:0]
IF(TMP[31:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[31:0]=FFFFFFFFh)
THEN CF := 1
ELSE CF := 0
FI;

Intel C/C++ Compiler Intrinsic Equivalent


KORTESTW __mmask16 _mm512_kortest[cz](__mmask16 a, __mmask16 b);
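
KORTEST is the usual way to test a loop-exit predicate without moving the mask to a GPR. A sketch (illustrative
names; assuming <immintrin.h> and AVX512F support) that iterates until no lane remains active:

#include <immintrin.h> /* AVX512F: _mm512_kortestz */

void demo_loop(__m512i x, __m512i limit)
{
    __mmask16 active = 0xFFFF;
    while (!_mm512_kortestz(active, active)) { /* ZF=1 when the OR is all 0s */
        x = _mm512_add_epi32(x, _mm512_set1_epi32(1)); /* work on active lanes */
        active = _mm512_mask_cmplt_epi32_mask(active, x, limit);
    }
}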

Flags Affected
The ZF flag is set if the result of OR-ing both sources is all 0s.
The CF flag is set if the result of OR-ing both sources is all 1s.
The OF, SF, AF, and PF flags are set to 0.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KORW/KORB/KORQ/KORD—Bitwise Logical OR Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 45 /r RVR V/V AVX512F Bitwise OR 16-bit masks k2 and k3 and place result in k1.
KORW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 45 /r RVR V/V AVX512DQ Bitwise OR 8-bit masks k2 and k3 and place result in k1.
KORB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 45 /r RVR V/V AVX512BW Bitwise OR 64-bit masks k2 and k3 and place result in k1.
KORQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 45 /r RVR V/V AVX512BW Bitwise OR 32-bit masks k2 and k3 and place result in k1.
KORD k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise OR between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1 (three-operand form).

Operation
KORW
DEST[15:0] := SRC1[15:0] BITWISE OR SRC2[15:0]
DEST[MAX_KL-1:16] := 0

KORB
DEST[7:0] := SRC1[7:0] BITWISE OR SRC2[7:0]
DEST[MAX_KL-1:8] := 0

KORQ
DEST[63:0] := SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[MAX_KL-1:64] := 0

KORD
DEST[31:0] := SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KORW __mmask16 _mm512_kor(__mmask16 a, __mmask16 b);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KSHIFTLW/KSHIFTLB/KSHIFTLQ/KSHIFTLD—Shift Left Mask Registers
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.66.0F3A.W1 32 /r RRI V/V AVX512F Shift left 16 bits in k2 by immediate and write result in k1.
KSHIFTLW k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W0 32 /r RRI V/V AVX512DQ Shift left 8 bits in k2 by immediate and write result in k1.
KSHIFTLB k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W1 33 /r RRI V/V AVX512BW Shift left 64 bits in k2 by immediate and write result in k1.
KSHIFTLQ k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W0 33 /r RRI V/V AVX512BW Shift left 32 bits in k2 by immediate and write result in k1.
KSHIFTLD k1, k2, imm8 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RRI ModRM:reg (w) ModRM:r/m (r, ModRM:[7:6] must be 11b) imm8

Description
Shifts 8/16/32/64 bits in the second operand (source operand) left by the count specified in the immediate byte
and places the least significant 8/16/32/64 bits of the result in the destination operand. The higher bits of the
destination are zero-extended. The destination is set to zero if the count value is greater than 7 (for a byte shift),
15 (for a word shift), 31 (for a doubleword shift), or 63 (for a quadword shift).

Operation
KSHIFTLW
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=15
THEN DEST[15:0] := SRC1[15:0] << COUNT;
FI;

KSHIFTLB
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=7
THEN DEST[7:0] := SRC1[7:0] << COUNT;
FI;

KSHIFTLQ
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=63
THEN DEST[63:0] := SRC1[63:0] << COUNT;
FI;

KSHIFTLD
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=31
THEN DEST[31:0] := SRC1[31:0] << COUNT;
FI;

Intel C/C++ Compiler Intrinsic Equivalent


The compiler auto-generates KSHIFTLW when needed.
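
When invoked explicitly, the shift is exposed through the _kshiftli_* intrinsic family (an assumption based on
current <immintrin.h> headers; the count must be a compile-time constant):

#include <immintrin.h> /* AVX512F: _kshiftli_mask16 */

__mmask16 demo_kshiftl(__mmask16 k)
{
    /* e.g., k = 0x8001 -> 0x0004 after a shift by 2 (bit 15 shifts out). */
    return _kshiftli_mask16(k, 2);
}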

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KSHIFTRW/KSHIFTRB/KSHIFTRQ/KSHIFTRD—Shift Right Mask Registers
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.66.0F3A.W1 30 /r RRI V/V AVX512F Shift right 16 bits in k2 by immediate and write result in k1.
KSHIFTRW k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W0 30 /r RRI V/V AVX512DQ Shift right 8 bits in k2 by immediate and write result in k1.
KSHIFTRB k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W1 31 /r RRI V/V AVX512BW Shift right 64 bits in k2 by immediate and write result in k1.
KSHIFTRQ k1, k2, imm8 OR AVX10.1
VEX.L0.66.0F3A.W0 31 /r RRI V/V AVX512BW Shift right 32 bits in k2 by immediate and write result in k1.
KSHIFTRD k1, k2, imm8 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RRI ModRM:reg (w) ModRM:r/m (r, ModRM:[7:6] must be 11b) imm8

Description
Shifts 8/16/32/64 bits in the second operand (source operand) right by the count specified in the immediate byte
and places the least significant 8/16/32/64 bits of the result in the destination operand. The higher bits of the
destination are zero-extended. The destination is set to zero if the count value is greater than 7 (for a byte shift),
15 (for a word shift), 31 (for a doubleword shift), or 63 (for a quadword shift).

Operation
KSHIFTRW
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=15
THEN DEST[15:0] := SRC1[15:0] >> COUNT;
FI;

KSHIFTRB
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=7
THEN DEST[7:0] := SRC1[7:0] >> COUNT;
FI;

KSHIFTRQ
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=63
THEN DEST[63:0] := SRC1[63:0] >> COUNT;
FI;

KSHIFTRD
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=31
THEN DEST[31:0] := SRC1[31:0] >> COUNT;
FI;

Intel C/C++ Compiler Intrinsic Equivalent


The compiler auto-generates KSHIFTRW when needed.

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KTESTW/KTESTB/KTESTQ/KTESTD—Packed Bit Test Masks and Set Flags
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L0.0F.W0 99 /r RR V/V AVX512DQ Set ZF and CF depending on sign bit AND and ANDN of 16-bit mask
KTESTW k1, k2 OR AVX10.1 register sources.
VEX.L0.66.0F.W0 99 /r RR V/V AVX512DQ Set ZF and CF depending on sign bit AND and ANDN of 8-bit mask
KTESTB k1, k2 OR AVX10.1 register sources.
VEX.L0.0F.W1 99 /r RR V/V AVX512BW Set ZF and CF depending on sign bit AND and ANDN of 64-bit mask
KTESTQ k1, k2 OR AVX10.1 register sources.
VEX.L0.66.0F.W1 99 /r RR V/V AVX512BW Set ZF and CF depending on sign bit AND and ANDN of 32-bit mask
KTESTD k1, k2 OR AVX10.1 register sources.

Instruction Operand Encoding


Op/En Operand 1 Operand 2
RR ModRM:reg (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise comparison of the bits of the first source operand and corresponding bits in the second source
operand. If the AND operation produces all zeros, ZF is set; otherwise ZF is cleared. If the bitwise AND of the
inverted first source operand with the second source operand produces all zeros, CF is set; otherwise CF is
cleared. Only the EFLAGS register is updated.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Operation
KTESTW
TEMP[15:0] := SRC2[15:0] AND SRC1[15:0]
IF (TEMP[15:0] = 0)
THEN ZF := 1;
ELSE ZF := 0;
FI;
TEMP[15:0] := SRC2[15:0] AND NOT SRC1[15:0]
IF (TEMP[15:0] = 0)
THEN CF := 1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;

KTESTB
TEMP[7:0] := SRC2[7:0] AND SRC1[7:0]
IF (TEMP[7:0] = 0)
THEN ZF := 1;
ELSE ZF := 0;
FI;
TEMP[7:0] := SRC2[7:0] AND NOT SRC1[7:0]
IF (TEMP[7:0] = 0)
THEN CF := 1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;

KTESTQ
TEMP[63:0] := SRC2[63:0] AND SRC1[63:0]
IF (TEMP[63:0] = 0)
THEN ZF := 1;
ELSE ZF := 0;
FI;
TEMP[63:0] := SRC2[63:0] AND NOT SRC1[63:0]
IF (TEMP[63:0] = 0)
THEN CF := 1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;

KTESTD
TEMP[31:0] := SRC2[31:0] AND SRC1[31:0]
IF (TEMP[31:0] = 0)
THEN ZF := 1;
ELSE ZF := 0;
FI;
TEMP[31:0] := SRC2[31:0] AND NOT SRC1[31:0]
IF (TEMP[31:0] = 0)
THEN CF := 1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;

Intel C/C++ Compiler Intrinsic Equivalent
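
The manual lists no intrinsic here. As an assumption, recent <immintrin.h> headers expose KTEST through the
_ktestz/_ktestc forms, sketched below with illustrative function names:

#include <immintrin.h> /* AVX512DQ (assumed): _ktestz_mask16_u8, _ktestc_mask16_u8 */

int masks_disjoint(__mmask16 a, __mmask16 b)
{
    return _ktestz_mask16_u8(a, b); /* ZF: a AND b == 0 */
}

int b_subset_of_a(__mmask16 a, __mmask16 b)
{
    return _ktestc_mask16_u8(a, b); /* CF: (NOT a) AND b == 0 */
}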

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KUNPCKBW/KUNPCKWD/KUNPCKDQ—Unpack for Mask Registers
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.66.0F.W0 4B /r RVR V/V AVX512F Unpack 8-bit masks in k2 and k3 and write word result in k1.
KUNPCKBW k1, k2, k3 OR AVX10.1
VEX.L1.0F.W0 4B /r RVR V/V AVX512BW Unpack 16-bit masks in k2 and k3 and write doubleword result
KUNPCKWD k1, k2, k3 OR AVX10.1 in k1.
VEX.L1.0F.W1 4B /r RVR V/V AVX512BW Unpack 32-bit masks in k2 and k3 and write quadword result in
KUNPCKDQ k1, k2, k3 OR AVX10.1 k1.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Unpacks the lower 8/16/32 bits of the second and third operands (source operands) into the low part of the first
operand (destination operand), starting from the low bytes. The result is zero-extended in the destination.

Operation
KUNPCKBW
DEST[7:0] := SRC2[7:0]
DEST[15:8] := SRC1[7:0]
DEST[MAX_KL-1:16] := 0

KUNPCKWD
DEST[15:0] := SRC2[15:0]
DEST[31:16] := SRC1[15:0]
DEST[MAX_KL-1:32] := 0

KUNPCKDQ
DEST[31:0] := SRC2[31:0]
DEST[63:32] := SRC1[31:0]
DEST[MAX_KL-1:64] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KUNPCKBW __mmask16 _mm512_kunpackb(__mmask16 a, __mmask16 b);
KUNPCKDQ __mmask64 _mm512_kunpackd(__mmask64 a, __mmask64 b);
KUNPCKWD __mmask32 _mm512_kunpackw(__mmask32 a, __mmask32 b);
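
A common use (a sketch; the function name is illustrative) is fusing two 8-lane compare results into one 16-lane
mask; note that the second source lands in the low byte:

#include <immintrin.h> /* AVX512F: _mm512_kunpackb */

__mmask16 fuse_masks(__mmask8 hi, __mmask8 lo)
{
    /* DEST[7:0] = lo (SRC2), DEST[15:8] = hi (SRC1). */
    return _mm512_kunpackb((__mmask16)hi, (__mmask16)lo);
}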

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KXNORW/KXNORB/KXNORQ/KXNORD—Bitwise Logical XNOR Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 46 /r RVR V/V AVX512F Bitwise XNOR 16-bit masks k2 and k3 and place result in k1.
KXNORW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 46 /r RVR V/V AVX512DQ Bitwise XNOR 8-bit masks k2 and k3 and place result in k1.
KXNORB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 46 /r RVR V/V AVX512BW Bitwise XNOR 64-bit masks k2 and k3 and place result in k1.
KXNORQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 46 /r RVR V/V AVX512BW Bitwise XNOR 32-bit masks k2 and k3 and place result in k1.
KXNORD k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise XNOR between the vector mask k2 and the vector mask k3, and writes the result into vector
mask k1 (three-operand form).

Operation
KXNORW
DEST[15:0] := NOT (SRC1[15:0] BITWISE XOR SRC2[15:0])
DEST[MAX_KL-1:16] := 0

KXNORB
DEST[7:0] := NOT (SRC1[7:0] BITWISE XOR SRC2[7:0])
DEST[MAX_KL-1:8] := 0

KXNORQ
DEST[63:0] := NOT (SRC1[63:0] BITWISE XOR SRC2[63:0])
DEST[MAX_KL-1:64] := 0

KXNORD
DEST[31:0] := NOT (SRC1[31:0] BITWISE XOR SRC2[31:0])
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KXNORW __mmask16 _mm512_kxnor(__mmask16 a, __mmask16 b);
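
Because x XNOR x is all 1s, KXNOR with identical sources is the idiomatic way to materialize an all-ones mask
(e.g., before a first gather iteration). A sketch with an illustrative name:

#include <immintrin.h> /* AVX512F: _mm512_kxnor */

__mmask16 all_ones_mask(__mmask16 k)
{
    /* k XNOR k = 0xFFFF for any k; typically assembles to KXNORW. */
    return _mm512_kxnor(k, k);
}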

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

KXORW/KXORB/KXORQ/KXORD—Bitwise Logical XOR Masks
Opcode/ Op/En 64/32 bit CPUID Description
Instruction Mode Feature Flag
Support
VEX.L1.0F.W0 47 /r RVR V/V AVX512F Bitwise XOR 16-bit masks k2 and k3 and place result in k1.
KXORW k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W0 47 /r RVR V/V AVX512DQ Bitwise XOR 8-bit masks k2 and k3 and place result in k1.
KXORB k1, k2, k3 OR AVX10.1
VEX.L1.0F.W1 47 /r RVR V/V AVX512BW Bitwise XOR 64-bit masks k2 and k3 and place result in k1.
KXORQ k1, k2, k3 OR AVX10.1
VEX.L1.66.0F.W1 47 /r RVR V/V AVX512BW Bitwise XOR 32-bit masks k2 and k3 and place result in k1.
KXORD k1, k2, k3 OR AVX10.1

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3
RVR ModRM:reg (w) VEX.1vvv (r) ModRM:r/m (r, ModRM:[7:6] must be 11b)

Description
Performs a bitwise XOR between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1 (three-operand form).

Operation
KXORW
DEST[15:0] := SRC1[15:0] BITWISE XOR SRC2[15:0]
DEST[MAX_KL-1:16] := 0

KXORB
DEST[7:0] := SRC1[7:0] BITWISE XOR SRC2[7:0]
DEST[MAX_KL-1:8] := 0

KXORQ
DEST[63:0] := SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[MAX_KL-1:64] := 0

KXORD
DEST[31:0] := SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[MAX_KL-1:32] := 0

Intel C/C++ Compiler Intrinsic Equivalent


KXORW __mmask16 _mm512_kxor(__mmask16 a, __mmask16 b);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”

7. Updates to Chapter 4, Volume 2B
Change bars and violet text show changes to Chapter 4 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2B: Instruction Set Reference, M-U.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the PREFETCHh instruction to add details for using the temporal code hints.
• Added the RDMSRLIST instruction.
• Updated the RDPMC instruction to add details on RDPMC Metrics Clear.
• Added the TDPFP16PS instruction.
• Updated the UIRET instruction to add the pseudocode describing software control of the value of the user-
interrupt flag (UIF) established by UIRET.
• Added Intel® AVX10.1 information to the following instructions:
— MAXPD
— MAXPS
— MAXSD
— MAXSS
— MINPD
— MINPS
— MINSD
— MINSS
— MOVAPD
— MOVAPS
— MOVD/MOVQ
— MOVDDUP
— MOVDQA, VMOVDQA32/64
— MOVDQU, VMOVDQU8/16/32/64
— MOVHLPS
— MOVHPD
— MOVHPS
— MOVLHPS
— MOVLPD
— MOVLPS
— MOVNTDQ
— MOVNTDQA
— MOVNTPD
— MOVNTPS
— MOVQ
— MOVSD
— MOVSHDUP
— MOVSLDUP
— MOVSS
— MOVUPD
— MOVUPS
— MULPD
— MULPS
— MULSD
— MULSS
— ORPD
— ORPS
— PABSB/PABSW/PABSD/PABSQ
— PACKSSWB/PACKSSDW
— PACKUSDW
— PACKUSWB
— PADDB/PADDW/PADDD/PADDQ
— PADDSB/PADDSW
— PADDUSB/PADDUSW
— PALIGNR
— PAND
— PANDN
— PAVGB/PAVGW
— PCLMULQDQ
— PCMPEQB/PCMPEQW/PCMPEQD
— PCMPEQQ
— PCMPGTB/PCMPGTW/PCMPGTD
— PCMPGTQ
— PEXTRB/PEXTRD/PEXTRQ
— PEXTRW
— PINSRB/PINSRD/PINSRQ
— PINSRW
— PMADDUBSW
— PMADDWD
— PMAXSB/PMAXSW/PMAXSD/PMAXSQ
— PMAXUB/PMAXUW
— PMAXUD/PMAXUQ
— PMINSB/PMINSW
— PMINSD/PMINSQ
— PMINUB/PMINUW
— PMINUD/PMINUQ
— PMOVSX
— PMOVZX
— PMULDQ
— PMULHRSW
— PMULHUW
— PMULHW
— PMULLD/PMULLQ
— PMULLW
— PMULUDQ
— POR
— PSADBW
— PSHUFB
— PSHUFD
— PSHUFHW
— PSHUFLW
— PSLLDQ
— PSLLW/PSLLD/PSLLQ
— PSRAW/PSRAD/PSRAQ
— PSRLDQ
— PSRLW/PSRLD/PSRLQ
— PSUBB/PSUBW/PSUBD
— PSUBQ
— PSUBSB/PSUBSW
— PSUBUSB/PSUBUSW
— PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ/PUNPCKHQDQ
— PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ
— PXOR
— SHUFPD
— SHUFPS
— SQRTPD
— SQRTPS
— SQRTSD
— SQRTSS
— SUBPD
— SUBPS
— SUBSD
— SUBSS
— UCOMISD
— UCOMISS
— UNPCKHPD
— UNPCKHPS
— UNPCKLPD
— UNPCKLPS

MAXPD—Maximum of Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 5F /r A V/V SSE2 Return the maximum double precision floating-
MAXPD xmm1, xmm2/m128 point values between xmm1 and xmm2/m128.
VEX.128.66.0F.WIG 5F /r B V/V AVX Return the maximum double precision floating-
VMAXPD xmm1, xmm2, xmm3/m128 point values between xmm2 and xmm3/m128.
VEX.256.66.0F.WIG 5F /r B V/V AVX Return the maximum packed double precision
VMAXPD ymm1, ymm2, ymm3/m256 floating-point values between ymm2 and
ymm3/m256.
EVEX.128.66.0F.W1 5F /r C V/V (AVX512VL AND Return the maximum packed double precision
VMAXPD xmm1 {k1}{z}, xmm2, AVX512F) OR floating-point values between xmm2 and
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst and store result in xmm1
subject to writemask k1.
EVEX.256.66.0F.W1 5F /r C V/V (AVX512VL AND Return the maximum packed double precision
VMAXPD ymm1 {k1}{z}, ymm2, AVX512F) OR floating-point values between ymm2 and
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst and store result in ymm1
subject to writemask k1.
EVEX.512.66.0F.W1 5F /r C V/V AVX512F Return the maximum packed double precision
VMAXPD zmm1 {k1}{z}, zmm2, OR AVX10.11 floating-point values between zmm2 and
zmm3/m512/m64bcst{sae} zmm3/m512/m64bcst and store result in zmm1
subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed double precision floating-point values in the first source operand and the
second source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
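
For example, a NaN-propagating variant (one that returns the NaN from either source) might be sketched as
follows with SSE2 intrinsics; the function name is illustrative:

#include <emmintrin.h> /* SSE2 */

__m128d nan_propagating_max_pd(__m128d a, __m128d b)
{
    __m128d gt = _mm_cmpgt_pd(a, b);          /* all-1s where a > b; 0 on NaN */
    __m128d mx = _mm_or_pd(_mm_and_pd(gt, a), /* MAXPD-like: b on NaN/equal */
                           _mm_andn_pd(gt, b));
    __m128d an = _mm_cmpunord_pd(a, a);       /* lanes where a is NaN */
    return _mm_or_pd(_mm_and_pd(an, a),       /* force a's NaN through; b's */
                     _mm_andn_pd(an, mx));    /* NaN already falls into mx */
}
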
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMAXPD (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := MAX(SRC1[i+63:i], SRC2[63:0])
ELSE
DEST[i+63:i] := MAX(SRC1[i+63:i], SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMAXPD (VEX.256 Encoded Version)


DEST[63:0] := MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] := MAX(SRC1[127:64], SRC2[127:64])
DEST[191:128] := MAX(SRC1[191:128], SRC2[191:128])
DEST[255:192] := MAX(SRC1[255:192], SRC2[255:192])
DEST[MAXVL-1:256] := 0

VMAXPD (VEX.128 Encoded Version)


DEST[63:0] := MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] := MAX(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] := 0

MAXPD (128-bit Legacy SSE Version)
DEST[63:0] := MAX(DEST[63:0], SRC[63:0])
DEST[127:64] := MAX(DEST[127:64], SRC[127:64])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMAXPD __m512d _mm512_max_pd( __m512d a, __m512d b);
VMAXPD __m512d _mm512_mask_max_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMAXPD __m512d _mm512_maskz_max_pd( __mmask8 k, __m512d a, __m512d b);
VMAXPD __m512d _mm512_max_round_pd( __m512d a, __m512d b, int);
VMAXPD __m512d _mm512_mask_max_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMAXPD __m512d _mm512_maskz_max_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VMAXPD __m256d _mm256_mask_max_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VMAXPD __m256d _mm256_maskz_max_pd( __mmask8 k, __m256d a, __m256d b);
VMAXPD __m128d _mm_mask_max_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMAXPD __m128d _mm_maskz_max_pd( __mmask8 k, __m128d a, __m128d b);
VMAXPD __m256d _mm256_max_pd (__m256d a, __m256d b);
MAXPD __m128d _mm_max_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”

MAXPS—Maximum of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 5F /r A V/V SSE Return the maximum single precision floating-point values
MAXPS xmm1, xmm2/m128 between xmm1 and xmm2/mem.
VEX.128.0F.WIG 5F /r B V/V AVX Return the maximum single precision floating-point values
VMAXPS xmm1, xmm2, between xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F.WIG 5F /r B V/V AVX Return the maximum single precision floating-point values
VMAXPS ymm1, ymm2, between ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 5F /r C V/V (AVX512VL AND Return the maximum packed single precision floating-point
VMAXPS xmm1 {k1}{z}, xmm2, AVX512F) OR values between xmm2 and xmm3/m128/m32bcst and
xmm3/m128/m32bcst AVX10.11 store result in xmm1 subject to writemask k1.
EVEX.256.0F.W0 5F /r C V/V (AVX512VL AND Return the maximum packed single precision floating-point
VMAXPS ymm1 {k1}{z}, ymm2, AVX512F) OR values between ymm2 and ymm3/m256/m32bcst and
ymm3/m256/m32bcst AVX10.11 store result in ymm1 subject to writemask k1.
EVEX.512.0F.W0 5F /r C V/V AVX512F Return the maximum packed single precision floating-point
VMAXPS zmm1 {k1}{z}, zmm2, OR AVX10.11 values between zmm2 and zmm3/m512/m32bcst and
zmm3/m512/m32bcst{sae} store result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed single precision floating-point values in the first source operand and the
second source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMAXPS (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := MAX(SRC1[i+31:i], SRC2[31:0])
ELSE
DEST[i+31:i] := MAX(SRC1[i+31:i], SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMAXPS (VEX.256 Encoded Version)


DEST[31:0] := MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] := MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] := MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] := MAX(SRC1[127:96], SRC2[127:96])
DEST[159:128] := MAX(SRC1[159:128], SRC2[159:128])
DEST[191:160] := MAX(SRC1[191:160], SRC2[191:160])
DEST[223:192] := MAX(SRC1[223:192], SRC2[223:192])
DEST[255:224] := MAX(SRC1[255:224], SRC2[255:224])
DEST[MAXVL-1:256] := 0

VMAXPS (VEX.128 Encoded Version)
DEST[31:0] := MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] := MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] := MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] := MAX(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] := 0

MAXPS (128-bit Legacy SSE Version)


DEST[31:0] := MAX(DEST[31:0], SRC[31:0])
DEST[63:32] := MAX(DEST[63:32], SRC[63:32])
DEST[95:64] := MAX(DEST[95:64], SRC[95:64])
DEST[127:96] := MAX(DEST[127:96], SRC[127:96])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMAXPS __m512 _mm512_max_ps( __m512 a, __m512 b);
VMAXPS __m512 _mm512_mask_max_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VMAXPS __m512 _mm512_maskz_max_ps( __mmask16 k, __m512 a, __m512 b);
VMAXPS __m512 _mm512_max_round_ps( __m512 a, __m512 b, int);
VMAXPS __m512 _mm512_mask_max_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMAXPS __m512 _mm512_maskz_max_round_ps( __mmask16 k, __m512 a, __m512 b, int);
VMAXPS __m256 _mm256_mask_max_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VMAXPS __m256 _mm256_maskz_max_ps( __mmask8 k, __m256 a, __m256 b);
VMAXPS __m128 _mm_mask_max_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMAXPS __m128 _mm_maskz_max_ps( __mmask8 k, __m128 a, __m128 b);
VMAXPS __m256 _mm256_max_ps (__m256 a, __m256 b);
MAXPS __m128 _mm_max_ps (__m128 a, __m128 b);
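
As a usage sketch of the writemask forms (illustrative name, assuming <immintrin.h>), zeroing-masking clears
the inactive lanes exactly as in the VMAXPS pseudocode above:

#include <immintrin.h> /* AVX512F: _mm512_maskz_max_ps */

__m512 demo_masked_max(__m512 a, __m512 b, __mmask16 k)
{
    /* Lanes with k[j] = 0 are zeroed (EVEX zeroing-masking). */
    return _mm512_maskz_max_ps(k, a, b);
}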

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”

MAXSD—Return Maximum Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 5F /r A V/V SSE2 Return the maximum scalar double precision
MAXSD xmm1, xmm2/m64 floating-point value between xmm2/m64 and xmm1.
VEX.LIG.F2.0F.WIG 5F /r B V/V AVX Return the maximum scalar double precision
VMAXSD xmm1, xmm2, xmm3/m64 floating-point value between xmm3/m64 and xmm2.
EVEX.LLIG.F2.0F.W1 5F /r C V/V AVX512F Return the maximum scalar double precision
VMAXSD xmm1 {k1}{z}, xmm2, OR AVX10.11 floating-point value between xmm3/m64 and xmm2.
xmm3/m64{sae}

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Compares the low double precision floating-point values in the first source operand and the second source
operand, and returns the maximum value to the low quadword of the destination operand. The second source
operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM
registers. When the second source operand is a memory operand, only 64 bits are accessed.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source
operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as a
comparison followed by AND, ANDN, and OR.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMAXSD (EVEX Encoded Version)


IF k1[0] or *no writemask*
THEN DEST[63:0] := MAX(SRC1[63:0], SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMAXSD (VEX.128 Encoded Version)


DEST[63:0] := MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

MAXSD (128-bit Legacy SSE Version)


DEST[63:0] := MAX(DEST[63:0], SRC[63:0])
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMAXSD __m128d _mm_max_round_sd( __m128d a, __m128d b, int);
VMAXSD __m128d _mm_mask_max_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMAXSD __m128d _mm_maskz_max_round_sd( __mmask8 k, __m128d a, __m128d b, int);
MAXSD __m128d _mm_max_sd(__m128d a, __m128d b)
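
The asymmetric NaN handling described above means MAXSD is not commutative. A small worked example
(illustrative name; assuming NAN from <math.h>):

#include <emmintrin.h> /* SSE2 */
#include <math.h>      /* NAN */

void demo_maxsd_nan(void)
{
    __m128d x = _mm_set_sd(NAN);
    __m128d y = _mm_set_sd(1.0);
    __m128d r1 = _mm_max_sd(x, y); /* second source is 1.0 -> result 1.0 */
    __m128d r2 = _mm_max_sd(y, x); /* second source is NaN -> result NaN */
    (void)r1; (void)r2;
}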

SIMD Floating-Point Exceptions


Invalid (Including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

MAXSS—Return Maximum Scalar Single Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 5F /r A V/V SSE Return the maximum scalar single precision floating-point
MAXSS xmm1, xmm2/m32 value between xmm2/m32 and xmm1.
VEX.LIG.F3.0F.WIG 5F /r B V/V AVX Return the maximum scalar single precision floating-point
VMAXSS xmm1, xmm2, xmm3/m32 value between xmm3/m32 and xmm2.
EVEX.LLIG.F3.0F.W0 5F /r C V/V AVX512F Return the maximum scalar single precision floating-point
VMAXSS xmm1 {k1}{z}, xmm2, OR AVX10.11 value between xmm3/m32 and xmm2.
xmm3/m32{sae}

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Compares the low single precision floating-point values in the first source operand and the second source operand,
and returns the maximum value to the low doubleword of the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN from either source
operand be returned, the action of MAXSS can be emulated using a sequence of instructions, such as a comparison
followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by VEX.vvvv. Bits
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.

Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMAXSS (EVEX Encoded Version)


IF k1[0] or *no writemask*
THEN DEST[31:0] := MAX(SRC1[31:0], SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VMAXSS (VEX.128 Encoded Version)


DEST[31:0] := MAX(SRC1[31:0], SRC2[31:0])
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

MAXSS (128-bit Legacy SSE Version)


DEST[31:0] := MAX(DEST[31:0], SRC[31:0])
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMAXSS __m128 _mm_max_round_ss( __m128 a, __m128 b, int);
VMAXSS __m128 _mm_mask_max_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMAXSS __m128 _mm_maskz_max_round_ss( __mmask8 k, __m128 a, __m128 b, int);
MAXSS __m128 _mm_max_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions


Invalid (Including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

MINPD—Minimum of Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 5D /r A V/V SSE2 Return the minimum double precision floating-point
MINPD xmm1, xmm2/m128 values between xmm1 and xmm2/mem.
VEX.128.66.0F.WIG 5D /r B V/V AVX Return the minimum double precision floating-point
VMINPD xmm1, xmm2, values between xmm2 and xmm3/mem.
xmm3/m128
VEX.256.66.0F.WIG 5D /r B V/V AVX Return the minimum packed double precision floating-
VMINPD ymm1, ymm2, point values between ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.66.0F.W1 5D /r C V/V (AVX512VL AND Return the minimum packed double precision floating-
VMINPD xmm1 {k1}{z}, xmm2, AVX512F) OR point values between xmm2 and xmm3/m128/m64bcst
xmm3/m128/m64bcst AVX10.11 and store result in xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 5D /r C V/V (AVX512VL AND Return the minimum packed double precision floating-
VMINPD ymm1 {k1}{z}, ymm2, AVX512F) OR point values between ymm2 and ymm3/m256/m64bcst
ymm3/m256/m64bcst AVX10.11 and store result in ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 5D /r C V/V AVX512F Return the minimum packed double precision floating-
VMINPD zmm1 {k1}{z}, zmm2, OR AVX10.11 point values between zmm2 and zmm3/m512/m64bcst
zmm3/m512/m64bcst{sae} and store result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed double precision floating-point values in the first source operand and the
second source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMINPD (EVEX Encoded Version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := MIN(SRC1[i+63:i], SRC2[63:0])
ELSE
DEST[i+63:i] := MIN(SRC1[i+63:i], SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMINPD (VEX.256 Encoded Version)


DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] := MIN(SRC1[127:64], SRC2[127:64])
DEST[191:128] := MIN(SRC1[191:128], SRC2[191:128])
DEST[255:192] := MIN(SRC1[255:192], SRC2[255:192])

VMINPD (VEX.128 Encoded Version)


DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] := MIN(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] := 0

MINPD (128-bit Legacy SSE Version)


DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] := MIN(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent
VMINPD __m512d _mm512_min_pd( __m512d a, __m512d b);
VMINPD __m512d _mm512_mask_min_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMINPD __m512d _mm512_maskz_min_pd( __mmask8 k, __m512d a, __m512d b);
VMINPD __m512d _mm512_min_round_pd( __m512d a, __m512d b, int);
VMINPD __m512d _mm512_mask_min_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMINPD __m512d _mm512_maskz_min_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VMINPD __m256d _mm256_mask_min_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VMINPD __m256d _mm256_maskz_min_pd( __mmask8 k, __m256d a, __m256d b);
VMINPD __m128d _mm_mask_min_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMINPD __m128d _mm_maskz_min_pd( __mmask8 k, __m128d a, __m128d b);
VMINPD __m256d _mm256_min_pd (__m256d a, __m256d b);
MINPD __m128d _mm_min_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



MINPS—Minimum of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 5D /r A V/V SSE Return the minimum single precision floating-point values
MINPS xmm1, xmm2/m128 between xmm1 and xmm2/mem.
VEX.128.0F.WIG 5D /r B V/V AVX Return the minimum single precision floating-point values
VMINPS xmm1, xmm2, between xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F.WIG 5D /r B V/V AVX Return the minimum single precision floating-point
VMINPS ymm1, ymm2, values between ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 5D /r C V/V (AVX512VL AND Return the minimum packed single precision floating-point
VMINPS xmm1 {k1}{z}, xmm2, AVX512F) OR values between xmm2 and xmm3/m128/m32bcst and
xmm3/m128/m32bcst AVX10.11 store result in xmm1 subject to writemask k1.
EVEX.256.0F.W0 5D /r C V/V (AVX512VL AND Return the minimum packed single precision floating-point
VMINPS ymm1 {k1}{z}, ymm2, AVX512F) OR values between ymm2 and ymm3/m256/m32bcst and
ymm3/m256/m32bcst AVX10.11 store result in ymm1 subject to writemask k1.
EVEX.512.0F.W0 5D /r C V/V AVX512F Return the minimum packed single precision floating-point
VMINPS zmm1 {k1}{z}, zmm2, OR AVX10.11 values between zmm2 and zmm3/m512/m32bcst and
zmm3/m512/m32bcst{sae} store result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed single precision floating-point values in the first source operand and the
second source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MINPS can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
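
Because the second source wins in the zero and NaN cases, operand order matters. One hedged sketch of an order-insensitive variant (a bit-level combining trick, not architected MINPS behavior) evaluates both orders and ORs the results, so MIN of +0.0 and -0.0 in either order yields -0.0, and a NaN from either input survives, though with a merged payload:

#include <xmmintrin.h>

/* Order-insensitive packed single minimum built on MINPS. When no lane
   holds a NaN or a zero, both orders agree and the OR is an identity;
   otherwise the OR keeps the NaN encoding (exponent all ones) or the
   -0.0 sign bit from whichever order produced it. */
static inline __m128 min_ps_symmetric(__m128 a, __m128 b)
{
    __m128 m1 = _mm_min_ps(a, b);  /* a NaN in a is lost here */
    __m128 m2 = _mm_min_ps(b, a);  /* ...but survives here    */
    return _mm_or_ps(m1, m2);
}
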
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.



VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMINPS (EVEX Encoded Version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := MIN(SRC1[i+31:i], SRC2[31:0])
ELSE
DEST[i+31:i] := MIN(SRC1[i+31:i], SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMINPS (VEX.256 Encoded Version)


DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] := MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] := MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] := MIN(SRC1[127:96], SRC2[127:96])
DEST[159:128] := MIN(SRC1[159:128], SRC2[159:128])
DEST[191:160] := MIN(SRC1[191:160], SRC2[191:160])
DEST[223:192] := MIN(SRC1[223:192], SRC2[223:192])
DEST[255:224] := MIN(SRC1[255:224], SRC2[255:224])



VMINPS (VEX.128 Encoded Version)
DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] := MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] := MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] := MIN(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] := 0

MINPS (128-bit Legacy SSE Version)


DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] := MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] := MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] := MIN(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMINPS __m512 _mm512_min_ps( __m512 a, __m512 b);
VMINPS __m512 _mm512_mask_min_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VMINPS __m512 _mm512_maskz_min_ps( __mmask16 k, __m512 a, __m512 b);
VMINPS __m512 _mm512_min_round_ps( __m512 a, __m512 b, int);
VMINPS __m512 _mm512_mask_min_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMINPS __m512 _mm512_maskz_min_round_ps( __mmask16 k, __m512 a, __m512 b, int);
VMINPS __m256 _mm256_mask_min_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VMINPS __m256 _mm256_maskz_min_ps( __mmask8 k, __m256 a, __m256 b);
VMINPS __m128 _mm_mask_min_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMINPS __m128 _mm_maskz_min_ps( __mmask8 k, __m128 a, __m128 b);
VMINPS __m256 _mm256_min_ps (__m256 a, __m256 b);
MINPS __m128 _mm_min_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



MINSD—Return Minimum Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 5D /r A V/V SSE2 Return the minimum scalar double precision floating-
MINSD xmm1, xmm2/m64 point value between xmm2/m64 and xmm1.
VEX.LIG.F2.0F.WIG 5D /r B V/V AVX Return the minimum scalar double precision floating-
VMINSD xmm1, xmm2, xmm3/m64 point value between xmm3/m64 and xmm2.
EVEX.LLIG.F2.0F.W1 5D /r C V/V AVX512F Return the minimum scalar double precision floating-
VMINSD xmm1 {k1}{z}, xmm2, OR AVX10.11 point value between xmm3/m64 and xmm2.
xmm3/m64{sae}

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Compares the low double precision floating-point values in the first source operand and the second source
operand, and returns the minimum value to the low quadword of the destination operand. When the source
operand is a memory operand, only 64 bits are accessed.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand
(from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of
instructions, such as a comparison followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMINSD (EVEX Encoded Version)


IF k1[0] or *no writemask*
THEN DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMINSD (VEX.128 Encoded Version)


DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

MINSD (128-bit Legacy SSE Version)


DEST[63:0] := MIN(SRC1[63:0], SRC2[63:0])
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMINSD __m128d _mm_min_round_sd(__m128d a, __m128d b, int);
VMINSD __m128d _mm_mask_min_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMINSD __m128d _mm_maskz_min_round_sd( __mmask8 k, __m128d a, __m128d b, int);
MINSD __m128d _mm_min_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”



MINSS—Return Minimum Scalar Single Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 5D /r A V/V SSE Return the minimum scalar single precision floating-point
MINSS xmm1,xmm2/m32 value between xmm2/m32 and xmm1.
VEX.LIG.F3.0F.WIG 5D /r B V/V AVX Return the minimum scalar single precision floating-point
VMINSS xmm1,xmm2, xmm3/m32 value between xmm3/m32 and xmm2.
EVEX.LLIG.F3.0F.W0 5D /r C V/V AVX512F Return the minimum scalar single precision floating-point
VMINSS xmm1 {k1}{z}, xmm2, OR AVX10.11 value between xmm3/m32 and xmm2.
xmm3/m32{sae}

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Compares the low single precision floating-point values in the first source operand and the second source operand
and returns the minimum value to the low doubleword of the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN
version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN in either source
operand be returned, the action of MINSS can be emulated using a sequence of instructions, such as a comparison
followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corre-
sponding destination register remain unchanged.
VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by (E)VEX.vvvv. Bits
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}

VMINSS (EVEX Encoded Version)


IF k1[0] or *no writemask*
THEN DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VMINSS (VEX.128 Encoded Version)


DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

MINSS (128-bit Legacy SSE Version)


DEST[31:0] := MIN(SRC1[31:0], SRC2[31:0])
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMINSS __m128 _mm_min_round_ss( __m128 a, __m128 b, int);
VMINSS __m128 _mm_mask_min_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMINSS __m128 _mm_maskz_min_round_ss( __mmask8 k, __m128 a, __m128 b, int);
MINSS __m128 _mm_min_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions


Invalid (including QNaN Source Operand), Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



MOVAPD—Move Aligned Packed Double Precision Floating-Point Values
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
66 0F 28 /r A V/V SSE2 Move aligned packed double precision floating-
MOVAPD xmm1, xmm2/m128 point values from xmm2/mem to xmm1.
66 0F 29 /r B V/V SSE2 Move aligned packed double precision floating-
MOVAPD xmm2/m128, xmm1 point values from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 28 /r A V/V AVX Move aligned packed double precision floating-
VMOVAPD xmm1, xmm2/m128 point values from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 29 /r B V/V AVX Move aligned packed double precision floating-
VMOVAPD xmm2/m128, xmm1 point values from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 28 /r A V/V AVX Move aligned packed double precision floating-
VMOVAPD ymm1, ymm2/m256 point values from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 29 /r B V/V AVX Move aligned packed double precision floating-
VMOVAPD ymm2/m256, ymm1 point values from ymm1 to ymm2/mem.
EVEX.128.66.0F.W1 28 /r C V/V (AVX512VL AND Move aligned packed double precision floating-
VMOVAPD xmm1 {k1}{z}, xmm2/m128 AVX512F) OR point values from xmm2/m128 to xmm1 using
AVX10.11 writemask k1.
EVEX.256.66.0F.W1 28 /r C V/V (AVX512VL AND Move aligned packed double precision floating-
VMOVAPD ymm1 {k1}{z}, ymm2/m256 AVX512F) OR point values from ymm2/m256 to ymm1 using
AVX10.11 writemask k1.
EVEX.512.66.0F.W1 28 /r C V/V AVX512F Move aligned packed double precision floating-
VMOVAPD zmm1 {k1}{z}, zmm2/m512 OR AVX10.11 point values from zmm2/m512 to zmm1 using
writemask k1.
EVEX.128.66.0F.W1 29 /r D V/V (AVX512VL AND Move aligned packed double precision floating-
VMOVAPD xmm2/m128 {k1}{z}, xmm1 AVX512F) OR point values from xmm1 to xmm2/m128 using
AVX10.11 writemask k1.
EVEX.256.66.0F.W1 29 /r D V/V (AVX512VL AND Move aligned packed double precision floating-
VMOVAPD ymm2/m256 {k1}{z}, ymm1 AVX512F) OR point values from ymm1 to ymm2/m256 using
AVX10.11 writemask k1.
EVEX.512.66.0F.W1 29 /r D V/V AVX512F Move aligned packed double precision floating-
VMOVAPD zmm2/m512 {k1}{z}, zmm1 OR AVX10.11 point values from zmm1 to zmm2/m512 using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Full Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Moves 2, 4 or 8 double precision floating-point values from the source operand (second operand) to the destination
operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-
bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or
512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit
versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection
exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the
memory operand. To move double precision floating-point values to and from unaligned memory locations, use the
VMOVUPD instruction.
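
For example, the following minimal C sketch (assuming C11 aligned_alloc, and that the compiler lowers the aligned load/store intrinsics to VMOVAPD, which is typical but not guaranteed) keeps the memory operand on a 32-byte boundary:

#include <immintrin.h>
#include <stdlib.h>

int main(void)
{
    /* The 256-bit forms require 32-byte alignment, or #GP is raised. */
    double *buf = aligned_alloc(32, 4 * sizeof(double));
    if (buf == NULL)
        return 1;
    _mm256_store_pd(buf, _mm256_set1_pd(1.5)); /* aligned store         */
    __m256d v = _mm256_load_pd(buf);           /* aligned load          */
    _mm256_storeu_pd(buf, v);                  /* VMOVUPD: no alignment */
    free(buf);
    return 0;
}
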
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX.512 encoded version:
Moves 512 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float64
memory location, to store the contents of a ZMM register into a 512-bit float64 memory location, or to move data
between two ZMM registers. When the source or destination operand is a memory operand, the operand must be
aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move double precision
floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.
VEX.256 and EVEX.256 encoded versions:
Moves 256 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte
boundary or a general-protection exception (#GP) will be generated. To move double precision floating-point
values to and from unaligned memory locations, use the VMOVUPD instruction.
128-bit versions:
Moves 128 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a
16-byte boundary or a general-protection exception (#GP) will be generated. To move double precision floating-
point values to and from unaligned memory locations, use the VMOVUPD instruction.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain
unchanged.
(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.

Operation
VMOVAPD (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VMOVAPD (EVEX Encoded Versions, Store-Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVAPD (EVEX Encoded Versions, Load-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVAPD (VEX.256 Encoded Version, Load - and Register Copy)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVAPD (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVAPD (VEX.128 Encoded Version, Load - and Register Copy)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

MOVAPD (128-bit Load- and Register-Copy- Form Legacy SSE Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVAPD (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]



Intel C/C++ Compiler Intrinsic Equivalent
VMOVAPD __m512d _mm512_load_pd( void * m);
VMOVAPD __m512d _mm512_mask_load_pd(__m512d s, __mmask8 k, void * m);
VMOVAPD __m512d _mm512_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm512_store_pd( void * d, __m512d a);
VMOVAPD void _mm512_mask_store_pd( void * d, __mmask8 k, __m512d a);
VMOVAPD __m256d _mm256_mask_load_pd(__m256d s, __mmask8 k, void * m);
VMOVAPD __m256d _mm256_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm256_mask_store_pd( void * d, __mmask8 k, __m256d a);
VMOVAPD __m128d _mm_mask_load_pd(__m128d s, __mmask8 k, void * m);
VMOVAPD __m128d _mm_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm_mask_store_pd( void * d, __mmask8 k, __m128d a);
MOVAPD __m256d _mm256_load_pd (double * p);
MOVAPD void _mm256_store_pd(double * p, __m256d a);
MOVAPD __m128d _mm_load_pd (double * p);
MOVAPD void _mm_store_pd(double * p, __m128d a);
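
As a usage sketch of the masked load forms above (assuming AVX512F with AVX512VL, or AVX10, and a 32-byte-aligned pointer p): mask 0x5 selects elements 0 and 2, the merging form keeps s in the deselected lanes, and the zeroing form writes 0.0 there.

#include <immintrin.h>

__m256d load_even_lanes(const double *p, __m256d s)
{
    __mmask8 k = 0x5;                              /* lanes 0 and 2    */
    __m256d merged = _mm256_mask_load_pd(s, k, p); /* {k1} merging     */
    __m256d zeroed = _mm256_maskz_load_pd(k, p);   /* {k1}{z} zeroing  */
    return _mm256_add_pd(merged, zeroed);
}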

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MOVAPS—Move Aligned Packed Single Precision Floating-Point Values
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
NP 0F 28 /r A V/V SSE Move aligned packed single precision floating-point
MOVAPS xmm1, xmm2/m128 values from xmm2/mem to xmm1.
NP 0F 29 /r B V/V SSE Move aligned packed single precision floating-point
MOVAPS xmm2/m128, xmm1 values from xmm1 to xmm2/mem.
VEX.128.0F.WIG 28 /r A V/V AVX Move aligned packed single precision floating-point
VMOVAPS xmm1, xmm2/m128 values from xmm2/mem to xmm1.
VEX.128.0F.WIG 29 /r B V/V AVX Move aligned packed single precision floating-point
VMOVAPS xmm2/m128, xmm1 values from xmm1 to xmm2/mem.
VEX.256.0F.WIG 28 /r A V/V AVX Move aligned packed single precision floating-point
VMOVAPS ymm1, ymm2/m256 values from ymm2/mem to ymm1.
VEX.256.0F.WIG 29 /r B V/V AVX Move aligned packed single precision floating-point
VMOVAPS ymm2/m256, ymm1 values from ymm1 to ymm2/mem.
EVEX.128.0F.W0 28 /r C V/V (AVX512VL AND Move aligned packed single precision floating-point
VMOVAPS xmm1 {k1}{z}, xmm2/m128 AVX512F) OR values from xmm2/m128 to xmm1 using writemask
AVX10.11 k1.
EVEX.256.0F.W0 28 /r C V/V (AVX512VL AND Move aligned packed single precision floating-point
VMOVAPS ymm1 {k1}{z}, ymm2/m256 AVX512F) OR values from ymm2/m256 to ymm1 using writemask
AVX10.11 k1.
EVEX.512.0F.W0 28 /r C V/V AVX512F Move aligned packed single precision floating-point
VMOVAPS zmm1 {k1}{z}, zmm2/m512 OR AVX10.11 values from zmm2/m512 to zmm1 using writemask
k1.
EVEX.128.0F.W0 29 /r D V/V (AVX512VL AND Move aligned packed single precision floating-point
VMOVAPS xmm2/m128 {k1}{z}, xmm1 AVX512F) OR values from xmm1 to xmm2/m128 using writemask
AVX10.11 k1.
EVEX.256.0F.W0 29 /r D V/V (AVX512VL AND Move aligned packed single precision floating-point
VMOVAPS ymm2/m256 {k1}{z}, ymm1 AVX512F) OR values from ymm1 to ymm2/m256 using writemask
AVX10.11 k1.
EVEX.512.0F.W0 29 /r D V/V AVX512F Move aligned packed single precision floating-point
VMOVAPS zmm2/m512 {k1}{z}, zmm1 OR AVX10.11 values from zmm1 to zmm2/m512 using writemask
k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Full Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Moves 4, 8 or 16 single precision floating-point values from the source operand (second operand) to the destination
operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-
bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or
512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit
version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a general-
protection exception (#GP) will be generated. For EVEX.512 encoded versions, the operand must be aligned to the
size of the memory operand. To move single precision floating-point values to and from unaligned memory loca-
tions, use the VMOVUPS instruction.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX.512 encoded version:
Moves 512 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32
memory location, to store the contents of a ZMM register into a 512-bit float32 memory location, or to move data between
two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on
a 64-byte boundary or a general-protection exception (#GP) will be generated. To move single precision floating-
point values to and from unaligned memory locations, use the VMOVUPS instruction.
VEX.256 and EVEX.256 encoded version:
Moves 256 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte
boundary or a general-protection exception (#GP) will be generated.
128-bit versions:
Moves 128 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a
16-byte boundary or a general-protection exception (#GP) will be generated. To move single precision floating-
point values to and from unaligned memory locations, use the VMOVUPS instruction.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain
unchanged.
(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.

Operation
VMOVAPS (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VMOVAPS (EVEX Encoded Versions, Store Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVAPS (EVEX Encoded Versions, Load Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVAPS (VEX.256 Encoded Version, Load - and Register Copy)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVAPS (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVAPS (VEX.128 Encoded Version, Load - and Register Copy)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

MOVAPS (128-bit Load- and Register-Copy- Form Legacy SSE Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVAPS (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]



Intel C/C++ Compiler Intrinsic Equivalent
VMOVAPS __m512 _mm512_load_ps( void * m);
VMOVAPS __m512 _mm512_mask_load_ps(__m512 s, __mmask16 k, void * m);
VMOVAPS __m512 _mm512_maskz_load_ps( __mmask16 k, void * m);
VMOVAPS void _mm512_store_ps( void * d, __m512 a);
VMOVAPS void _mm512_mask_store_ps( void * d, __mmask16 k, __m512 a);
VMOVAPS __m256 _mm256_mask_load_ps(__m256 a, __mmask8 k, void * s);
VMOVAPS __m256 _mm256_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm256_mask_store_ps( void * d, __mmask8 k, __m256 a);
VMOVAPS __m128 _mm_mask_load_ps(__m128 a, __mmask8 k, void * s);
VMOVAPS __m128 _mm_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm_mask_store_ps( void * d, __mmask8 k, __m128 a);
MOVAPS __m256 _mm256_load_ps (float * p);
MOVAPS void _mm256_store_ps(float * p, __m256 a);
MOVAPS __m128 _mm_load_ps (float * p);
MOVAPS void _mm_store_ps(float * p, __m128 a);
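
A stack-based counterpart to the heap-allocated MOVAPD example earlier (a sketch assuming C11 alignas): 16-byte alignment satisfies the 128-bit aligned forms.

#include <immintrin.h>
#include <stdalign.h>

float sum_doubled(void)
{
    /* 16-byte alignment satisfies the 128-bit MOVAPS forms. */
    alignas(16) float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 v = _mm_load_ps(buf);   /* aligned load  */
    v = _mm_add_ps(v, v);
    _mm_store_ps(buf, v);          /* aligned store */
    return buf[0] + buf[1] + buf[2] + buf[3];
}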

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE in Table 2-18, “Type 1 Class Exception Conditions,”
additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”



MOVDDUP—Replicate Double Precision Floating-Point Values
Opcode/ Op / En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
F2 0F 12 /r A V/V SSE3 Move double precision floating-point value from
MOVDDUP xmm1, xmm2/m64 xmm2/m64 and duplicate into xmm1.
VEX.128.F2.0F.WIG 12 /r A V/V AVX Move double precision floating-point value from
VMOVDDUP xmm1, xmm2/m64 xmm2/m64 and duplicate into xmm1.
VEX.256.F2.0F.WIG 12 /r A V/V AVX Move even index double precision floating-point
VMOVDDUP ymm1, ymm2/m256 values from ymm2/mem and duplicate each element
into ymm1.
EVEX.128.F2.0F.W1 12 /r B V/V (AVX512VL Move double precision floating-point value from
VMOVDDUP xmm1 {k1}{z}, AND AVX512F) xmm2/m64 and duplicate each element into xmm1
xmm2/m64 OR AVX10.11 subject to writemask k1.
EVEX.256.F2.0F.W1 12 /r B V/V (AVX512VL Move even index double precision floating-point
VMOVDDUP ymm1 {k1}{z}, AND AVX512F) values from ymm2/m256 and duplicate each
ymm2/m256 OR AVX10.11 element into ymm1 subject to writemask k1.
EVEX.512.F2.0F.W1 12 /r B V/V AVX512F Move even index double precision floating-point
VMOVDDUP zmm1 {k1}{z}, OR AVX10.11 values from zmm2/m512 and duplicate each
zmm2/m512 element into zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B MOVDDUP ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
For 256-bit or higher versions: Duplicates each even-indexed double precision floating-point value from the source
operand (the second operand) into the adjacent pair of elements of the destination operand (the first operand).
For 128-bit versions: Duplicates the low double precision floating-point value from the source operand (the second
operand) into both elements of the destination operand (the first operand).
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register are unchanged. The
source operand is XMM register or a 64-bit memory location.
VEX.128 and EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. The source
operand is XMM register or a 64-bit memory location. The destination is updated conditionally under the writemask
for EVEX version.
VEX.256 and EVEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed. The source
operand is YMM register or a 256-bit memory location. The destination is updated conditionally under the write-
mask for EVEX version.
EVEX.512 encoded version: The destination is updated according to the writemask. The source operand is ZMM
register or a 512-bit memory location.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.



SRC:  X3 X2 X1 X0
DEST: X2 X2 X0 X0

Figure 4-2. VMOVDDUP Operation
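
A short C sketch of the pattern in Figure 4-2, built on the intrinsics listed at the end of this section:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d src = _mm256_set_pd(3.0, 2.0, 1.0, 0.0); /* X3 X2 X1 X0 */
    __m256d dup = _mm256_movedup_pd(src);            /* VMOVDDUP    */
    double out[4];
    _mm256_storeu_pd(out, dup);
    /* Prints 0 0 2 2: each even-indexed element fills its pair. */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}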

Operation
VMOVDDUP (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
TMP_SRC[63:0] := SRC[63:0]
TMP_SRC[127:64] := SRC[63:0]
IF VL >= 256
TMP_SRC[191:128] := SRC[191:128]
TMP_SRC[255:192] := SRC[191:128]
FI;
IF VL >= 512
TMP_SRC[319:256] := SRC[319:256]
TMP_SRC[383:320] := SRC[319:256]
TMP_SRC[447:384] := SRC[447:384]
TMP_SRC[511:448] := SRC[447:384]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDDUP (VEX.256 Encoded Version)


DEST[63:0] := SRC[63:0]
DEST[127:64] := SRC[63:0]
DEST[191:128] := SRC[191:128]
DEST[255:192] := SRC[191:128]
DEST[MAXVL-1:256] := 0

VMOVDDUP (VEX.128 Encoded Version)


DEST[63:0] := SRC[63:0]
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] := 0



MOVDDUP (128-bit Legacy SSE Version)
DEST[63:0] := SRC[63:0]
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMOVDDUP __m512d _mm512_movedup_pd( __m512d a);
VMOVDDUP __m512d _mm512_mask_movedup_pd(__m512d s, __mmask8 k, __m512d a);
VMOVDDUP __m512d _mm512_maskz_movedup_pd( __mmask8 k, __m512d a);
VMOVDDUP __m256d _mm256_mask_movedup_pd(__m256d s, __mmask8 k, __m256d a);
VMOVDDUP __m256d _mm256_maskz_movedup_pd( __mmask8 k, __m256d a);
VMOVDDUP __m128d _mm_mask_movedup_pd(__m128d s, __mmask8 k, __m128d a);
VMOVDDUP __m128d _mm_maskz_movedup_pd( __mmask8 k, __m128d a);
MOVDDUP __m256d _mm256_movedup_pd (__m256d a);
MOVDDUP __m128d _mm_movedup_pd (__m128d a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-54, “Type E5NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MOVD/MOVQ—Move Doubleword/Move Quadword
Opcode/ Op/ En 64/32-bit CPUID Description
Instruction Mode Feature Flag
NP 0F 6E /r A V/V MMX Move doubleword from r/m32 to mm.
MOVD mm, r/m32
NP REX.W + 0F 6E /r A V/N.E. MMX Move quadword from r/m64 to mm.
MOVQ mm, r/m64
NP 0F 7E /r B V/V MMX Move doubleword from mm to r/m32.
MOVD r/m32, mm
NP REX.W + 0F 7E /r B V/N.E. MMX Move quadword from mm to r/m64.
MOVQ r/m64, mm
66 0F 6E /r A V/V SSE2 Move doubleword from r/m32 to xmm.
MOVD xmm, r/m32
66 REX.W 0F 6E /r A V/N.E. SSE2 Move quadword from r/m64 to xmm.
MOVQ xmm, r/m64
66 0F 7E /r B V/V SSE2 Move doubleword from xmm register to r/m32.
MOVD r/m32, xmm
66 REX.W 0F 7E /r B V/N.E. SSE2 Move quadword from xmm register to r/m64.
MOVQ r/m64, xmm
VEX.128.66.0F.W0 6E /r A V/V AVX Move doubleword from r/m32 to xmm1.
VMOVD xmm1, r32/m32
VEX.128.66.0F.W1 6E /r A V/N.E.1 AVX Move quadword from r/m64 to xmm1.
VMOVQ xmm1, r64/m64
VEX.128.66.0F.W0 7E /r B V/V AVX Move doubleword from xmm1 register to r/m32.
VMOVD r32/m32, xmm1
VEX.128.66.0F.W1 7E /r B V/N.E.1 AVX Move quadword from xmm1 register to r/m64.
VMOVQ r64/m64, xmm1
EVEX.128.66.0F.W0 6E /r C V/V AVX512F Move doubleword from r/m32 to xmm1.
VMOVD xmm1, r32/m32 OR AVX10.12
EVEX.128.66.0F.W1 6E /r C V/N.E. AVX512F Move quadword from r/m64 to xmm1.
VMOVQ xmm1, r64/m64 OR AVX10.12
EVEX.128.66.0F.W0 7E /r D V/V AVX512F Move doubleword from xmm1 register to r/m32.
VMOVD r32/m32, xmm1 OR AVX10.12
EVEX.128.66.0F.W1 7E /r D V/N.E.1 AVX512F Move quadword from xmm1 register to r/m64.
VMOVQ r64/m64, xmm1 OR AVX10.12

NOTES:
1. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Copies a doubleword from the source operand (second operand) to the destination operand (first operand). The
source and destination operands can be general-purpose registers, MMX technology registers, XMM registers, or
32-bit memory locations. This instruction can be used to move a doubleword to and from the low doubleword of an
MMX technology register and a general-purpose register or a 32-bit memory location, or to and from the low
doubleword of an XMM register and a general-purpose register or a 32-bit memory location. The instruction cannot
be used to transfer data between MMX technology registers, between XMM registers, between general-purpose
registers, or between memory locations.
When the destination operand is an MMX technology register, the source operand is written to the low doubleword
of the register, and the register is zero-extended to 64 bits. When the destination operand is an XMM register, the
source operand is written to the low doubleword of the register, and the register is zero-extended to 128 bits.
In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.B prefix permits access to addi-
tional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
MOVD/Q with XMM destination:
Moves a dword/qword integer from the source operand and stores it in the low 32/64-bits of the destination XMM
register. The upper bits of the destination are zeroed. The source operand can be a 32/64-bit register or 32/64-bit
memory location.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged. Qword operation requires the use of REX.W=1.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires the
use of VEX.W=1.
EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires
the use of EVEX.W=1.
MOVD/Q with 32/64 reg/mem destination:
Stores the low dword/qword of the source XMM register to 32/64-bit memory location or general-purpose register.
Qword operation requires the use of REX.W=1, VEX.W=1, or EVEX.W=1.
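
A minimal C round-trip sketch using the intrinsic forms listed below (the 64-bit conversions assume an x86-64 target):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* MOVQ r64 -> xmm: low quadword written, upper bits zeroed. */
    __m128i v = _mm_cvtsi64_si128((int64_t)0x1122334455667788);
    /* MOVD xmm -> r32: low doubleword read back out. */
    unsigned int lo = (unsigned int)_mm_cvtsi128_si32(v);
    printf("low dword = %08X\n", lo); /* 55667788 */
    return 0;
}
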
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
Attempting to execute VMOVD or VMOVQ encoded with VEX.L = 1 will cause an #UD exception.

Operation
MOVD (When Destination Operand is an MMX Technology Register)
DEST[31:0] := SRC;
DEST[63:32] := 00000000H;

MOVD (When Destination Operand is an XMM Register)


DEST[31:0] := SRC;
DEST[127:32] := 000000000000000000000000H;
DEST[MAXVL-1:128] (Unmodified)

MOVD (When Source Operand is an MMX Technology or XMM Register)


DEST := SRC[31:0];



VMOVD (VEX-Encoded Version when Destination is an XMM Register)
DEST[31:0] := SRC[31:0]
DEST[MAXVL-1:32] := 0

MOVQ (When Destination Operand is an XMM Register)


DEST[63:0] := SRC[63:0];
DEST[127:64] := 0000000000000000H;
DEST[MAXVL-1:128] (Unmodified)

MOVQ (When Destination Operand is r/m64)


DEST[63:0] := SRC[63:0];

MOVQ (When Source Operand is an XMM Register or r/m64)


DEST := SRC[63:0];

VMOVQ (VEX-Encoded Version When Destination is an XMM Register)


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVD (EVEX-Encoded Version When Destination is an XMM Register)


DEST[31:0] := SRC[31:0]
DEST[MAXVL-1:32] := 0

VMOVQ (EVEX-Encoded Version When Destination is an XMM Register)


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

Intel C/C++ Compiler Intrinsic Equivalent


MOVD __m64 _mm_cvtsi32_si64 (int i )
MOVD int _mm_cvtsi64_si32 ( __m64 m )
MOVD __m128i _mm_cvtsi32_si128 (int a)
MOVD int _mm_cvtsi128_si32 ( __m128i a)
MOVQ __int64 _mm_cvtsi128_si64(__m128i);
MOVQ __m128i _mm_cvtsi64_si128(__int64);
VMOVD __m128i _mm_cvtsi32_si128( int);
VMOVD int _mm_cvtsi128_si32( __m128i );
VMOVQ __m128i _mm_cvtsi64_si128 (__int64);
VMOVQ __int64 _mm_cvtsi128_si64(__m128i );
VMOVQ __m128i _mm_loadl_epi64( __m128i * s);
VMOVQ void _mm_storel_epi64( __m128i * d, __m128i s);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1.
If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



MOVDQA,VMOVDQA32/64—Move Aligned Packed Integer Values
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
66 0F 6F /r A V/V SSE2 Move aligned packed integer values from
MOVDQA xmm1, xmm2/m128 xmm2/mem to xmm1.
66 0F 7F /r B V/V SSE2 Move aligned packed integer values from
MOVDQA xmm2/m128, xmm1 xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 6F /r A V/V AVX Move aligned packed integer values from
VMOVDQA xmm1, xmm2/m128 xmm2/mem to xmm1.
VEX.128.66.0F.WIG 7F /r B V/V AVX Move aligned packed integer values from
VMOVDQA xmm2/m128, xmm1 xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 6F /r A V/V AVX Move aligned packed integer values from
VMOVDQA ymm1, ymm2/m256 ymm2/mem to ymm1.
VEX.256.66.0F.WIG 7F /r B V/V AVX Move aligned packed integer values from
VMOVDQA ymm2/m256, ymm1 ymm1 to ymm2/mem.
EVEX.128.66.0F.W0 6F /r C V/V (AVX512VL Move aligned packed doubleword integer
VMOVDQA32 xmm1 {k1}{z}, AND AVX512F) values from xmm2/m128 to xmm1 using
xmm2/m128 OR AVX10.11 writemask k1.
EVEX.256.66.0F.W0 6F /r C V/V (AVX512VL Move aligned packed doubleword integer
VMOVDQA32 ymm1 {k1}{z}, AND AVX512F) values from ymm2/m256 to ymm1 using
ymm2/m256 OR AVX10.11 writemask k1.
EVEX.512.66.0F.W0 6F /r C V/V AVX512F Move aligned packed doubleword integer
VMOVDQA32 zmm1 {k1}{z}, zmm2/m512 OR AVX10.11 values from zmm2/m512 to zmm1 using
writemask k1.
EVEX.128.66.0F.W0 7F /r D V/V (AVX512VL Move aligned packed doubleword integer
VMOVDQA32 xmm2/m128 {k1}{z}, AND AVX512F) values from xmm1 to xmm2/m128 using
xmm1 OR AVX10.11 writemask k1.
EVEX.256.66.0F.W0 7F /r D V/V (AVX512VL Move aligned packed doubleword integer
VMOVDQA32 ymm2/m256 {k1}{z}, AND AVX512F) values from ymm1 to ymm2/m256 using
ymm1 OR AVX10.11 writemask k1.
EVEX.512.66.0F.W0 7F /r D V/V AVX512F Move aligned packed doubleword integer
VMOVDQA32 zmm2/m512 {k1}{z}, zmm1 OR AVX10.11 values from zmm1 to zmm2/m512 using
writemask k1.
EVEX.128.66.0F.W1 6F /r C V/V (AVX512VL Move aligned packed quadword integer values
VMOVDQA64 xmm1 {k1}{z}, AND AVX512F) from xmm2/m128 to xmm1 using writemask
xmm2/m128 OR AVX10.11 k1.
EVEX.256.66.0F.W1 6F /r C V/V (AVX512VL Move aligned packed quadword integer values
VMOVDQA64 ymm1 {k1}{z}, AND AVX512F) from ymm2/m256 to ymm1 using writemask
ymm2/m256 OR AVX10.11 k1.
EVEX.512.66.0F.W1 6F /r C V/V AVX512F Move aligned packed quadword integer values
VMOVDQA64 zmm1 {k1}{z}, zmm2/m512 OR AVX10.11 from zmm2/m512 to zmm1 using writemask
k1.
EVEX.128.66.0F.W1 7F /r D V/V (AVX512VL Move aligned packed quadword integer values
VMOVDQA64 xmm2/m128 {k1}{z}, AND AVX512F) from xmm1 to xmm2/m128 using writemask
xmm1 OR AVX10.11 k1.
EVEX.256.66.0F.W1 7F /r D V/V (AVX512VL Move aligned packed quadword integer values
VMOVDQA64 ymm2/m256 {k1}{z}, AND AVX512F) from ymm1 to ymm2/m256 using writemask
ymm1 OR AVX10.11 k1.
EVEX.512.66.0F.W1 7F /r D V/V AVX512F Move aligned packed quadword integer values
VMOVDQA64 zmm2/m512 {k1}{z}, zmm1 OR AVX10.11 from zmm1 to zmm2/m512 using writemask
k1.



NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Full Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX encoded versions:
Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second
operand) to the destination operand (the first operand). This instruction can be used to load a vector register from
an int32/int64 memory location, to store the contents of a vector register into an int32/int64 memory location, or
to move data between two ZMM registers. When the source or destination operand is a memory operand, the
operand must be aligned on a 16 (EVEX.128)/32(EVEX.256)/64(EVEX.512)-byte boundary or a general-protection
exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the
VMOVDQU instruction.
The destination operand is updated at 32-bit (VMOVDQA32) or 64-bit (VMOVDQA64) granularity according to the
writemask.
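
As an illustration of the two granularities (a sketch assuming AVX512F, or AVX10 with 512-bit vectors), the register-copy forms below select the same low 128 bits: VMOVDQA32 consumes one mask bit per doubleword, VMOVDQA64 one per quadword.

#include <immintrin.h>

__m512i copy_low128_two_ways(__m512i src, __m512i fallback)
{
    /* Dword granularity: 4 mask bits cover bits 127:0. */
    __m512i d = _mm512_mask_mov_epi32(fallback, (__mmask16)0x000F, src);
    /* Qword granularity: 2 mask bits cover the same 128 bits. */
    __m512i q = _mm512_mask_mov_epi64(fallback, (__mmask8)0x03, src);
    /* d and q are identical here; OR-ing them returns that value. */
    return _mm512_or_si512(d, q);
}
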
VEX.256 encoded version:
Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory
locations, use the VMOVDQU instruction. Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory
locations, use the VMOVDQU instruction.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain
unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.



Operation
VMOVDQA32 (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQA32 (EVEX Encoded Versions, Store-Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVDQA32 (EVEX Encoded Versions, Load-Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQA64 (EVEX Encoded Versions, Register-Copy Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQA64 (EVEX Encoded Versions, Store-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVDQA64 (EVEX Encoded Versions, Load-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQA (VEX.256 Encoded Version, Load - and Register Copy)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVDQA (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVDQA (VEX.128 Encoded Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

VMOVDQA (128-bit Load- and Register-Copy- Form Legacy SSE Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVDQA (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]



Intel C/C++ Compiler Intrinsic Equivalent
VMOVDQA32 __m512i _mm512_load_epi32( void * sa);
VMOVDQA32 __m512i _mm512_mask_load_epi32(__m512i s, __mmask16 k, void * sa);
VMOVDQA32 __m512i _mm512_maskz_load_epi32( __mmask16 k, void * sa);
VMOVDQA32 void _mm512_store_epi32(void * d, __m512i a);
VMOVDQA32 void _mm512_mask_store_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQA32 __m256i _mm256_mask_load_epi32(__m256i s, __mmask8 k, void * sa);
VMOVDQA32 __m256i _mm256_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm256_store_epi32(void * d, __m256i a);
VMOVDQA32 void _mm256_mask_store_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQA32 __m128i _mm_mask_load_epi32(__m128i s, __mmask8 k, void * sa);
VMOVDQA32 __m128i _mm_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm_store_epi32(void * d, __m128i a);
VMOVDQA32 void _mm_mask_store_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQA64 __m512i _mm512_load_epi64( void * sa);
VMOVDQA64 __m512i _mm512_mask_load_epi64(__m512i s, __mmask8 k, void * sa);
VMOVDQA64 __m512i _mm512_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm512_store_epi64(void * d, __m512i a);
VMOVDQA64 void _mm512_mask_store_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQA64 __m256i _mm256_mask_load_epi64(__m256i s, __mmask8 k, void * sa);
VMOVDQA64 __m256i _mm256_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm256_store_epi64(void * d, __m256i a);
VMOVDQA64 void _mm256_mask_store_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQA64 __m128i _mm_mask_load_epi64(__m128i s, __mmask8 k, void * sa);
VMOVDQA64 __m128i _mm_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm_store_epi64(void * d, __m128i a);
VMOVDQA64 void _mm_mask_store_epi64(void * d, __mmask8 k, __m128i a);
MOVDQA __m256i _mm256_load_si256 (__m256i * p);
MOVDQA void _mm256_store_si256(__m256i *p, __m256i a);
MOVDQA __m128i _mm_load_si128 (__m128i * p);
MOVDQA void _mm_store_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVDQU,VMOVDQU8/16/32/64—Move Unaligned Packed Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 6F /r MOVDQU xmm1, xmm2/m128 | A | V/V | SSE2 | Move unaligned packed integer values from xmm2/m128 to xmm1.
F3 0F 7F /r MOVDQU xmm2/m128, xmm1 | B | V/V | SSE2 | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.128.F3.0F.WIG 6F /r VMOVDQU xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed integer values from xmm2/m128 to xmm1.
VEX.128.F3.0F.WIG 7F /r VMOVDQU xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.256.F3.0F.WIG 6F /r VMOVDQU ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed integer values from ymm2/m256 to ymm1.
VEX.256.F3.0F.WIG 7F /r VMOVDQU ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed integer values from ymm1 to ymm2/m256.
EVEX.128.F2.0F.W0 6F /r VMOVDQU8 xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed byte integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W0 6F /r VMOVDQU8 ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed byte integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W0 6F /r VMOVDQU8 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512BW OR AVX10.1 [1] | Move unaligned packed byte integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W0 7F /r VMOVDQU8 xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed byte integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W0 7F /r VMOVDQU8 ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed byte integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W0 7F /r VMOVDQU8 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512BW OR AVX10.1 [1] | Move unaligned packed byte integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F2.0F.W1 6F /r VMOVDQU16 xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed word integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W1 6F /r VMOVDQU16 ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed word integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W1 6F /r VMOVDQU16 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512BW OR AVX10.1 [1] | Move unaligned packed word integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W1 7F /r VMOVDQU16 xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed word integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W1 7F /r VMOVDQU16 ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] | Move unaligned packed word integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W1 7F /r VMOVDQU16 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512BW OR AVX10.1 [1] | Move unaligned packed word integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W0 6F /r VMOVDQU32 xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W0 6F /r VMOVDQU32 ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W0 6F /r VMOVDQU32 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F OR AVX10.1 [1] | Move unaligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W0 7F /r VMOVDQU32 xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W0 7F /r VMOVDQU32 ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W0 7F /r VMOVDQU32 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F OR AVX10.1 [1] | Move unaligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W1 6F /r VMOVDQU64 xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed quadword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W1 6F /r VMOVDQU64 ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed quadword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W1 6F /r VMOVDQU64 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F OR AVX10.1 [1] | Move unaligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W1 7F /r VMOVDQU64 xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W1 7F /r VMOVDQU64 ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move unaligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W1 7F /r VMOVDQU64 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F OR AVX10.1 [1] | Move unaligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A
B | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX encoded versions:
Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand
(the second operand) to the destination operand (first operand). This instruction can be used to load a vector
register from a memory location, to store the contents of a vector register into a memory location, or to move data
between two vector registers.
The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit
(VMOVDQU64) granularity according to the writemask.
VEX.256 encoded version:
Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.
Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
When the source or destination operand is a memory operand, the operand may be unaligned on any boundary
without a general-protection exception (#GP) being generated.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
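
As a sketch of the writemask behavior (illustrative C, not SDM text; assumes an AVX-512 capable compiler and processor, and the function name is hypothetical):

#include <immintrin.h>
#include <stdint.h>

void copy_tail_dwords(int32_t *dst, const int32_t *src, unsigned n) /* n < 16 */
{
    /* VMOVDQU32 with writemask k: only the low n doublewords are
       touched, and no #GP is raised for unaligned addresses. The
       maskz load zeroes masked-off elements; the masked store leaves
       the corresponding memory untouched. */
    __mmask16 k = (__mmask16)((1u << n) - 1);
    __m512i v = _mm512_maskz_loadu_epi32(k, src);
    _mm512_mask_storeu_epi32(dst, k, v);
}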

Operation
VMOVDQU8 (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE DEST[i+7:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU8 (EVEX Encoded Versions, Store-Form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] :=
SRC[i+7:i]
ELSE *DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVDQU8 (EVEX Encoded Versions, Load-Form)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE DEST[i+7:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU16 (EVEX Encoded Versions, Register-Copy Form)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE DEST[i+15:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU16 (EVEX Encoded Versions, Store-Form)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] :=
SRC[i+15:i]
ELSE *DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVDQU16 (EVEX Encoded Versions, Load-Form)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE DEST[i+15:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU32 (EVEX Encoded Versions, Register-Copy Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU32 (EVEX Encoded Versions, Store-Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
SRC[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR;

VMOVDQU32 (EVEX Encoded Versions, Load-Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR

DEST[MAXVL-1:VL] := 0

VMOVDQU64 (EVEX Encoded Versions, Register-Copy Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU64 (EVEX Encoded Versions, Store-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking

FI;
ENDFOR;

VMOVDQU64 (EVEX Encoded Versions, Load-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVDQU (VEX.256 Encoded Version, Load- and Register-Copy Form)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVDQU (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVDQU (VEX.128 encoded version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

MOVDQU (128-bit Load- and Register-Copy Form, Legacy SSE Version)
DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVDQU (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent


VMOVDQU16 __m512i _mm512_mask_loadu_epi16(__m512i s, __mmask32 k, void * sa);
VMOVDQU16 __m512i _mm512_maskz_loadu_epi16( __mmask32 k, void * sa);
VMOVDQU16 void _mm512_mask_storeu_epi16(void * d, __mmask32 k, __m512i a);
VMOVDQU16 __m256i _mm256_mask_loadu_epi16(__m256i s, __mmask16 k, void * sa);
VMOVDQU16 __m256i _mm256_maskz_loadu_epi16( __mmask16 k, void * sa);
VMOVDQU16 void _mm256_mask_storeu_epi16(void * d, __mmask16 k, __m256i a);
VMOVDQU16 __m128i _mm_mask_loadu_epi16(__m128i s, __mmask8 k, void * sa);
VMOVDQU16 __m128i _mm_maskz_loadu_epi16( __mmask8 k, void * sa);
VMOVDQU16 void _mm_mask_storeu_epi16(void * d, __mmask8 k, __m128i a);
VMOVDQU32 __m512i _mm512_loadu_epi32( void * sa);
VMOVDQU32 __m512i _mm512_mask_loadu_epi32(__m512i s, __mmask16 k, void * sa);
VMOVDQU32 __m512i _mm512_maskz_loadu_epi32( __mmask16 k, void * sa);
VMOVDQU32 void _mm512_storeu_epi32(void * d, __m512i a);
VMOVDQU32 void _mm512_mask_storeu_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQU32 __m256i _mm256_mask_loadu_epi32(__m256i s, __mmask8 k, void * sa);
VMOVDQU32 __m256i _mm256_maskz_loadu_epi32( __mmask8 k, void * sa);
VMOVDQU32 void _mm256_storeu_epi32(void * d, __m256i a);
VMOVDQU32 void _mm256_mask_storeu_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQU32 __m128i _mm_mask_loadu_epi32(__m128i s, __mmask8 k, void * sa);
VMOVDQU32 __m128i _mm_maskz_loadu_epi32( __mmask8 k, void * sa);
VMOVDQU32 void _mm_storeu_epi32(void * d, __m128i a);
VMOVDQU32 void _mm_mask_storeu_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQU64 __m512i _mm512_loadu_epi64( void * sa);
VMOVDQU64 __m512i _mm512_mask_loadu_epi64(__m512i s, __mmask8 k, void * sa);
VMOVDQU64 __m512i _mm512_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm512_storeu_epi64(void * d, __m512i a);
VMOVDQU64 void _mm512_mask_storeu_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQU64 __m256i _mm256_mask_loadu_epi64(__m256i s, __mmask8 k, void * sa);
VMOVDQU64 __m256i _mm256_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm256_storeu_epi64(void * d, __m256i a);
VMOVDQU64 void _mm256_mask_storeu_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQU64 __m128i _mm_mask_loadu_epi64(__m128i s, __mmask8 k, void * sa);
VMOVDQU64 __m128i _mm_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm_storeu_epi64(void * d, __m128i a);
VMOVDQU64 void _mm_mask_storeu_epi64(void * d, __mmask8 k, __m128i a);
VMOVDQU8 __m512i _mm512_mask_loadu_epi8(__m512i s, __mmask64 k, void * sa);
VMOVDQU8 __m512i _mm512_maskz_loadu_epi8( __mmask64 k, void * sa);
VMOVDQU8 void _mm512_mask_storeu_epi8(void * d, __mmask64 k, __m512i a);
VMOVDQU8 __m256i _mm256_mask_loadu_epi8(__m256i s, __mmask32 k, void * sa);
VMOVDQU8 __m256i _mm256_maskz_loadu_epi8( __mmask32 k, void * sa);
VMOVDQU8 void _mm256_mask_storeu_epi8(void * d, __mmask32 k, __m256i a);
VMOVDQU8 __m128i _mm_mask_loadu_epi8(__m128i s, __mmask16 k, void * sa);
VMOVDQU8 __m128i _mm_maskz_loadu_epi8( __mmask16 k, void * sa);
VMOVDQU8 void _mm_mask_storeu_epi8(void * d, __mmask16 k, __m128i a);
MOVDQU __m256i _mm256_loadu_si256 (__m256i * p);

MOVDQU void _mm256_storeu_si256(__m256i *p, __m256i a);
MOVDQU __m128i _mm_loadu_si128 (__m128i * p);
MOVDQU void _mm_storeu_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVHLPS—Move Packed Single Precision Floating-Point Values High to Low
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 12 /r MOVHLPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single precision floating-point values from high quadword of xmm2 to low quadword of xmm1.
VEX.128.0F.WIG 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single precision floating-point values from high quadword of xmm3 and low quadword of xmm2.
EVEX.128.0F.W0 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F OR AVX10.1 [1] | Merge two packed single precision floating-point values from high quadword of xmm3 and low quadword of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A
RVM | ModRM:reg (w) | VEX.vvvv (r) / EVEX.vvvv (r) | ModRM:r/m (r) | N/A

NOTES:
1. ModRM.MOD = 011B required.

Description
This instruction cannot be used for memory to register moves.
128-bit two-argument form:
Moves two packed single precision floating-point values from the high quadword of the second XMM argument
(second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of
the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain
unchanged.
128-bit and EVEX three-argument form:
Moves two packed single precision floating-point values from the high quadword of the third XMM argument (third
operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM
argument (second operand) to the high quadword of the destination (first operand). Bits (MAXVL-1:128) of the
corresponding destination register are zeroed.
If VMOVHLPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
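
For illustration, a common use of MOVHLPS is the horizontal-add idiom (a C sketch, not SDM text; assumes <immintrin.h> and SSE support):

#include <immintrin.h>

float hsum4(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);            /* MOVHLPS: fold high pair onto low */
    __m128 s  = _mm_add_ps(v, hi);              /* {v0+v2, v1+v3, ...}              */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* add element 1 into element 0     */
    return _mm_cvtss_f32(s);                    /* v0+v1+v2+v3                      */
}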

Operation
MOVHLPS (128-bit Two-Argument Form)
DEST[63:0] := SRC[127:64]
DEST[MAXVL-1:64] (Unmodified)

VMOVHLPS (128-bit Three-Argument Form - VEX & EVEX)


DEST[63:0] := SRC2[127:64]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent
MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Exceptions Type E7NM.128 in Table 2-57, “Type E7NM Class Exception Conditions.”

MOVHPD—Move High Packed Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 16 /r MOVHPD xmm1, m64 | A | V/V | SSE2 | Move double precision floating-point value from m64 to high quadword of xmm1.
VEX.128.66.0F.WIG 16 /r VMOVHPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double precision floating-point value from m64 and the low quadword of xmm1.
EVEX.128.66.0F.W1 16 /r VMOVHPD xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 [1] | Merge double precision floating-point value from m64 and the low quadword of xmm1.
66 0F 17 /r MOVHPD m64, xmm1 | C | V/V | SSE2 | Move double precision floating-point value from high quadword of xmm1 to m64.
VEX.128.66.0F.WIG 17 /r VMOVHPD m64, xmm1 | C | V/V | AVX | Move double precision floating-point value from high quadword of xmm1 to m64.
EVEX.128.66.0F.W1 17 /r VMOVHPD m64, xmm1 | E | V/V | AVX512F OR AVX10.1 [1] | Move double precision floating-point value from high quadword of xmm1 to m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | ModRM:r/m (r) | N/A | N/A
B | N/A | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
C | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
D | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A
E | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves a double precision floating-point value from the source 64-bit memory operand and stores it in the high 64-
bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the
corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads a double precision floating-point value from the source 64-bit memory operand (the third operand) and
stores it in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from the first source
operand (second operand) are copied to the low 64-bits of the destination. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
128-bit store:
Stores a double precision floating-point value from the high 64-bits of the XMM register source (second operand)
to the 64-bit memory location (first operand).
Note: VMOVHPD (store) (VEX.128.66.0F 17 /r) is legal and has the same behavior as the existing 66 0F 17 store.
For VMOVHPD (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
If VMOVHPD is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
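
For illustration, MOVHPD pairs naturally with a scalar load to assemble two doubles from separate arrays (a C sketch, not SDM text; the array names are hypothetical, and <immintrin.h> is assumed):

#include <immintrin.h>

__m128d pack_pair(const double *re, const double *im)
{
    __m128d v = _mm_load_sd(re);    /* MOVSD: fill the low quadword   */
    return _mm_loadh_pd(v, im);     /* MOVHPD: fill the high quadword */
}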

Operation
MOVHPD (128-bit Legacy SSE Load)
DEST[63:0] (Unmodified)
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVHPD (VEX.128 & EVEX Encoded Load)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[MAXVL-1:128] := 0

VMOVHPD (Store)
DEST[63:0] := SRC[127:64]

Intel C/C++ Compiler Intrinsic Equivalent


MOVHPD __m128d _mm_loadh_pd ( __m128d a, double *p)
MOVHPD void _mm_storeh_pd (double *p, __m128d a)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”

MOVHPS—Move High Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 16 /r MOVHPS xmm1, m64 | A | V/V | SSE | Move two packed single precision floating-point values from m64 to high quadword of xmm1.
VEX.128.0F.WIG 16 /r VMOVHPS xmm2, xmm1, m64 | B | V/V | AVX | Merge two packed single precision floating-point values from m64 and the low quadword of xmm1.
EVEX.128.0F.W0 16 /r VMOVHPS xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 [1] | Merge two packed single precision floating-point values from m64 and the low quadword of xmm1.
NP 0F 17 /r MOVHPS m64, xmm1 | C | V/V | SSE | Move two packed single precision floating-point values from high quadword of xmm1 to m64.
VEX.128.0F.WIG 17 /r VMOVHPS m64, xmm1 | C | V/V | AVX | Move two packed single precision floating-point values from high quadword of xmm1 to m64.
EVEX.128.0F.W0 17 /r VMOVHPS m64, xmm1 | E | V/V | AVX512F OR AVX10.1 [1] | Move two packed single precision floating-point values from high quadword of xmm1 to m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | ModRM:r/m (r) | N/A | N/A
B | N/A | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
C | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
D | Tuple2 | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A
E | Tuple2 | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves two packed single precision floating-point values from the source 64-bit memory operand and stores them
in the high 64-bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits
(MAXVL-1:128) of the corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads two single precision floating-point values from the source 64-bit memory operand (the third operand) and
stores them in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from the first source
operand (the second operand) are copied to the lower 64-bits of the destination. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
128-bit store:
Stores two packed single precision floating-point values from the high 64-bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).
Note: VMOVHPS (store) (VEX.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store. For
VMOVHPS (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
If VMOVHPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
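
A minimal sketch of the store form (illustrative C, not SDM text; assumes <immintrin.h> and that hi2 points to at least two floats):

#include <immintrin.h>

void store_high_pair(float *hi2, __m128 v)
{
    /* MOVHPS store: writes the two upper floats of v to a 64-bit
       memory location; the two lower floats are not stored. */
    _mm_storeh_pi((__m64 *)hi2, v);
}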

Operation
MOVHPS (128-bit Legacy SSE Load)
DEST[63:0] (Unmodified)
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVHPS (VEX.128 and EVEX Encoded Load)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[MAXVL-1:128] := 0

VMOVHPS (Store)
DEST[63:0] := SRC[127:64]

Intel C/C++ Compiler Intrinsic Equivalent


MOVHPS __m128 _mm_loadh_pi ( __m128 a, __m64 *p)
MOVHPS void _mm_storeh_pi (__m64 *p, __m128 a)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”

MOVLHPS—Move Packed Single Precision Floating-Point Values Low to High
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 16 /r MOVLHPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single precision floating-point values from low quadword of xmm2 to high quadword of xmm1.
VEX.128.0F.WIG 16 /r VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single precision floating-point values from low quadword of xmm3 and low quadword of xmm2.
EVEX.128.0F.W0 16 /r VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F OR AVX10.1 [1] | Merge two packed single precision floating-point values from low quadword of xmm3 and low quadword of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A
RVM | ModRM:reg (w) | VEX.vvvv (r) / EVEX.vvvv (r) | ModRM:r/m (r) | N/A

NOTES:
1. ModRM.MOD = 011B required.

Description
This instruction cannot be used for memory to register moves.
128-bit two-argument form:
Moves two packed single precision floating-point values from the low quadword of the second XMM argument
(second operand) to the high quadword of the first XMM register (first argument). The low quadword of the desti-
nation operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register are unmodified.
128-bit three-argument forms:
Moves two packed single precision floating-point values from the low quadword of the third XMM argument (third
operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM
argument (second operand) to the low quadword of the destination (first operand). Bits (MAXVL-1:128) of the
corresponding destination register are zeroed.
If VMOVLHPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
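
For illustration (a C sketch, not SDM text; assumes <immintrin.h>), MOVLHPS concatenates the low halves of two registers:

#include <immintrin.h>

/* Returns {a0, a1, b0, b1}: the low quadword of b is moved to the high
   quadword of the result (MOVLHPS); the low quadword of a is kept. */
__m128 join_low_pairs(__m128 a, __m128 b)
{
    return _mm_movelh_ps(a, b);
}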

Operation
MOVLHPS (128-bit Two-Argument Form)
DEST[63:0] (Unmodified)
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVLHPS (128-bit Three-Argument Form - VEX & EVEX)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent
MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Exceptions Type E7NM.128 in Table 2-57, “Type E7NM Class Exception Conditions.”

MOVLPD—Move Low Packed Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 12 /r MOVLPD xmm1, m64 | A | V/V | SSE2 | Move double precision floating-point value from m64 to low quadword of xmm1.
VEX.128.66.0F.WIG 12 /r VMOVLPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double precision floating-point value from m64 and the high quadword of xmm1.
EVEX.128.66.0F.W1 12 /r VMOVLPD xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 [1] | Merge double precision floating-point value from m64 and the high quadword of xmm1.
66 0F 13 /r MOVLPD m64, xmm1 | C | V/V | SSE2 | Move double precision floating-point value from low quadword of xmm1 to m64.
VEX.128.66.0F.WIG 13 /r VMOVLPD m64, xmm1 | C | V/V | AVX | Move double precision floating-point value from low quadword of xmm1 to m64.
EVEX.128.66.0F.W1 13 /r VMOVLPD m64, xmm1 | E | V/V | AVX512F OR AVX10.1 [1] | Move double precision floating-point value from low quadword of xmm1 to m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | ModRM:r/m (r) | N/A | N/A
B | N/A | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
C | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
D | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A
E | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves a double precision floating-point value from the source 64-bit memory operand and stores it in the low 64-
bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the
corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads a double precision floating-point value from the source 64-bit memory operand (third operand), merges it
with the upper 64-bits of the first source XMM register (second operand), and stores it in the low 128-bits of the
destination XMM register (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
128-bit store:
Stores a double precision floating-point value from the low 64-bits of the XMM register source (second operand) to
the 64-bit memory location (first operand).
Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store.
For VMOVLPD (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
If VMOVLPD is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
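
A minimal sketch of the load form (illustrative C, not SDM text; assumes <immintrin.h>):

#include <immintrin.h>

__m128d replace_low(__m128d v, const double *p)
{
    /* MOVLPD load: new low quadword from *p; high quadword of v kept. */
    return _mm_loadl_pd(v, p);
}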

Operation
MOVLPD (128-bit Legacy SSE Load)
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

VMOVLPD (VEX.128 & EVEX Encoded Load)


DEST[63:0] := SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMOVLPD (Store)
DEST[63:0] := SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent


MOVLPD __m128d _mm_loadl_pd ( __m128d a, double *p)
MOVLPD void _mm_storel_pd (double *p, __m128d a)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”

MOVLPS—Move Low Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 12 /r MOVLPS xmm1, m64 | A | V/V | SSE | Move two packed single precision floating-point values from m64 to low quadword of xmm1.
VEX.128.0F.WIG 12 /r VMOVLPS xmm2, xmm1, m64 | B | V/V | AVX | Merge two packed single precision floating-point values from m64 and the high quadword of xmm1.
EVEX.128.0F.W0 12 /r VMOVLPS xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 [1] | Merge two packed single precision floating-point values from m64 and the high quadword of xmm1.
0F 13 /r MOVLPS m64, xmm1 | C | V/V | SSE | Move two packed single precision floating-point values from low quadword of xmm1 to m64.
VEX.128.0F.WIG 13 /r VMOVLPS m64, xmm1 | C | V/V | AVX | Move two packed single precision floating-point values from low quadword of xmm1 to m64.
EVEX.128.0F.W0 13 /r VMOVLPS m64, xmm1 | E | V/V | AVX512F OR AVX10.1 [1] | Move two packed single precision floating-point values from low quadword of xmm1 to m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | ModRM:r/m (r) | N/A | N/A
B | N/A | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
C | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
D | Tuple2 | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A
E | Tuple2 | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves two packed single precision floating-point values from the source 64-bit memory operand and stores them
in the low 64-bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits
(MAXVL-1:128) of the corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads two packed single precision floating-point values from the source 64-bit memory operand (the third
operand), merges them with the upper 64-bits of the first source operand (the second operand), and stores them
in the low 128-bits of the destination register (the first operand). Bits (MAXVL-1:128) of the corresponding desti-
nation register are zeroed.
128-bit store:
Stores two packed single precision floating-point values from the low 64-bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).
Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For
VMOVLPS (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
If VMOVLPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD
exception.
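
A minimal sketch of the load and store forms (illustrative C, not SDM text; assumes <immintrin.h> and that the pointers reference at least two floats each):

#include <immintrin.h>

void rewrite_low_pair(float *dst2, const float *src2, __m128 v)
{
    v = _mm_loadl_pi(v, (const __m64 *)src2);  /* MOVLPS load: low qword replaced */
    _mm_storel_pi((__m64 *)dst2, v);           /* MOVLPS store: low qword written */
}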

Operation
MOVLPS (128-bit Legacy SSE Load)
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

VMOVLPS (VEX.128 & EVEX Encoded Load)


DEST[63:0] := SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMOVLPS (Store)
DEST[63:0] := SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent


MOVLPS __m128 _mm_loadl_pi ( __m128 a, __m64 *p)
MOVLPS void _mm_storel_pi (__m64 *p, __m128 a)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”

MOVNTDQ—Store Packed Integers Using Non-Temporal Hint
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F E7 /r MOVNTDQ m128, xmm1 | A | V/V | SSE2 | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.128.66.0F.WIG E7 /r VMOVNTDQ m128, xmm1 | A | V/V | AVX | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.256.66.0F.WIG E7 /r VMOVNTDQ m256, ymm1 | A | V/V | AVX | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.128.66.0F.W0 E7 /r VMOVNTDQ m128, xmm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed integer values in xmm1 to m128 using non-temporal hint.
EVEX.256.66.0F.W0 E7 /r VMOVNTDQ m256, ymm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.512.66.0F.W0 E7 /r VMOVNTDQ m512, zmm1 | B | V/V | AVX512F OR AVX10.1 [1] | Move packed integer values in zmm1 to m512 using non-temporal hint.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
B | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

NOTES:
1. ModRM.MOD != 011B.

Description
Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using
a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM
register, YMM register or ZMM register, which is assumed to contain integer data (packed bytes, words, double-
words, or quadwords). The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (512-bit
version) boundary otherwise a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple proces-
sors might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will
#UD.
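
For illustration, the usual pattern is a streaming-store loop followed by a fence (a C sketch, not SDM text; assumes <immintrin.h> and a 16-byte-aligned destination):

#include <immintrin.h>
#include <stddef.h>

void stream_fill(__m128i *dst, __m128i v, size_t n /* count of 16-byte chunks */)
{
    /* MOVNTDQ stores bypass the cache hierarchy via WC buffers. */
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128(dst + i, v);
    _mm_sfence();   /* order the weakly-ordered WC writes before later stores */
}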

Operation
VMOVNTDQ(EVEX Encoded Versions)
VL = 128, 256, 512
DEST[VL-1:0] := SRC[VL-1:0]
DEST[MAXVL-1:VL] := 0

MOVNTDQ (Legacy and VEX Versions)


DEST := SRC

Intel C/C++ Compiler Intrinsic Equivalent


VMOVNTDQ void _mm512_stream_si512(void * p, __m512i a);
VMOVNTDQ void _mm256_stream_si256 (__m256i * p, __m256i a);
MOVNTDQ void _mm_stream_si128 (__m128i * p, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2A /r MOVNTDQA xmm1, m128 | A | V/V | SSE4_1 | Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.
VEX.128.66.0F38.WIG 2A /r VMOVNTDQA xmm1, m128 | A | V/V | AVX | Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.
VEX.256.66.0F38.WIG 2A /r VMOVNTDQA ymm1, m256 | A | V/V | AVX2 | Move 256-bit data from m256 to ymm1 using non-temporal hint if WC memory type.
EVEX.128.66.0F38.W0 2A /r VMOVNTDQA xmm1, m128 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move 128-bit data from m128 to xmm1 using non-temporal hint if WC memory type.
EVEX.256.66.0F38.W0 2A /r VMOVNTDQA ymm1, m256 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move 256-bit data from m256 to ymm1 using non-temporal hint if WC memory type.
EVEX.512.66.0F38.W0 2A /r VMOVNTDQA zmm1, m512 | B | V/V | AVX512F OR AVX10.1 [1] | Move 512-bit data from m512 to zmm1 using non-temporal hint if WC memory type.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A
B | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | N/A | N/A

NOTES:
1. ModRM.MOD != 011B.

Description
MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first
operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory
type, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an
aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped
and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the
temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any
time for any reason, for example:
• A load operation other than a MOVNTDQA which references memory already resident in a temporary internal
buffer.
• A non-WC reference to memory already resident in a temporary internal buffer.
• Interleaving of reads and writes to a single temporary internal buffer.
• Repeated (V)MOVNTDQA loads of a particular 16-byte item in a streaming line.
• Certain micro-architectural conditions including resource shortages, detection of a mis-speculation condition,
and various fault conditions.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the
data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC
memory region. Information on non-temporal reads and writes can be found in “Caching of Temporal vs. Non-
Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
a MFENCE instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might use
different memory types for the referenced memory locations or to synchronize reads of a processor with writes by
other agents in the system. A processor’s implementation of the streaming load hint does not override the effective
memory type, but the implementation of the hint is processor dependent. For example, a processor implementa-
tion may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type. Alter-
natively, another implementation may optimize cache reads generated by MOVNTDQA on WB memory type to
reduce cache evictions.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.
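
A minimal sketch of a streaming load (illustrative C, not SDM text; assumes <immintrin.h>, SSE4.1, and a 16-byte-aligned source, ideally of WC memory type):

#include <immintrin.h>

__m128i stream_read(const __m128i *src)
{
    /* MOVNTDQA: on WC memory the line is staged in a streaming buffer
       rather than the cache. The cast drops const because some older
       headers declare a non-const parameter. */
    return _mm_stream_load_si128((__m128i *)src);
}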

Operation
MOVNTDQA (128-bit Legacy SSE Form)
DEST := SRC
DEST[MAXVL-1:128] (Unmodified)

VMOVNTDQA (VEX.128 and EVEX.128 Encoded Form)


DEST := SRC
DEST[MAXVL-1:128] := 0

VMOVNTDQA (VEX.256 and EVEX.256 Encoded Forms)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVNTDQA (EVEX.512 Encoded Form)


DEST[511:0] := SRC[511:0]
DEST[MAXVL-1:512] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMOVNTDQA __m512i _mm512_stream_load_si512(__m512i const* p);
MOVNTDQA __m128i _mm_stream_load_si128 (const __m128i *p);
VMOVNTDQA __m256i _mm256_stream_load_si256 (__m256i const* p);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVNTPD—Store Packed Double Precision Floating-Point Values Using Non-Temporal Hint
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 2B /r MOVNTPD m128, xmm1 | A | V/V | SSE2 | Move packed double precision values in xmm1 to m128 using non-temporal hint.
VEX.128.66.0F.WIG 2B /r VMOVNTPD m128, xmm1 | A | V/V | AVX | Move packed double precision values in xmm1 to m128 using non-temporal hint.
VEX.256.66.0F.WIG 2B /r VMOVNTPD m256, ymm1 | A | V/V | AVX | Move packed double precision values in ymm1 to m256 using non-temporal hint.
EVEX.128.66.0F.W1 2B /r VMOVNTPD m128, xmm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed double precision values in xmm1 to m128 using non-temporal hint.
EVEX.256.66.0F.W1 2B /r VMOVNTPD m256, ymm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed double precision values in ymm1 to m256 using non-temporal hint.
EVEX.512.66.0F.W1 2B /r VMOVNTPD m512, zmm1 | B | V/V | AVX512F OR AVX10.1 [1] | Move packed double precision values in zmm1 to m512 using non-temporal hint.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
B | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

NOTES:
1. ModRM.MOD != 011B.

Description
Moves the packed double precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed double
precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte
(EVEX.512 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors
might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will
#UD.
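
For illustration (a C sketch, not SDM text; assumes <immintrin.h> and a 16-byte-aligned destination whose length is a multiple of two doubles):

#include <immintrin.h>
#include <stddef.h>

void stream_zero_pd(double *dst, size_t n)
{
    __m128d z = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2)
        _mm_stream_pd(dst + i, z);   /* MOVNTPD: cache-bypassing store      */
    _mm_sfence();                    /* order the WC writes for other agents */
}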

Operation
VMOVNTPD (EVEX Encoded Versions)
VL = 128, 256, 512
DEST[VL-1:0] := SRC[VL-1:0]
DEST[MAXVL-1:VL] := 0

MOVNTPD (Legacy and VEX Versions)


DEST := SRC

Intel C/C++ Compiler Intrinsic Equivalent


VMOVNTPD void _mm512_stream_pd(double * p, __m512d a);
VMOVNTPD void _mm256_stream_pd (double * p, __m256d a);
MOVNTPD void _mm_stream_pd (double * p, __m128d a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVNTPS—Store Packed Single Precision Floating-Point Values Using Non-Temporal Hint
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 2B /r MOVNTPS m128, xmm1 | A | V/V | SSE | Move packed single precision values in xmm1 to mem using non-temporal hint.
VEX.128.0F.WIG 2B /r VMOVNTPS m128, xmm1 | A | V/V | AVX | Move packed single precision values in xmm1 to mem using non-temporal hint.
VEX.256.0F.WIG 2B /r VMOVNTPS m256, ymm1 | A | V/V | AVX | Move packed single precision values in ymm1 to mem using non-temporal hint.
EVEX.128.0F.W0 2B /r VMOVNTPS m128, xmm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed single precision values in xmm1 to m128 using non-temporal hint.
EVEX.256.0F.W0 2B /r VMOVNTPS m256, ymm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] | Move packed single precision values in ymm1 to m256 using non-temporal hint.
EVEX.512.0F.W0 2B /r VMOVNTPS m512, zmm1 | B | V/V | AVX512F OR AVX10.1 [1] | Move packed single precision values in zmm1 to m512 using non-temporal hint.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding [1]

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A
B | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | N/A | N/A

NOTES:
1. ModRM.MOD != 011B.

Description
Moves the packed single precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed single
precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte
(EVEX.512 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors
might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

NOTE: 1. ModRM.MOD != 11B (i.e., the operand encoded in ModRM:r/m must be a memory location). This note applies to the Instruction Operand Encoding table above.
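
As a minimal sketch of the fencing guidance above, the following uses the _mm_stream_ps intrinsic listed later in this section together with _mm_sfence (assumes a C11 compiler with SSE support; the buffer size and fill value are illustrative only):

#include <immintrin.h>
#include <stdio.h>

/* Fill a buffer with non-temporal stores, then fence. _mm_stream_ps
   requires a 16-byte-aligned address; the SFENCE makes the
   weakly-ordered WC stores globally visible before any later stores. */
int main(void)
{
    _Alignas(16) float dst[16];
    const __m128 v = _mm_set1_ps(1.5f);

    for (int i = 0; i < 16; i += 4)
        _mm_stream_ps(&dst[i], v);   /* MOVNTPS m128, xmm */
    _mm_sfence();                    /* order the non-temporal stores */

    printf("%f\n", (double)dst[7]);
    return 0;
}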

Operation
VMOVNTPS (EVEX Encoded Versions)
VL = 128, 256, 512
DEST[VL-1:0] := SRC[VL-1:0]
DEST[MAXVL-1:VL] := 0

MOVNTPS
DEST := SRC

Intel C/C++ Compiler Intrinsic Equivalent


VMOVNTPS void _mm512_stream_ps(float * p, __m512 a);
MOVNTPS void _mm_stream_ps (float * p, __m128 a);
VMOVNTPS void _mm256_stream_ps (float * p, __m256 a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVQ—Move Quadword
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 6F /r MOVQ mm, mm/m64 | A | V/V | MMX | Move quadword from mm/m64 to mm.
NP 0F 7F /r MOVQ mm/m64, mm | B | V/V | MMX | Move quadword from mm to mm/m64.
F3 0F 7E /r MOVQ xmm1, xmm2/m64 | A | V/V | SSE2 | Move quadword from xmm2/mem64 to xmm1.
VEX.128.F3.0F.WIG 7E /r VMOVQ xmm1, xmm2/m64 | A | V/V | AVX | Move quadword from xmm2 to xmm1.
EVEX.128.F3.0F.W1 7E /r VMOVQ xmm1, xmm2/m64 | C | V/V | AVX512F OR AVX10.1 (note 1) | Move quadword from xmm2/m64 to xmm1.
66 0F D6 /r MOVQ xmm2/m64, xmm1 | B | V/V | SSE2 | Move quadword from xmm1 to xmm2/mem64.
VEX.128.66.0F.WIG D6 /r VMOVQ xmm1/m64, xmm2 | B | V/V | AVX | Move quadword from xmm2 register to xmm1/m64.
EVEX.128.66.0F.W1 D6 /r VMOVQ xmm1/m64, xmm2 | D | V/V | AVX512F OR AVX10.1 (note 1) | Move quadword from xmm2 register to xmm1/m64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Copies a quadword from the source operand (second operand) to the destination operand (first operand). The
source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations.
This instruction can be used to move a quadword between two MMX technology registers or between an MMX tech-
nology register and a 64-bit memory location, or to move data between two XMM registers or between an XMM
register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.
When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM
register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.
In 64-bit mode and if not encoded using VEX/EVEX, use of the REX prefix in the form of REX.R permits this instruc-
tion to access additional registers (XMM8-XMM15).
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Attempting to execute VMOVQ encoded with VEX.L = 1 will cause a #UD exception.
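
As an illustration of the zero-extension behavior described above, the following sketch uses the _mm_move_epi64 intrinsic listed under this instruction (the test value and output formatting are illustrative only):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

/* Sketch: MOVQ between XMM registers zeroes bits 127:64 of the
   destination, which is exactly what _mm_move_epi64 exposes. */
int main(void)
{
    __m128i a = _mm_set1_epi64x((int64_t)0x1122334455667788);
    __m128i lo = _mm_move_epi64(a);      /* MOVQ xmm1, xmm2 form */

    int64_t q[2];
    _mm_storeu_si128((__m128i *)q, lo);
    printf("%llx %llx\n", (unsigned long long)q[0],
           (unsigned long long)q[1]);    /* high quadword prints 0 */
    return 0;
}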



Operation
MOVQ Instruction When Operating on MMX Technology Registers and Memory Locations
DEST := SRC;

MOVQ Instruction When Source and Destination Operands are XMM Registers
DEST[63:0] := SRC[63:0];
DEST[127:64] := 0000000000000000H;

MOVQ Instruction When Source Operand is XMM Register and Destination Operand is Memory Location
DEST := SRC[63:0];

MOVQ Instruction When Source Operand is Memory Location and Destination Operand is XMM Register
DEST[63:0] := SRC;
DEST[127:64] := 0000000000000000H;

VMOVQ (VEX.128.F3.0F 7E) With XMM Register Source and Destination


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (VEX.128.66.0F D6) With XMM Register Source and Destination


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (7E - EVEX Encoded Version) With XMM Register Source and Destination
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (D6 - EVEX Encoded Version) With XMM Register Source and Destination
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (7E) With Memory Source


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (7E - EVEX Encoded Version) With Memory Source


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

VMOVQ (D6) With Memory DEST


DEST[63:0] := SRC2[63:0]

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent


VMOVQ __m128i _mm_loadu_si64( void * s);
VMOVQ void _mm_storeu_si64( void * d, __m128i s);
MOVQ __m128i _mm_move_epi64(__m128i a)



SIMD Floating-Point Exceptions
None.

Other Exceptions
See Table 24-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 3B.



MOVSD—Move or Merge Scalar Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F2 0F 10 /r MOVSD xmm1, xmm2 | A | V/V | SSE2 | Move scalar double precision floating-point value from xmm2 to xmm1 register.
F2 0F 10 /r MOVSD xmm1, m64 | A | V/V | SSE2 | Load scalar double precision floating-point value from m64 to xmm1 register.
F2 0F 11 /r MOVSD xmm1/m64, xmm2 | C | V/V | SSE2 | Move scalar double precision floating-point value from xmm2 register to xmm1/m64.
VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar double precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, m64 | D | V/V | AVX | Load scalar double precision floating-point value from m64 to xmm1 register.
VEX.LIG.F2.0F.WIG 11 /r VMOVSD xmm1, xmm2, xmm3 | E | V/V | AVX | Merge scalar double precision floating-point value from xmm2 and xmm3 registers to xmm1.
VEX.LIG.F2.0F.WIG 11 /r VMOVSD m64, xmm1 | C | V/V | AVX | Store scalar double precision floating-point value from xmm1 register to m64.
EVEX.LLIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F OR AVX10.1 (note 1) | Merge scalar double precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LLIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, m64 | F | V/V | AVX512F OR AVX10.1 (note 1) | Load scalar double precision floating-point value from m64 to xmm1 register under writemask k1.
EVEX.LLIG.F2.0F.W1 11 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F OR AVX10.1 (note 1) | Merge scalar double precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LLIG.F2.0F.W1 11 /r VMOVSD m64 {k1}, xmm1 | G | V/V | AVX512F OR AVX10.1 (note 1) | Store scalar double precision floating-point value from xmm1 register to m64 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
D N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
E N/A ModRM:r/m (w) EVEX.vvvv (r) ModRM:reg (r) N/A
F Tuple1 Scalar ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
G Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Moves a scalar double precision floating-point value from the source operand (second operand) to the destination
operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations.
This instruction can be used to move a double precision floating-point value to and from the low quadword of an
XMM register and a 64-bit memory location, or to move a double precision floating-point value between the low
quadwords of two XMM registers. The instruction cannot be used to transfer data between memory locations.
Legacy version: When the source and destination operands are XMM registers, bits (MAXVL-1:64) of the destination
operand remain unchanged. When the source operand is a memory location and the destination operand is an XMM
register, the quadword at bits 127:64 of the destination operand is cleared to all 0s, and bits (MAXVL-1:128) of the
destination operand remain unchanged.
VEX and EVEX encoded register-register syntax: Moves a scalar double precision floating-point value from the
second source operand (the third operand) to the low quadword element of the destination operand (the first
operand). Bits 127:64 of the destination operand are copied from the first source operand (the second operand).
Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination
operand is an XMM register, bits (MAXVL-1:64) of the destination operand are cleared to all 0s.
EVEX encoded versions: The low quadword of the destination is updated according to the writemask.
Note: For VMOVSD (memory store and load forms), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b;
otherwise the instruction will #UD.
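
A short sketch of the register-register merge semantics described above, using the _mm_move_sd intrinsic listed under this instruction (operand values are illustrative only):

#include <emmintrin.h>
#include <stdio.h>

/* Sketch: the register form of MOVSD merges: the low quadword comes
   from b, the high quadword is preserved from a (_mm_move_sd). */
int main(void)
{
    __m128d a = _mm_set_pd(2.0, 1.0);   /* lo = 1.0, hi = 2.0 */
    __m128d b = _mm_set_pd(4.0, 3.0);   /* lo = 3.0, hi = 4.0 */
    __m128d r = _mm_move_sd(a, b);      /* r = {3.0, 2.0} */

    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);  /* prints 3.0 2.0 */
    return 0;
}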

Operation
VMOVSD (EVEX.LLIG.F2.0F 10 /r: VMOVSD xmm1, m64 With Support for 32 Registers)
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[MAXVL-1:64] := 0

VMOVSD (EVEX.LLIG.F2.0F 11 /r: VMOVSD m64, xmm1 With Support for 32 Registers)
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC[63:0]
ELSE *DEST[63:0] remains unchanged* ; merging-masking
FI;

VMOVSD (EVEX.LLIG.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)


IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC2[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

MOVSD (128-bit Legacy SSE Version: MOVSD xmm1, xmm2)


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)



VMOVSD (VEX.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)
DEST[63:0] := SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMOVSD (VEX.128.F2.0F 10 /r: VMOVSD xmm1, xmm2, xmm3)


DEST[63:0] := SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMOVSD (VEX.128.F2.0F 10 /r: VMOVSD xmm1, m64)


DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0

MOVSD/VMOVSD (128-bit Versions: MOVSD m64, xmm1 or VMOVSD m64, xmm1)


DEST[63:0] := SRC[63:0]

MOVSD (128-bit Legacy SSE Version: MOVSD xmm1, m64)


DEST[63:0] := SRC[63:0]
DEST[127:64] := 0
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMOVSD __m128d _mm_mask_load_sd(__m128d s, __mmask8 k, double * p);
VMOVSD __m128d _mm_maskz_load_sd( __mmask8 k, double * p);
VMOVSD __m128d _mm_mask_move_sd(__m128d sh, __mmask8 k, __m128d sl, __m128d a);
VMOVSD __m128d _mm_maskz_move_sd( __mmask8 k, __m128d s, __m128d a);
VMOVSD void _mm_mask_store_sd(double * p, __mmask8 k, __m128d s);
MOVSD __m128d _mm_load_sd (double *p)
MOVSD void _mm_store_sd (double *p, __m128d a)
MOVSD __m128d _mm_move_sd ( __m128d a, __m128d b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”



MOVSHDUP—Replicate Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F3 0F 16 /r MOVSHDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move odd index single precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 16 /r VMOVSHDUP xmm1, xmm2/m128 | A | V/V | AVX | Move odd index single precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 16 /r VMOVSHDUP ymm1, ymm2/m256 | A | V/V | AVX | Move odd index single precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 16 /r VMOVSHDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move odd index single precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 16 /r VMOVSHDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move odd index single precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 16 /r VMOVSHDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F OR AVX10.1 (note 1) | Move odd index single precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Duplicates odd-indexed single precision floating-point values from the source operand (the second operand) into
adjacent element pairs in the destination operand (the first operand). See Figure 4-3. The source operand is an
XMM, YMM, or ZMM register or a 128-, 256-, or 512-bit memory location; the destination operand is an XMM, YMM,
or ZMM register.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.



SRC:  X7 X6 X5 X4 X3 X2 X1 X0
DEST: X7 X7 X5 X5 X3 X3 X1 X1

Figure 4-3. MOVSHDUP Operation
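
The following sketch reproduces Figure 4-3 at 128-bit width with the _mm_movehdup_ps intrinsic listed under this instruction (SSE3, e.g., compile with -msse3; values are illustrative only):

#include <pmmintrin.h>
#include <stdio.h>

/* Sketch: each odd-indexed element is duplicated into the adjacent
   even slot, giving X3 X3 X1 X1 for a 4-element source. */
int main(void)
{
    __m128 src = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); /* X3..X0 = 4,3,2,1 */
    __m128 dst = _mm_movehdup_ps(src);               /* X3 X3 X1 X1 */

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    /* prints 2 2 4 4 */
    return 0;
}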

Operation
VMOVSHDUP (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] := SRC[63:32]
TMP_SRC[63:32] := SRC[63:32]
TMP_SRC[95:64] := SRC[127:96]
TMP_SRC[127:96] := SRC[127:96]
IF VL >= 256
TMP_SRC[159:128] := SRC[191:160]
TMP_SRC[191:160] := SRC[191:160]
TMP_SRC[223:192] := SRC[255:224]
TMP_SRC[255:224] := SRC[255:224]
FI;
IF VL >= 512
TMP_SRC[287:256] := SRC[319:288]
TMP_SRC[319:288] := SRC[319:288]
TMP_SRC[351:320] := SRC[383:352]
TMP_SRC[383:352] := SRC[383:352]
TMP_SRC[415:384] := SRC[447:416]
TMP_SRC[447:416] := SRC[447:416]
TMP_SRC[479:448] := SRC[511:480]
TMP_SRC[511:480] := SRC[511:480]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VMOVSHDUP (VEX.256 Encoded Version)
DEST[31:0] := SRC[63:32]
DEST[63:32] := SRC[63:32]
DEST[95:64] := SRC[127:96]
DEST[127:96] := SRC[127:96]
DEST[159:128] := SRC[191:160]
DEST[191:160] := SRC[191:160]
DEST[223:192] := SRC[255:224]
DEST[255:224] := SRC[255:224]
DEST[MAXVL-1:256] := 0

VMOVSHDUP (VEX.128 Encoded Version)


DEST[31:0] := SRC[63:32]
DEST[63:32] := SRC[63:32]
DEST[95:64] := SRC[127:96]
DEST[127:96] := SRC[127:96]
DEST[MAXVL-1:128] := 0
MOVSHDUP (128-bit Legacy SSE Version)
DEST[31:0] := SRC[63:32]
DEST[63:32] := SRC[63:32]
DEST[95:64] := SRC[127:96]
DEST[127:96] := SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMOVSHDUP __m512 _mm512_movehdup_ps( __m512 a);
VMOVSHDUP __m512 _mm512_mask_movehdup_ps(__m512 s, __mmask16 k, __m512 a);
VMOVSHDUP __m512 _mm512_maskz_movehdup_ps( __mmask16 k, __m512 a);
VMOVSHDUP __m256 _mm256_mask_movehdup_ps(__m256 s, __mmask8 k, __m256 a);
VMOVSHDUP __m256 _mm256_maskz_movehdup_ps( __mmask8 k, __m256 a);
VMOVSHDUP __m128 _mm_mask_movehdup_ps(__m128 s, __mmask8 k, __m128 a);
VMOVSHDUP __m128 _mm_maskz_movehdup_ps( __mmask8 k, __m128 a);
VMOVSHDUP __m256 _mm256_movehdup_ps (__m256 a);
VMOVSHDUP __m128 _mm_movehdup_ps (__m128 a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MOVSLDUP—Replicate Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F3 0F 12 /r MOVSLDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move even index single precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 12 /r VMOVSLDUP xmm1, xmm2/m128 | A | V/V | AVX | Move even index single precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 12 /r VMOVSLDUP ymm1, ymm2/m256 | A | V/V | AVX | Move even index single precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 12 /r VMOVSLDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move even index single precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 12 /r VMOVSLDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move even index single precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 12 /r VMOVSLDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F OR AVX10.1 (note 1) | Move even index single precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Duplicates even-indexed single precision floating-point values from the source operand (the second operand) into
adjacent element pairs in the destination operand (the first operand). See Figure 4-4. The source operand is an
XMM, YMM, or ZMM register or a 128-, 256-, or 512-bit memory location; the destination operand is an XMM, YMM,
or ZMM register.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.



SRC:  X7 X6 X5 X4 X3 X2 X1 X0
DEST: X6 X6 X4 X4 X2 X2 X0 X0

Figure 4-4. MOVSLDUP Operation
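
A companion sketch to the MOVSHDUP example, reproducing Figure 4-4 at 128-bit width with the _mm_moveldup_ps intrinsic listed under this instruction (SSE3; values are illustrative only; the two instructions are commonly paired in complex-arithmetic kernels):

#include <pmmintrin.h>
#include <stdio.h>

/* Sketch: each even-indexed element is duplicated into the adjacent
   odd slot, giving X2 X2 X0 X0 for a 4-element source. */
int main(void)
{
    __m128 src = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); /* X3..X0 = 4,3,2,1 */
    __m128 dst = _mm_moveldup_ps(src);               /* X2 X2 X0 X0 */

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    /* prints 1 1 3 3 */
    return 0;
}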

Operation
VMOVSLDUP (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] := SRC[31:0]
TMP_SRC[63:32] := SRC[31:0]
TMP_SRC[95:64] := SRC[95:64]
TMP_SRC[127:96] := SRC[95:64]
IF VL >= 256
TMP_SRC[159:128] := SRC[159:128]
TMP_SRC[191:160] := SRC[159:128]
TMP_SRC[223:192] := SRC[223:192]
TMP_SRC[255:224] := SRC[223:192]
FI;
IF VL >= 512
TMP_SRC[287:256] := SRC[287:256]
TMP_SRC[319:288] := SRC[287:256]
TMP_SRC[351:320] := SRC[351:320]
TMP_SRC[383:352] := SRC[351:320]
TMP_SRC[415:384] := SRC[415:384]
TMP_SRC[447:416] := SRC[415:384]
TMP_SRC[479:448] := SRC[479:448]
TMP_SRC[511:480] := SRC[479:448]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VMOVSLDUP (VEX.256 Encoded Version)
DEST[31:0] := SRC[31:0]
DEST[63:32] := SRC[31:0]
DEST[95:64] := SRC[95:64]
DEST[127:96] := SRC[95:64]
DEST[159:128] := SRC[159:128]
DEST[191:160] := SRC[159:128]
DEST[223:192] := SRC[223:192]
DEST[255:224] := SRC[223:192]
DEST[MAXVL-1:256] := 0

VMOVSLDUP (VEX.128 Encoded Version)


DEST[31:0] := SRC[31:0]
DEST[63:32] := SRC[31:0]
DEST[95:64] := SRC[95:64]
DEST[127:96] := SRC[95:64]
DEST[MAXVL-1:128] := 0
MOVSLDUP (128-bit Legacy SSE Version)
DEST[31:0] := SRC[31:0]
DEST[63:32] := SRC[31:0]
DEST[95:64] := SRC[95:64]
DEST[127:96] := SRC[95:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMOVSLDUP __m512 _mm512_moveldup_ps( __m512 a);
VMOVSLDUP __m512 _mm512_mask_moveldup_ps(__m512 s, __mmask16 k, __m512 a);
VMOVSLDUP __m512 _mm512_maskz_moveldup_ps( __mmask16 k, __m512 a);
VMOVSLDUP __m256 _mm256_mask_moveldup_ps(__m256 s, __mmask8 k, __m256 a);
VMOVSLDUP __m256 _mm256_maskz_moveldup_ps( __mmask8 k, __m256 a);
VMOVSLDUP __m128 _mm_mask_moveldup_ps(__m128 s, __mmask8 k, __m128 a);
VMOVSLDUP __m128 _mm_maskz_moveldup_ps( __mmask8 k, __m128 a);
VMOVSLDUP __m256 _mm256_moveldup_ps (__m256 a);
VMOVSLDUP __m128 _mm_moveldup_ps (__m128 a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MOVSS—Move or Merge Scalar Single Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F3 0F 10 /r MOVSS xmm1, xmm2 | A | V/V | SSE | Merge scalar single precision floating-point value from xmm2 to xmm1 register.
F3 0F 10 /r MOVSS xmm1, m32 | A | V/V | SSE | Load scalar single precision floating-point value from m32 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar single precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, m32 | D | V/V | AVX | Load scalar single precision floating-point value from m32 to xmm1 register.
F3 0F 11 /r MOVSS xmm2/m32, xmm1 | C | V/V | SSE | Move scalar single precision floating-point value from xmm1 register to xmm2/m32.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS xmm1, xmm2, xmm3 | E | V/V | AVX | Move scalar single precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS m32, xmm1 | C | V/V | AVX | Move scalar single precision floating-point value from xmm1 register to m32.
EVEX.LLIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F OR AVX10.1 (note 1) | Move scalar single precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LLIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, m32 | F | V/V | AVX512F OR AVX10.1 (note 1) | Move scalar single precision floating-point values from m32 to xmm1 under writemask k1.
EVEX.LLIG.F3.0F.W0 11 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F OR AVX10.1 (note 1) | Move scalar single precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LLIG.F3.0F.W0 11 /r VMOVSS m32 {k1}, xmm1 | G | V/V | AVX512F OR AVX10.1 (note 1) | Move scalar single precision floating-point values from xmm1 to m32 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
D N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
E N/A ModRM:r/m (w) EVEX.vvvv (r) ModRM:reg (r) N/A
F Tuple1 Scalar ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
G Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Moves a scalar single precision floating-point value from the source operand (second operand) to the destination
operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations.
This instruction can be used to move a single precision floating-point value to and from the low doubleword of an
XMM register and a 32-bit memory location, or to move a single precision floating-point value between the low
doublewords of two XMM registers. The instruction cannot be used to transfer data between memory locations.
Legacy version: When the source and destination operands are XMM registers, bits (MAXVL-1:32) of the corre-
sponding destination register are unmodified. When the source operand is a memory location and the destination
operand is an XMM register, bits 127:32 of the destination operand are cleared to all 0s, and bits (MAXVL-1:128) of
the destination operand remain unchanged.
VEX and EVEX encoded register-register syntax: Moves a scalar single precision floating-point value from the
second source operand (the third operand) to the low doubleword element of the destination operand (the first
operand). Bits 127:32 of the destination operand are copied from the first source operand (the second operand).
Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination
operand is an XMM register, bits (MAXVL-1:32) of the destination operand are cleared to all 0s.
EVEX encoded versions: The low doubleword of the destination is updated according to the writemask.
Note: For the memory store form “VMOVSS m32, xmm1”, VEX.vvvv is reserved and must be 1111b, otherwise the
instruction will #UD. For the memory store form “VMOVSS m32 {k1}, xmm1”, EVEX.vvvv is reserved and must be
1111b, otherwise the instruction will #UD.
Software should ensure VMOVSS is encoded with VEX.L=0. Encoding VMOVSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
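
A short sketch contrasting the load form (which zeroes bits 127:32) with the register merge form, using the _mm_load_ss and _mm_move_ss intrinsics listed under this instruction (values are illustrative only):

#include <xmmintrin.h>
#include <stdio.h>

/* Sketch: the MOVSS load form zeroes bits 127:32; the register form
   (_mm_move_ss) instead merges the low dword of the second operand
   into the first. */
int main(void)
{
    float x = 7.0f;
    __m128 ld = _mm_load_ss(&x);                 /* {7, 0, 0, 0} */
    __m128 a  = _mm_set_ps(4.f, 3.f, 2.f, 1.f);  /* {1, 2, 3, 4} */
    __m128 mv = _mm_move_ss(a, ld);              /* {7, 2, 3, 4} */

    float out[4];
    _mm_storeu_ps(out, mv);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}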

Operation
VMOVSS (EVEX.LLIG.F3.0F.W0 10 /r When the Source Operand is Memory and the Destination is an XMM Register)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[MAXVL-1:32] := 0

VMOVSS (EVEX.LLIG.F3.0F.W0 11 /r When the Source Operand is an XMM Register and the Destination is Memory)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC[31:0]
ELSE *DEST[31:0] remains unchanged* ; merging-masking
FI;

VMOVSS (EVEX.LLIG.F3.0F.W0 10/11 /r Where the Source and Destination are XMM Registers)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0



MOVSS (Legacy SSE Version When the Source and Destination Operands are Both XMM Registers)
DEST[31:0] := SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

VMOVSS (VEX.128.F3.0F 11 /r Where the Destination is an XMM Register)


DEST[31:0] := SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VMOVSS (VEX.128.F3.0F 10 /r Where the Source and Destination are XMM Registers)
DEST[31:0] := SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VMOVSS (VEX.128.F3.0F 10 /r When the Source Operand is Memory and the Destination is an XMM Register)
DEST[31:0] := SRC[31:0]
DEST[MAXVL-1:32] := 0

MOVSS/VMOVSS (When the Source Operand is an XMM Register and the Destination is Memory)
DEST[31:0] := SRC[31:0]

MOVSS (Legacy SSE Version when the Source Operand is Memory and the Destination is an XMM Register)
DEST[31:0] := SRC[31:0]
DEST[127:32] := 0
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMOVSS __m128 _mm_mask_load_ss(__m128 s, __mmask8 k, float * p);
VMOVSS __m128 _mm_maskz_load_ss( __mmask8 k, float * p);
VMOVSS __m128 _mm_mask_move_ss(__m128 sh, __mmask8 k, __m128 sl, __m128 a);
VMOVSS __m128 _mm_maskz_move_ss( __mmask8 k, __m128 s, __m128 a);
VMOVSS void _mm_mask_store_ss(float * p, __mmask8 k, __m128 a);
MOVSS __m128 _mm_load_ss(float * p)
MOVSS void _mm_store_ss(float * p, __m128 a)
MOVSS __m128 _mm_move_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”



MOVUPD—Move Unaligned Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 10 /r MOVUPD xmm1, xmm2/m128 | A | V/V | SSE2 | Move unaligned packed double precision floating-point from xmm2/mem to xmm1.
66 0F 11 /r MOVUPD xmm2/m128, xmm1 | B | V/V | SSE2 | Move unaligned packed double precision floating-point from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 10 /r VMOVUPD xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed double precision floating-point from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 11 /r VMOVUPD xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed double precision floating-point from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 10 /r VMOVUPD ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed double precision floating-point from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 11 /r VMOVUPD ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed double precision floating-point from ymm1 to ymm2/mem.
EVEX.128.66.0F.W1 10 /r VMOVUPD xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point from xmm2/m128 to xmm1 using writemask k1.
EVEX.128.66.0F.W1 11 /r VMOVUPD xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W1 10 /r VMOVUPD ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point from ymm2/m256 to ymm1 using writemask k1.
EVEX.256.66.0F.W1 11 /r VMOVUPD ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W1 10 /r VMOVUPD zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.512.66.0F.W1 11 /r VMOVUPD zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F OR AVX10.1 (note 1) | Move unaligned packed double precision floating-point values from zmm1 to zmm2/m512 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Full Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A



Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
EVEX.512 encoded version:
Moves 512 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float64
memory location, or to store the contents of a ZMM register into memory. The destination operand is updated
according to the writemask.
VEX.256 encoded version:
Moves 256 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed double precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte
boundary without causing a general-protection exception (#GP) to be generated.
VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.
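
A minimal sketch of the unaligned-access guarantee described above, using the _mm_loadu_pd/_mm_storeu_pd intrinsics listed under this instruction (the buffer and scale factor are illustrative only):

#include <emmintrin.h>
#include <stdio.h>

/* Sketch: MOVUPD tolerates a possibly misaligned address, whereas the
   aligned form MOVAPD (_mm_load_pd) would #GP on one. */
int main(void)
{
    double buf[4] = {1.0, 2.0, 3.0, 4.0};
    double *p = buf + 1;                 /* may not be 16-byte aligned */

    __m128d v = _mm_loadu_pd(p);         /* MOVUPD xmm, m128 */
    v = _mm_mul_pd(v, _mm_set1_pd(2.0));
    _mm_storeu_pd(p, v);                 /* MOVUPD m128, xmm */

    printf("%f %f\n", buf[1], buf[2]);   /* prints 4.0 6.0 */
    return 0;
}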

Operation
VMOVUPD (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVUPD (EVEX Encoded Versions, Store-Form)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking

FI;
ENDFOR;



VMOVUPD (EVEX Encoded Versions, Load-Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVUPD (VEX.256 Encoded Version, Load - and Register Copy)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVUPD (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVUPD (VEX.128 Encoded Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

MOVUPD (128-bit Load- and Register-Copy- Form Legacy SSE Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVUPD (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent


VMOVUPD __m512d _mm512_loadu_pd( void * s);
VMOVUPD __m512d _mm512_mask_loadu_pd(__m512d a, __mmask8 k, void * s);
VMOVUPD __m512d _mm512_maskz_loadu_pd( __mmask8 k, void * s);
VMOVUPD void _mm512_storeu_pd( void * d, __m512d a);
VMOVUPD void _mm512_mask_storeu_pd( void * d, __mmask8 k, __m512d a);
VMOVUPD __m256d _mm256_mask_loadu_pd(__m256d s, __mmask8 k, void * m);
VMOVUPD __m256d _mm256_maskz_loadu_pd( __mmask8 k, void * m);
VMOVUPD void _mm256_mask_storeu_pd( void * d, __mmask8 k, __m256d a);
VMOVUPD __m128d _mm_mask_loadu_pd(__m128d s, __mmask8 k, void * m);
VMOVUPD __m128d _mm_maskz_loadu_pd( __mmask8 k, void * m);
VMOVUPD void _mm_mask_storeu_pd( void * d, __mmask8 k, __m128d a);
MOVUPD __m256d _mm256_loadu_pd (double * p);
MOVUPD void _mm256_storeu_pd( double *p, __m256d a);
MOVUPD __m128d _mm_loadu_pd (double * p);
MOVUPD void _mm_storeu_pd( double *p, __m128d a);

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions”; note that treatment of #AC varies. Additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



MOVUPS—Move Unaligned Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 10 /r MOVUPS xmm1, xmm2/m128 | A | V/V | SSE | Move unaligned packed single precision floating-point from xmm2/mem to xmm1.
NP 0F 11 /r MOVUPS xmm2/m128, xmm1 | B | V/V | SSE | Move unaligned packed single precision floating-point from xmm1 to xmm2/mem.
VEX.128.0F.WIG 10 /r VMOVUPS xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed single precision floating-point from xmm2/mem to xmm1.
VEX.128.0F.WIG 11 /r VMOVUPS xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed single precision floating-point from xmm1 to xmm2/mem.
VEX.256.0F.WIG 10 /r VMOVUPS ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed single precision floating-point from ymm2/mem to ymm1.
VEX.256.0F.WIG 11 /r VMOVUPS ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed single precision floating-point from ymm1 to ymm2/mem.
EVEX.128.0F.W0 10 /r VMOVUPS xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.0F.W0 10 /r VMOVUPS ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.0F.W0 10 /r VMOVUPS zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.0F.W0 11 /r VMOVUPS xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.0F.W0 11 /r VMOVUPS ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.0F.W0 11 /r VMOVUPS zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F OR AVX10.1 (note 1) | Move unaligned packed single precision floating-point values from zmm1 to zmm2/m512 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
C Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Full Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A



Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
EVEX.512 encoded version:
Moves 512 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32
memory location, or to store the contents of a ZMM register into memory. The destination operand is updated
according to the writemask.
VEX.256 and EVEX.256 encoded versions:
Moves 256 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed single precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
When the source or destination operand is a memory operand, the operand may be unaligned without causing a
general-protection exception (#GP) to be generated.
VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.
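
A sketch of the EVEX masked form, using the _mm512_maskz_loadu_ps intrinsic listed under this instruction (requires AVX-512F at compile time, e.g., -mavx512f, and at run time; the mask value and data are illustrative only):

#include <immintrin.h>
#include <stdio.h>

/* Sketch: load 16 unaligned floats, keeping zeros in every lane whose
   writemask bit is clear (zeroing-masking). */
int main(void)
{
    float src[17];
    for (int i = 0; i < 17; i++) src[i] = (float)i;

    __mmask16 k = 0x00FF;                          /* low 8 lanes only */
    __m512 v = _mm512_maskz_loadu_ps(k, src + 1);  /* VMOVUPS {k1}{z} */

    float out[16];
    _mm512_storeu_ps(out, v);
    printf("%f %f\n", out[0], out[15]);            /* prints 1.0 0.0 */
    return 0;
}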

Operation
VMOVUPS (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVUPS (EVEX Encoded Versions, Store-Form)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR;



VMOVUPS (EVEX Encoded Versions, Load-Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMOVUPS (VEX.256 Encoded Version, Load - and Register Copy)


DEST[255:0] := SRC[255:0]
DEST[MAXVL-1:256] := 0

VMOVUPS (VEX.256 Encoded Version, Store-Form)


DEST[255:0] := SRC[255:0]

VMOVUPS (VEX.128 Encoded Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] := 0

MOVUPS (128-bit Load- and Register-Copy- Form Legacy SSE Version)


DEST[127:0] := SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVUPS (128-bit Store-Form Version)


DEST[127:0] := SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent


VMOVUPS __m512 _mm512_loadu_ps( void * s);
VMOVUPS __m512 _mm512_mask_loadu_ps(__m512 a, __mmask16 k, void * s);
VMOVUPS __m512 _mm512_maskz_loadu_ps( __mmask16 k, void * s);
VMOVUPS void _mm512_storeu_ps( void * d, __m512 a);
VMOVUPS void _mm512_mask_storeu_ps( void * d, __mmask8 k, __m512 a);
VMOVUPS __m256 _mm256_mask_loadu_ps(__m256 a, __mmask8 k, void * s);
VMOVUPS __m256 _mm256_maskz_loadu_ps( __mmask8 k, void * s);
VMOVUPS void _mm256_mask_storeu_ps( void * d, __mmask8 k, __m256 a);
VMOVUPS __m128 _mm_mask_loadu_ps(__m128 a, __mmask8 k, void * s);
VMOVUPS __m128 _mm_maskz_loadu_ps( __mmask8 k, void * s);
VMOVUPS void _mm_mask_storeu_ps( void * d, __mmask8 k, __m128 a);
MOVUPS __m256 _mm256_loadu_ps ( float * p);
MOVUPS void _mm256_storeu_ps( float *p, __m256 a);
MOVUPS __m128 _mm_loadu_ps ( float * p);
MOVUPS void _mm_storeu_ps( float *p, __m128 a);

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions”; note that treatment of #AC varies.
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MULPD—Multiply Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 59 /r MULPD xmm1, xmm2/m128 | A | V/V | SSE2 | Multiply packed double precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.128.66.0F.WIG 59 /r VMULPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Multiply packed double precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.256.66.0F.WIG 59 /r VMULPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Multiply packed double precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.128.66.0F.W1 59 /r VMULPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Multiply packed double precision floating-point values from xmm3/m128/m64bcst to xmm2 and store result in xmm1.
EVEX.256.66.0F.W1 59 /r VMULPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Multiply packed double precision floating-point values from ymm3/m256/m64bcst to ymm2 and store result in ymm1.
EVEX.512.66.0F.W1 59 /r VMULPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | C | V/V | AVX512F OR AVX10.1 (note 1) | Multiply packed double precision floating-point values in zmm3/m512/m64bcst with zmm2 and store result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the packed double precision floating-point values from the first source operand with the corresponding
values in the second source operand, and stores the packed double precision floating-point results in the
destination operand.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcast from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corre-
sponding destination ZMM register are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128)
of the corresponding destination register are zeroed.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
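
A minimal sketch of the element-wise multiply, using the _mm_mul_pd intrinsic listed under this instruction (operand values are illustrative only):

#include <emmintrin.h>
#include <stdio.h>

/* Sketch: MULPD multiplies each double precision lane independently. */
int main(void)
{
    __m128d a = _mm_set_pd(3.0, 2.0);    /* lo = 2.0, hi = 3.0 */
    __m128d b = _mm_set_pd(0.5, 10.0);   /* lo = 10.0, hi = 0.5 */
    __m128d r = _mm_mul_pd(a, b);        /* {20.0, 1.5} */

    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);   /* prints 20.0 1.5 */
    return 0;
}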



Operation
VMULPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] * SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] * SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMULPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] * SRC2[63:0]
DEST[127:64] := SRC1[127:64] * SRC2[127:64]
DEST[191:128] := SRC1[191:128] * SRC2[191:128]
DEST[255:192] := SRC1[255:192] * SRC2[255:192]
DEST[MAXVL-1:256] := 0
VMULPD (VEX.128 Encoded Version)
DEST[63:0] := SRC1[63:0] * SRC2[63:0]
DEST[127:64] := SRC1[127:64] * SRC2[127:64]
DEST[MAXVL-1:128] := 0

MULPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] * SRC[63:0]
DEST[127:64] := DEST[127:64] * SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent
VMULPD __m512d _mm512_mul_pd( __m512d a, __m512d b);
VMULPD __m512d _mm512_mask_mul_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMULPD __m512d _mm512_maskz_mul_pd( __mmask8 k, __m512d a, __m512d b);
VMULPD __m512d _mm512_mul_round_pd( __m512d a, __m512d b, int);
VMULPD __m512d _mm512_mask_mul_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMULPD __m512d _mm512_maskz_mul_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VMULPD __m256d _mm256_mul_pd (__m256d a, __m256d b);
MULPD __m128d _mm_mul_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



MULPS—Multiply Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 59 /r MULPS xmm1, xmm2/m128 | A | V/V | SSE | Multiply packed single precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.128.0F.WIG 59 /r VMULPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Multiply packed single precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.256.0F.WIG 59 /r VMULPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Multiply packed single precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.128.0F.W0 59 /r VMULPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Multiply packed single precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1.
EVEX.256.0F.W0 59 /r VMULPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (note 1) | Multiply packed single precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1.
EVEX.512.0F.W0 59 /r VMULPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er} | C | V/V | AVX512F OR AVX10.1 (note 1) | Multiply packed single precision floating-point values in zmm3/m512/m32bcst with zmm2 and store result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the packed single precision floating-point values from the first source operand with the corresponding
values in the second source operand, and stores the packed single precision floating-point results in the destina-
tion operand.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcast from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corre-
sponding destination ZMM register are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128)
of the corresponding destination register are zeroed.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
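
A sketch of the VEX.256 form, scaling eight floats at once with the _mm256_mul_ps intrinsic listed under this instruction (requires AVX, e.g., -mavx; data and scale factor are illustrative only):

#include <immintrin.h>
#include <stdio.h>

/* Sketch: VMULPS ymm form, one multiply per 32-bit lane. */
int main(void)
{
    float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];

    __m256 v = _mm256_loadu_ps(in);
    __m256 r = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));
    _mm256_storeu_ps(out, r);

    printf("%f %f\n", out[0], out[7]);   /* prints 0.5 4.0 */
    return 0;
}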



Operation
VMULPS (EVEX Encoded Version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] * SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] * SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VMULPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] * SRC2[31:0]
DEST[63:32] := SRC1[63:32] * SRC2[63:32]
DEST[95:64] := SRC1[95:64] * SRC2[95:64]
DEST[127:96] := SRC1[127:96] * SRC2[127:96]
DEST[159:128] := SRC1[159:128] * SRC2[159:128]
DEST[191:160] := SRC1[191:160] * SRC2[191:160]
DEST[223:192] := SRC1[223:192] * SRC2[223:192]
DEST[255:224] := SRC1[255:224] * SRC2[255:224]
DEST[MAXVL-1:256] := 0;

VMULPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] * SRC2[31:0]
DEST[63:32] := SRC1[63:32] * SRC2[63:32]
DEST[95:64] := SRC1[95:64] * SRC2[95:64]
DEST[127:96] := SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] := 0

MULPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[31:0] * SRC2[31:0]
DEST[63:32] := SRC1[63:32] * SRC2[63:32]
DEST[95:64] := SRC1[95:64] * SRC2[95:64]
DEST[127:96] := SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent
VMULPS __m512 _mm512_mul_ps( __m512 a, __m512 b);
VMULPS __m512 _mm512_mask_mul_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VMULPS __m512 _mm512_maskz_mul_ps(__mmask16 k, __m512 a, __m512 b);
VMULPS __m512 _mm512_mul_round_ps( __m512 a, __m512 b, int);
VMULPS __m512 _mm512_mask_mul_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMULPS __m512 _mm512_maskz_mul_round_ps(__mmask16 k, __m512 a, __m512 b, int);
VMULPS __m256 _mm256_mask_mul_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VMULPS __m256 _mm256_maskz_mul_ps(__mmask8 k, __m256 a, __m256 b);
VMULPS __m128 _mm_mask_mul_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMULPS __m128 _mm_maskz_mul_ps(__mmask8 k, __m128 a, __m128 b);
VMULPS __m256 _mm256_mul_ps (__m256 a, __m256 b);
MULPS __m128 _mm_mul_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



MULSD—Multiply Scalar Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 59 /r MULSD xmm1, xmm2/m64 | A | V/V | SSE2 | Multiply the low double precision floating-point value in xmm2/m64 by the low double precision floating-point value in xmm1.
VEX.LIG.F2.0F.WIG 59 /r VMULSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Multiply the low double precision floating-point value in xmm3/m64 by the low double precision floating-point value in xmm2.
EVEX.LLIG.F2.0F.W1 59 /r VMULSD xmm1 {k1}{z}, xmm2, xmm3/m64 {er} | C | V/V | AVX512F OR AVX10.1 (see Note 1) | Multiply the low double precision floating-point value in xmm3/m64 by the low double precision floating-point value in xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the low double precision floating-point value in the second source operand by the low double precision floating-point value in the first source operand, and stores the double precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The quadword at bits 127:64 of the destination operand is copied from the same bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the writemask.
Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
VMULSD (EVEX Encoded Version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC1[63:0] * SRC2[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VMULSD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] * SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

MULSD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] * SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMULSD __m128d _mm_mask_mul_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMULSD __m128d _mm_maskz_mul_sd( __mmask8 k, __m128d a, __m128d b);
VMULSD __m128d _mm_mul_round_sd( __m128d a, __m128d b, int);
VMULSD __m128d _mm_mask_mul_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMULSD __m128d _mm_maskz_mul_round_sd( __mmask8 k, __m128d a, __m128d b, int);
MULSD __m128d _mm_mul_sd (__m128d a, __m128d b)
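
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h>). It shows that only the low double precision element is multiplied while bits 127:64 of the destination come from the first source:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; requires SSE2. */
    __m128d a = _mm_set_pd(99.0, 3.0);  /* low element 3.0, high element 99.0 */
    __m128d b = _mm_set_pd(-1.0, 2.0);  /* only the low element (2.0) is used */
    double out[2];
    _mm_storeu_pd(out, _mm_mul_sd(a, b));
    printf("low = %g, high = %g\n", out[0], out[1]); /* low = 6, high = 99 */
    return 0;
}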

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

MULSS—Multiply Scalar Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 59 /r MULSS xmm1, xmm2/m32 | A | V/V | SSE | Multiply the low single precision floating-point value in xmm2/m32 by the low single precision floating-point value in xmm1.
VEX.LIG.F3.0F.WIG 59 /r VMULSS xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Multiply the low single precision floating-point value in xmm3/m32 by the low single precision floating-point value in xmm2.
EVEX.LLIG.F3.0F.W0 59 /r VMULSS xmm1 {k1}{z}, xmm2, xmm3/m32 {er} | C | V/V | AVX512F OR AVX10.1 (see Note 1) | Multiply the low single precision floating-point value in xmm3/m32 by the low single precision floating-point value in xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the low single precision floating-point value from the second source operand by the low single precision floating-point value in the first source operand, and stores the single precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The first source operand is an XMM register encoded by VEX.vvvv. The three high-order doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the writemask.
Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
VMULSS (EVEX Encoded Version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC1[31:0] * SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VMULSS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] * SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

MULSS (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] * SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VMULSS __m128 _mm_mask_mul_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMULSS __m128 _mm_maskz_mul_ss( __mmask8 k, __m128 a, __m128 b);
VMULSS __m128 _mm_mul_round_ss( __m128 a, __m128 b, int);
VMULSS __m128 _mm_mask_mul_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMULSS __m128 _mm_maskz_mul_round_ss( __mmask8 k, __m128 a, __m128 b, int);
MULSS __m128 _mm_mul_ss(__m128 a, __m128 b)
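
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h>). Only the low single precision lane is multiplied; the three high lanes of the result are taken from the first operand:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; requires SSE. */
    __m128 a = _mm_set_ps(7.0f, 6.0f, 5.0f, 3.0f); /* low lane 3.0 */
    __m128 b = _mm_set_ps(0.0f, 0.0f, 0.0f, 2.0f); /* low lane 2.0 */
    float out[4];
    _mm_storeu_ps(out, _mm_mul_ss(a, b));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 6 5 6 7 */
    return 0;
}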

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

ORPD—Bitwise Logical OR of Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 56 /r ORPD xmm1, xmm2/m128 | A | V/V | SSE2 | Return the bitwise logical OR of packed double precision floating-point values in xmm1 and xmm2/mem.
VEX.128.66.0F 56 /r VORPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the bitwise logical OR of packed double precision floating-point values in xmm2 and xmm3/mem.
VEX.256.66.0F 56 /r VORPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the bitwise logical OR of packed double precision floating-point values in ymm2 and ymm3/mem.
EVEX.128.66.0F.W1 56 /r VORPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed double precision floating-point values in xmm2 and xmm3/m128/m64bcst subject to writemask k1.
EVEX.256.66.0F.W1 56 /r VORPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed double precision floating-point values in ymm2 and ymm3/m256/m64bcst subject to writemask k1.
EVEX.512.66.0F.W1 56 /r VORPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512DQ OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed double precision floating-point values in zmm2 and zmm3/m512/m64bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical OR of the two, four, or eight packed double precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation
VORPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] BITWISE OR SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] BITWISE OR SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VORPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[191:128] := SRC1[191:128] BITWISE OR SRC2[191:128]
DEST[255:192] := SRC1[255:192] BITWISE OR SRC2[255:192]
DEST[MAXVL-1:256] := 0

VORPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[MAXVL-1:128] := 0

ORPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] BITWISE OR SRC[63:0]
DEST[127:64] := DEST[127:64] BITWISE OR SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VORPD __m512d _mm512_or_pd ( __m512d a, __m512d b);
VORPD __m512d _mm512_mask_or_pd ( __m512d s, __mmask8 k, __m512d a, __m512d b);
VORPD __m512d _mm512_maskz_or_pd (__mmask8 k, __m512d a, __m512d b);
VORPD __m256d _mm256_mask_or_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VORPD __m256d _mm256_maskz_or_pd (__mmask8 k, __m256d a, __m256d b);
VORPD __m128d _mm_mask_or_pd ( __m128d s, __mmask8 k, __m128d a, __m128d b);
VORPD __m128d _mm_maskz_or_pd (__mmask8 k, __m128d a, __m128d b);
VORPD __m256d _mm256_or_pd (__m256d a, __m256d b);
ORPD __m128d _mm_or_pd (__m128d a, __m128d b);
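
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h>). A common use of the bitwise OR on floating-point data is forcing the sign bit, and no SIMD floating-point exceptions are raised:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; requires SSE2. */
    __m128d v    = _mm_set_pd(2.5, -1.0);
    __m128d sign = _mm_set1_pd(-0.0);        /* only bit 63 of each element is set */
    double out[2];
    _mm_storeu_pd(out, _mm_or_pd(v, sign));  /* both elements forced negative */
    printf("%g %g\n", out[0], out[1]);       /* -1 -2.5 */
    return 0;
}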

SIMD Floating-Point Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

ORPS—Bitwise Logical OR of Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 56 /r ORPS xmm1, xmm2/m128 | A | V/V | SSE | Return the bitwise logical OR of packed single precision floating-point values in xmm1 and xmm2/mem.
VEX.128.0F 56 /r VORPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the bitwise logical OR of packed single precision floating-point values in xmm2 and xmm3/mem.
VEX.256.0F 56 /r VORPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the bitwise logical OR of packed single precision floating-point values in ymm2 and ymm3/mem.
EVEX.128.0F.W0 56 /r VORPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed single precision floating-point values in xmm2 and xmm3/m128/m32bcst subject to writemask k1.
EVEX.256.0F.W0 56 /r VORPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed single precision floating-point values in ymm2 and ymm3/m256/m32bcst subject to writemask k1.
EVEX.512.0F.W0 56 /r VORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512DQ OR AVX10.1 (see Note 1) | Return the bitwise logical OR of packed single precision floating-point values in zmm2 and zmm3/m512/m32bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical OR of the four, eight, or sixteen packed single precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation
VORPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] BITWISE OR SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] BITWISE OR SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VORPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[159:128] := SRC1[159:128] BITWISE OR SRC2[159:128]
DEST[191:160] := SRC1[191:160] BITWISE OR SRC2[191:160]
DEST[223:192] := SRC1[223:192] BITWISE OR SRC2[223:192]
DEST[255:224] := SRC1[255:224] BITWISE OR SRC2[255:224]
DEST[MAXVL-1:256] := 0

VORPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[MAXVL-1:128] := 0

ORPS (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] BITWISE OR SRC[31:0]
DEST[63:32] := DEST[63:32] BITWISE OR SRC[63:32]
DEST[95:64] := DEST[95:64] BITWISE OR SRC[95:64]
DEST[127:96] := DEST[127:96] BITWISE OR SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VORPS __m512 _mm512_or_ps ( __m512 a, __m512 b);
VORPS __m512 _mm512_mask_or_ps ( __m512 s, __mmask16 k, __m512 a, __m512 b);
VORPS __m512 _mm512_maskz_or_ps (__mmask16 k, __m512 a, __m512 b);
VORPS __m256 _mm256_mask_or_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VORPS __m256 _mm256_maskz_or_ps (__mmask8 k, __m256 a, __m256 b);
VORPS __m128 _mm_mask_or_ps ( __m128 s, __mmask8 k, __m128 a, __m128 b);
VORPS __m128 _mm_maskz_or_ps (__mmask8 k, __m128 a, __m128 b);
VORPS __m256 _mm256_or_ps (__m256 a, __m256 b);
ORPS __m128 _mm_or_ps (__m128 a, __m128 b);
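
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h>). ORPS combined with ANDPS/ANDNPS implements the classic element select (blend) used before SSE4.1 introduced BLENDVPS:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; requires SSE2 for the integer mask constant. */
    __m128 a    = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b    = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    /* All-ones lanes select a; all-zeros lanes select b. */
    __m128 mask = _mm_castsi128_ps(_mm_set_epi32(-1, 0, -1, 0));
    __m128 r    = _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 10 2 30 4 */
    return 0;
}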

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PABSB/PABSW/PABSD/PABSQ—Packed Absolute Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 38 1C /r (see Note 1) PABSB mm1, mm2/m64 | A | V/V | SSSE3 | Compute the absolute value of bytes in mm2/m64 and store UNSIGNED result in mm1.
66 0F 38 1C /r PABSB xmm1, xmm2/m128 | A | V/V | SSSE3 | Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.
NP 0F 38 1D /r (see Note 1) PABSW mm1, mm2/m64 | A | V/V | SSSE3 | Compute the absolute value of 16-bit integers in mm2/m64 and store UNSIGNED result in mm1.
66 0F 38 1D /r PABSW xmm1, xmm2/m128 | A | V/V | SSSE3 | Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
NP 0F 38 1E /r (see Note 1) PABSD mm1, mm2/m64 | A | V/V | SSSE3 | Compute the absolute value of 32-bit integers in mm2/m64 and store UNSIGNED result in mm1.
66 0F 38 1E /r PABSD xmm1, xmm2/m128 | A | V/V | SSSE3 | Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
VEX.128.66.0F38.WIG 1C /r VPABSB xmm1, xmm2/m128 | A | V/V | AVX | Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.
VEX.128.66.0F38.WIG 1D /r VPABSW xmm1, xmm2/m128 | A | V/V | AVX | Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
VEX.128.66.0F38.WIG 1E /r VPABSD xmm1, xmm2/m128 | A | V/V | AVX | Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
VEX.256.66.0F38.WIG 1C /r VPABSB ymm1, ymm2/m256 | A | V/V | AVX2 | Compute the absolute value of bytes in ymm2/m256 and store UNSIGNED result in ymm1.
VEX.256.66.0F38.WIG 1D /r VPABSW ymm1, ymm2/m256 | A | V/V | AVX2 | Compute the absolute value of 16-bit integers in ymm2/m256 and store UNSIGNED result in ymm1.
VEX.256.66.0F38.WIG 1E /r VPABSD ymm1, ymm2/m256 | A | V/V | AVX2 | Compute the absolute value of 32-bit integers in ymm2/m256 and store UNSIGNED result in ymm1.
EVEX.128.66.0F38.WIG 1C /r VPABSB xmm1 {k1}{z}, xmm2/m128 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1 using writemask k1.
EVEX.256.66.0F38.WIG 1C /r VPABSB ymm1 {k1}{z}, ymm2/m256 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Compute the absolute value of bytes in ymm2/m256 and store UNSIGNED result in ymm1 using writemask k1.
EVEX.512.66.0F38.WIG 1C /r VPABSB zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512BW OR AVX10.1 (see Note 2) | Compute the absolute value of bytes in zmm2/m512 and store UNSIGNED result in zmm1 using writemask k1.
EVEX.128.66.0F38.WIG 1D /r VPABSW xmm1 {k1}{z}, xmm2/m128 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1 using writemask k1.
EVEX.256.66.0F38.WIG 1D /r VPABSW ymm1 {k1}{z}, ymm2/m256 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Compute the absolute value of 16-bit integers in ymm2/m256 and store UNSIGNED result in ymm1 using writemask k1.
EVEX.512.66.0F38.WIG 1D /r VPABSW zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512BW OR AVX10.1 (see Note 2) | Compute the absolute value of 16-bit integers in zmm2/m512 and store UNSIGNED result in zmm1 using writemask k1.

EVEX.128.66.0F38.W0 1E /r VPABSD xmm1 {k1}{z}, xmm2/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 2) | Compute the absolute value of 32-bit integers in xmm2/m128/m32bcst and store UNSIGNED result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 1E /r VPABSD ymm1 {k1}{z}, ymm2/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 2) | Compute the absolute value of 32-bit integers in ymm2/m256/m32bcst and store UNSIGNED result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 1E /r VPABSD zmm1 {k1}{z}, zmm2/m512/m32bcst | C | V/V | AVX512F OR AVX10.1 (see Note 2) | Compute the absolute value of 32-bit integers in zmm2/m512/m32bcst and store UNSIGNED result in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 1F /r VPABSQ xmm1 {k1}{z}, xmm2/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 2) | Compute the absolute value of 64-bit integers in xmm2/m128/m64bcst and store UNSIGNED result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 1F /r VPABSQ ymm1 {k1}{z}, ymm2/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 2) | Compute the absolute value of 64-bit integers in ymm2/m256/m64bcst and store UNSIGNED result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 1F /r VPABSQ zmm1 {k1}{z}, zmm2/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 (see Note 2) | Compute the absolute value of 64-bit integers in zmm2/m512/m64bcst and store UNSIGNED result in zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
PABSB/W/D computes the absolute value of each data element of the source operand (the second operand) and stores the UNSIGNED results in the destination operand (the first operand). PABSB operates on signed bytes, PABSW operates on signed 16-bit words, and PABSD operates on signed 32-bit integers; the EVEX-encoded VPABSQ additionally operates on signed 64-bit integers.
EVEX encoded VPABSD/Q: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
EVEX encoded VPABSB/W: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded versions: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding register destination are zeroed.
VEX.128 encoded versions: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.

128-bit Legacy SSE version: The source operand can be an XMM register or a 128-bit memory location. The destination is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Operation
PABSB With 64-bit Operands:
Unsigned DEST[7:0] := ABS(SRC[7:0])
Repeat operation for 2nd through 7th bytes
Unsigned DEST[63:56] := ABS(SRC[63:56])

PABSB With 128-bit Operands:


Unsigned DEST[7:0] := ABS(SRC[7:0])
Repeat operation for 2nd through 15th bytes
Unsigned DEST[127:120] := ABS(SRC[127:120])

VPABSB With 128-bit Operands:


Unsigned DEST[7:0] := ABS(SRC[7:0])
Repeat operation for 2nd through 15th bytes
Unsigned DEST[127:120] := ABS(SRC[127:120])

VPABSB With 256-bit Operands:


Unsigned DEST[7:0] := ABS(SRC[7:0])
Repeat operation for 2nd through 31st bytes
Unsigned DEST[255:248] := ABS(SRC[255:248])

VPABSB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN
Unsigned DEST[i+7:i] := ABS(SRC[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PABSW With 128-bit Operands:


Unsigned DEST[15:0] := ABS(SRC[15:0])
Repeat operation for 2nd through 7th 16-bit words
Unsigned DEST[127:112] := ABS(SRC[127:112])

VPABSW With 128-bit Operands:


Unsigned DEST[15:0] := ABS(SRC[15:0])
Repeat operation for 2nd through 7th 16-bit words
Unsigned DEST[127:112] := ABS(SRC[127:112])

VPABSW With 256-bit Operands:
Unsigned DEST[15:0] := ABS(SRC[15:0])
Repeat operation for 2nd through 15th 16-bit words
Unsigned DEST[255:240] := ABS(SRC[255:240])

VPABSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
Unsigned DEST[i+15:i] := ABS(SRC[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PABSD With 128-bit Operands:


Unsigned DEST[31:0] := ABS(SRC[31:0])
Repeat operation for 2nd through 3rd 32-bit double words
Unsigned DEST[127:96] := ABS(SRC[127:96])

VPABSD With 128-bit Operands:


Unsigned DEST[31:0] := ABS(SRC[31:0])
Repeat operation for 2nd through 3rd 32-bit double words
Unsigned DEST[127:96] := ABS(SRC[127:96])

VPABSD With 256-bit Operands:


Unsigned DEST[31:0] := ABS(SRC[31:0])
Repeat operation for 2nd through 7th 32-bit double words
Unsigned DEST[255:224] := ABS(SRC[255:224])

VPABSD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN
Unsigned DEST[i+31:i] := ABS(SRC[31:0])
ELSE
Unsigned DEST[i+31:i] := ABS(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*

ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPABSQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN
Unsigned DEST[i+63:i] := ABS(SRC[63:0])
ELSE
Unsigned DEST[i+63:i] := ABS(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPABSB __m512i _mm512_abs_epi8 ( __m512i a)
VPABSW __m512i _mm512_abs_epi16 ( __m512i a)
VPABSB __m512i _mm512_mask_abs_epi8 ( __m512i s, __mmask64 m, __m512i a)
VPABSW __m512i _mm512_mask_abs_epi16 ( __m512i s, __mmask32 m, __m512i a)
VPABSB __m512i _mm512_maskz_abs_epi8 (__mmask64 m, __m512i a)
VPABSW __m512i _mm512_maskz_abs_epi16 (__mmask32 m, __m512i a)
VPABSB __m256i _mm256_mask_abs_epi8 (__m256i s, __mmask32 m, __m256i a)
VPABSW __m256i _mm256_mask_abs_epi16 (__m256i s, __mmask16 m, __m256i a)
VPABSB __m256i _mm256_maskz_abs_epi8 (__mmask32 m, __m256i a)
VPABSW __m256i _mm256_maskz_abs_epi16 (__mmask16 m, __m256i a)
VPABSB __m128i _mm_mask_abs_epi8 (__m128i s, __mmask16 m, __m128i a)
VPABSW __m128i _mm_mask_abs_epi16 (__m128i s, __mmask8 m, __m128i a)
VPABSB __m128i _mm_maskz_abs_epi8 (__mmask16 m, __m128i a)
VPABSW __m128i _mm_maskz_abs_epi16 (__mmask8 m, __m128i a)
VPABSD __m256i _mm256_mask_abs_epi32(__m256i s, __mmask8 k, __m256i a);
VPABSD __m256i _mm256_maskz_abs_epi32( __mmask8 k, __m256i a);
VPABSD __m128i _mm_mask_abs_epi32(__m128i s, __mmask8 k, __m128i a);
VPABSD __m128i _mm_maskz_abs_epi32( __mmask8 k, __m128i a);
VPABSD __m512i _mm512_abs_epi32( __m512i a);
VPABSD __m512i _mm512_mask_abs_epi32(__m512i s, __mmask16 k, __m512i a);
VPABSD __m512i _mm512_maskz_abs_epi32( __mmask16 k, __m512i a);
VPABSQ __m512i _mm512_abs_epi64( __m512i a);
VPABSQ __m512i _mm512_mask_abs_epi64(__m512i s, __mmask8 k, __m512i a);

VPABSQ __m512i _mm512_maskz_abs_epi64( __mmask8 k, __m512i a);
VPABSQ __m256i _mm256_mask_abs_epi64(__m256i s, __mmask8 k, __m256i a);
VPABSQ __m256i _mm256_maskz_abs_epi64( __mmask8 k, __m256i a);
VPABSQ __m128i _mm_mask_abs_epi64(__m128i s, __mmask8 k, __m128i a);
VPABSQ __m128i _mm_maskz_abs_epi64( __mmask8 k, __m128i a);
PABSB __m128i _mm_abs_epi8 (__m128i a)
VPABSB __m128i _mm_abs_epi8 (__m128i a)
VPABSB __m256i _mm256_abs_epi8 (__m256i a)
PABSW __m128i _mm_abs_epi16 (__m128i a)
VPABSW __m128i _mm_abs_epi16 (__m128i a)
VPABSW __m256i _mm256_abs_epi16 (__m256i a)
PABSD __m128i _mm_abs_epi32 (__m128i a)
VPABSD __m128i _mm_abs_epi32 (__m128i a)
VPABSD __m256i _mm256_abs_epi32 (__m256i a)
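
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h> and SSSE3 support). Because the result is UNSIGNED, the absolute value of the most negative input is representable rather than wrapping:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; _mm_abs_epi32 maps to PABSD (SSSE3). */
    __m128i v = _mm_set_epi32(-2147483647 - 1, -7, 0, 42); /* highest lane is INT32_MIN */
    unsigned int out[4];
    _mm_storeu_si128((__m128i *)out, _mm_abs_epi32(v));
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]); /* 42 0 7 2147483648 */
    return 0;
}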

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPABSD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPABSB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PACKSSWB/PACKSSDW—Pack With Signed Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 63 /r (see Note 1) PACKSSWB mm1, mm2/m64 | A | V/V | MMX | Converts 4 packed signed word integers from mm1 and from mm2/m64 into 8 packed signed byte integers in mm1 using signed saturation.
66 0F 63 /r PACKSSWB xmm1, xmm2/m128 | A | V/V | SSE2 | Converts 8 packed signed word integers from xmm1 and from xmm2/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
NP 0F 6B /r (see Note 1) PACKSSDW mm1, mm2/m64 | A | V/V | MMX | Converts 2 packed signed doubleword integers from mm1 and from mm2/m64 into 4 packed signed word integers in mm1 using signed saturation.
66 0F 6B /r PACKSSDW xmm1, xmm2/m128 | A | V/V | SSE2 | Converts 4 packed signed doubleword integers from xmm1 and from xmm2/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.128.66.0F.WIG 63 /r VPACKSSWB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Converts 8 packed signed word integers from xmm2 and from xmm3/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
VEX.128.66.0F.WIG 6B /r VPACKSSDW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Converts 4 packed signed doubleword integers from xmm2 and from xmm3/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.256.66.0F.WIG 63 /r VPACKSSWB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Converts 16 packed signed word integers from ymm2 and from ymm3/m256 into 32 packed signed byte integers in ymm1 using signed saturation.
VEX.256.66.0F.WIG 6B /r VPACKSSDW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Converts 8 packed signed doubleword integers from ymm2 and from ymm3/m256 into 16 packed signed word integers in ymm1 using signed saturation.
EVEX.128.66.0F.WIG 63 /r VPACKSSWB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Converts packed signed word integers from xmm2 and from xmm3/m128 into packed signed byte integers in xmm1 using signed saturation under writemask k1.
EVEX.256.66.0F.WIG 63 /r VPACKSSWB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Converts packed signed word integers from ymm2 and from ymm3/m256 into packed signed byte integers in ymm1 using signed saturation under writemask k1.
EVEX.512.66.0F.WIG 63 /r VPACKSSWB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (see Note 2) | Converts packed signed word integers from zmm2 and from zmm3/m512 into packed signed byte integers in zmm1 using signed saturation under writemask k1.
EVEX.128.66.0F.W0 6B /r VPACKSSDW xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Converts packed signed doubleword integers from xmm2 and from xmm3/m128/m32bcst into packed signed word integers in xmm1 using signed saturation under writemask k1.

EVEX.256.66.0F.W0 6B /r VPACKSSDW ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 2) | Converts packed signed doubleword integers from ymm2 and from ymm3/m256/m32bcst into packed signed word integers in ymm1 using signed saturation under writemask k1.
EVEX.512.66.0F.W0 6B /r VPACKSSDW zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | D | V/V | AVX512BW OR AVX10.1 (see Note 2) | Converts packed signed doubleword integers from zmm2 and from zmm3/m512/m32bcst into packed signed word integers in zmm1 using signed saturation under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts packed signed word integers into packed signed byte integers (PACKSSWB) or converts packed signed doubleword integers into packed signed word integers (PACKSSDW), using saturation to handle overflow conditions. See Figure 4-6 for an example of the packing operation.

Figure 4-6. Operation of the PACKSSDW Instruction Using 64-Bit Operands (figure not reproduced: the doublewords D, C of the 64-bit SRC and B, A of the 64-bit DEST are saturated to the words D’, C’, B’, A’ of the 64-bit DEST)

PACKSSWB converts packed signed word integers in the first and second source operands into packed signed byte integers using signed saturation to handle overflow conditions beyond the range of signed byte integers. If the signed word value is greater than 7FH (127) or less than 80H (-128), the saturated signed byte integer value of 7FH or 80H, respectively, is stored in the destination. PACKSSDW converts packed signed doubleword integers in the first and second source operands into packed signed word integers using signed saturation to handle overflow conditions beyond the range 8000H (-32768) to 7FFFH (32767).
EVEX encoded PACKSSWB: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under writemask k1.

EVEX encoded PACKSSDW: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Operation
PACKSSWB Instruction (128-bit Legacy SSE Version)
DEST[7:0] := SaturateSignedWordToSignedByte (DEST[15:0]);
DEST[15:8] := SaturateSignedWordToSignedByte (DEST[31:16]);
DEST[23:16] := SaturateSignedWordToSignedByte (DEST[47:32]);
DEST[31:24] := SaturateSignedWordToSignedByte (DEST[63:48]);
DEST[39:32] := SaturateSignedWordToSignedByte (DEST[79:64]);
DEST[47:40] := SaturateSignedWordToSignedByte (DEST[95:80]);
DEST[55:48] := SaturateSignedWordToSignedByte (DEST[111:96]);
DEST[63:56] := SaturateSignedWordToSignedByte (DEST[127:112]);
DEST[71:64] := SaturateSignedWordToSignedByte (SRC[15:0]);
DEST[79:72] := SaturateSignedWordToSignedByte (SRC[31:16]);
DEST[87:80] := SaturateSignedWordToSignedByte (SRC[47:32]);
DEST[95:88] := SaturateSignedWordToSignedByte (SRC[63:48]);
DEST[103:96] := SaturateSignedWordToSignedByte (SRC[79:64]);
DEST[111:104] := SaturateSignedWordToSignedByte (SRC[95:80]);
DEST[119:112] := SaturateSignedWordToSignedByte (SRC[111:96]);
DEST[127:120] := SaturateSignedWordToSignedByte (SRC[127:112]);
DEST[MAXVL-1:128] (Unmodified)

PACKSSDW Instruction (128-bit Legacy SSE Version)


DEST[15:0] := SaturateSignedDwordToSignedWord (DEST[31:0]);
DEST[31:16] := SaturateSignedDwordToSignedWord (DEST[63:32]);
DEST[47:32] := SaturateSignedDwordToSignedWord (DEST[95:64]);
DEST[63:48] := SaturateSignedDwordToSignedWord (DEST[127:96]);
DEST[79:64] := SaturateSignedDwordToSignedWord (SRC[31:0]);
DEST[95:80] := SaturateSignedDwordToSignedWord (SRC[63:32]);
DEST[111:96] := SaturateSignedDwordToSignedWord (SRC[95:64]);
DEST[127:112] := SaturateSignedDwordToSignedWord (SRC[127:96]);
DEST[MAXVL-1:128] (Unmodified)

VPACKSSWB Instruction (VEX.128 Encoded Version)
DEST[7:0] := SaturateSignedWordToSignedByte (SRC1[15:0]);
DEST[15:8] := SaturateSignedWordToSignedByte (SRC1[31:16]);
DEST[23:16] := SaturateSignedWordToSignedByte (SRC1[47:32]);
DEST[31:24] := SaturateSignedWordToSignedByte (SRC1[63:48]);
DEST[39:32] := SaturateSignedWordToSignedByte (SRC1[79:64]);
DEST[47:40] := SaturateSignedWordToSignedByte (SRC1[95:80]);
DEST[55:48] := SaturateSignedWordToSignedByte (SRC1[111:96]);
DEST[63:56] := SaturateSignedWordToSignedByte (SRC1[127:112]);
DEST[71:64] := SaturateSignedWordToSignedByte (SRC2[15:0]);
DEST[79:72] := SaturateSignedWordToSignedByte (SRC2[31:16]);
DEST[87:80] := SaturateSignedWordToSignedByte (SRC2[47:32]);
DEST[95:88] := SaturateSignedWordToSignedByte (SRC2[63:48]);
DEST[103:96] := SaturateSignedWordToSignedByte (SRC2[79:64]);
DEST[111:104] := SaturateSignedWordToSignedByte (SRC2[95:80]);
DEST[119:112] := SaturateSignedWordToSignedByte (SRC2[111:96]);
DEST[127:120] := SaturateSignedWordToSignedByte (SRC2[127:112]);
DEST[MAXVL-1:128] := 0;

VPACKSSDW Instruction (VEX.128 Encoded Version)


DEST[15:0] := SaturateSignedDwordToSignedWord (SRC1[31:0]);
DEST[31:16] := SaturateSignedDwordToSignedWord (SRC1[63:32]);
DEST[47:32] := SaturateSignedDwordToSignedWord (SRC1[95:64]);
DEST[63:48] := SaturateSignedDwordToSignedWord (SRC1[127:96]);
DEST[79:64] := SaturateSignedDwordToSignedWord (SRC2[31:0]);
DEST[95:80] := SaturateSignedDwordToSignedWord (SRC2[63:32]);
DEST[111:96] := SaturateSignedDwordToSignedWord (SRC2[95:64]);
DEST[127:112] := SaturateSignedDwordToSignedWord (SRC2[127:96]);
DEST[MAXVL-1:128] := 0;

VPACKSSWB Instruction (VEX.256 Encoded Version)


DEST[7:0] := SaturateSignedWordToSignedByte (SRC1[15:0]);
DEST[15:8] := SaturateSignedWordToSignedByte (SRC1[31:16]);
DEST[23:16] := SaturateSignedWordToSignedByte (SRC1[47:32]);
DEST[31:24] := SaturateSignedWordToSignedByte (SRC1[63:48]);
DEST[39:32] := SaturateSignedWordToSignedByte (SRC1[79:64]);
DEST[47:40] := SaturateSignedWordToSignedByte (SRC1[95:80]);
DEST[55:48] := SaturateSignedWordToSignedByte (SRC1[111:96]);
DEST[63:56] := SaturateSignedWordToSignedByte (SRC1[127:112]);
DEST[71:64] := SaturateSignedWordToSignedByte (SRC2[15:0]);
DEST[79:72] := SaturateSignedWordToSignedByte (SRC2[31:16]);
DEST[87:80] := SaturateSignedWordToSignedByte (SRC2[47:32]);
DEST[95:88] := SaturateSignedWordToSignedByte (SRC2[63:48]);
DEST[103:96] := SaturateSignedWordToSignedByte (SRC2[79:64]);
DEST[111:104] := SaturateSignedWordToSignedByte (SRC2[95:80]);
DEST[119:112] := SaturateSignedWordToSignedByte (SRC2[111:96]);
DEST[127:120] := SaturateSignedWordToSignedByte (SRC2[127:112]);
DEST[135:128] := SaturateSignedWordToSignedByte (SRC1[143:128]);
DEST[143:136] := SaturateSignedWordToSignedByte (SRC1[159:144]);
DEST[151:144] := SaturateSignedWordToSignedByte (SRC1[175:160]);
DEST[159:152] := SaturateSignedWordToSignedByte (SRC1[191:176]);
DEST[167:160] := SaturateSignedWordToSignedByte (SRC1[207:192]);
DEST[175:168] := SaturateSignedWordToSignedByte (SRC1[223:208]);
DEST[183:176] := SaturateSignedWordToSignedByte (SRC1[239:224]);

DEST[191:184] := SaturateSignedWordToSignedByte (SRC1[255:240]);
DEST[199:192] := SaturateSignedWordToSignedByte (SRC2[143:128]);
DEST[207:200] := SaturateSignedWordToSignedByte (SRC2[159:144]);
DEST[215:208] := SaturateSignedWordToSignedByte (SRC2[175:160]);
DEST[223:216] := SaturateSignedWordToSignedByte (SRC2[191:176]);
DEST[231:224] := SaturateSignedWordToSignedByte (SRC2[207:192]);
DEST[239:232] := SaturateSignedWordToSignedByte (SRC2[223:208]);
DEST[247:240] := SaturateSignedWordToSignedByte (SRC2[239:224]);
DEST[255:248] := SaturateSignedWordToSignedByte (SRC2[255:240]);
DEST[MAXVL-1:256] := 0;

VPACKSSDW Instruction (VEX.256 Encoded Version)


DEST[15:0] := SaturateSignedDwordToSignedWord (SRC1[31:0]);
DEST[31:16] := SaturateSignedDwordToSignedWord (SRC1[63:32]);
DEST[47:32] := SaturateSignedDwordToSignedWord (SRC1[95:64]);
DEST[63:48] := SaturateSignedDwordToSignedWord (SRC1[127:96]);
DEST[79:64] := SaturateSignedDwordToSignedWord (SRC2[31:0]);
DEST[95:80] := SaturateSignedDwordToSignedWord (SRC2[63:32]);
DEST[111:96] := SaturateSignedDwordToSignedWord (SRC2[95:64]);
DEST[127:112] := SaturateSignedDwordToSignedWord (SRC2[127:96]);
DEST[143:128] := SaturateSignedDwordToSignedWord (SRC1[159:128]);
DEST[159:144] := SaturateSignedDwordToSignedWord (SRC1[191:160]);
DEST[175:160] := SaturateSignedDwordToSignedWord (SRC1[223:192]);
DEST[191:176] := SaturateSignedDwordToSignedWord (SRC1[255:224]);
DEST[207:192] := SaturateSignedDwordToSignedWord (SRC2[159:128]);
DEST[223:208] := SaturateSignedDwordToSignedWord (SRC2[191:160]);
DEST[239:224] := SaturateSignedDwordToSignedWord (SRC2[223:192]);
DEST[255:240] := SaturateSignedDwordToSignedWord (SRC2[255:224]);
DEST[MAXVL-1:256] := 0;

VPACKSSWB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
TMP_DEST[7:0] := SaturateSignedWordToSignedByte (SRC1[15:0]);
TMP_DEST[15:8] := SaturateSignedWordToSignedByte (SRC1[31:16]);
TMP_DEST[23:16] := SaturateSignedWordToSignedByte (SRC1[47:32]);
TMP_DEST[31:24] := SaturateSignedWordToSignedByte (SRC1[63:48]);
TMP_DEST[39:32] := SaturateSignedWordToSignedByte (SRC1[79:64]);
TMP_DEST[47:40] := SaturateSignedWordToSignedByte (SRC1[95:80]);
TMP_DEST[55:48] := SaturateSignedWordToSignedByte (SRC1[111:96]);
TMP_DEST[63:56] := SaturateSignedWordToSignedByte (SRC1[127:112]);
TMP_DEST[71:64] := SaturateSignedWordToSignedByte (SRC2[15:0]);
TMP_DEST[79:72] := SaturateSignedWordToSignedByte (SRC2[31:16]);
TMP_DEST[87:80] := SaturateSignedWordToSignedByte (SRC2[47:32]);
TMP_DEST[95:88] := SaturateSignedWordToSignedByte (SRC2[63:48]);
TMP_DEST[103:96] := SaturateSignedWordToSignedByte (SRC2[79:64]);
TMP_DEST[111:104] := SaturateSignedWordToSignedByte (SRC2[95:80]);
TMP_DEST[119:112] := SaturateSignedWordToSignedByte (SRC2[111:96]);
TMP_DEST[127:120] := SaturateSignedWordToSignedByte (SRC2[127:112]);
IF VL >= 256
TMP_DEST[135:128] := SaturateSignedWordToSignedByte (SRC1[143:128]);
TMP_DEST[143:136] := SaturateSignedWordToSignedByte (SRC1[159:144]);
TMP_DEST[151:144] := SaturateSignedWordToSignedByte (SRC1[175:160]);
TMP_DEST[159:152] := SaturateSignedWordToSignedByte (SRC1[191:176]);
TMP_DEST[167:160] := SaturateSignedWordToSignedByte (SRC1[207:192]);

TMP_DEST[175:168] := SaturateSignedWordToSignedByte (SRC1[223:208]);
TMP_DEST[183:176] := SaturateSignedWordToSignedByte (SRC1[239:224]);
TMP_DEST[191:184] := SaturateSignedWordToSignedByte (SRC1[255:240]);
TMP_DEST[199:192] := SaturateSignedWordToSignedByte (SRC2[143:128]);
TMP_DEST[207:200] := SaturateSignedWordToSignedByte (SRC2[159:144]);
TMP_DEST[215:208] := SaturateSignedWordToSignedByte (SRC2[175:160]);
TMP_DEST[223:216] := SaturateSignedWordToSignedByte (SRC2[191:176]);
TMP_DEST[231:224] := SaturateSignedWordToSignedByte (SRC2[207:192]);
TMP_DEST[239:232] := SaturateSignedWordToSignedByte (SRC2[223:208]);
TMP_DEST[247:240] := SaturateSignedWordToSignedByte (SRC2[239:224]);
TMP_DEST[255:248] := SaturateSignedWordToSignedByte (SRC2[255:240]);
FI;
IF VL >= 512
TMP_DEST[263:256] := SaturateSignedWordToSignedByte (SRC1[271:256]);
TMP_DEST[271:264] := SaturateSignedWordToSignedByte (SRC1[287:272]);
TMP_DEST[279:272] := SaturateSignedWordToSignedByte (SRC1[303:288]);
TMP_DEST[287:280] := SaturateSignedWordToSignedByte (SRC1[319:304]);
TMP_DEST[295:288] := SaturateSignedWordToSignedByte (SRC1[335:320]);
TMP_DEST[303:296] := SaturateSignedWordToSignedByte (SRC1[351:336]);
TMP_DEST[311:304] := SaturateSignedWordToSignedByte (SRC1[367:352]);
TMP_DEST[319:312] := SaturateSignedWordToSignedByte (SRC1[383:368]);
TMP_DEST[327:320] := SaturateSignedWordToSignedByte (SRC2[271:256]);
TMP_DEST[335:328] := SaturateSignedWordToSignedByte (SRC2[287:272]);
TMP_DEST[343:336] := SaturateSignedWordToSignedByte (SRC2[303:288]);
TMP_DEST[351:344] := SaturateSignedWordToSignedByte (SRC2[319:304]);
TMP_DEST[359:352] := SaturateSignedWordToSignedByte (SRC2[335:320]);
TMP_DEST[367:360] := SaturateSignedWordToSignedByte (SRC2[351:336]);
TMP_DEST[375:368] := SaturateSignedWordToSignedByte (SRC2[367:352]);
TMP_DEST[383:376] := SaturateSignedWordToSignedByte (SRC2[383:368]);
TMP_DEST[391:384] := SaturateSignedWordToSignedByte (SRC1[399:384]);
TMP_DEST[399:392] := SaturateSignedWordToSignedByte (SRC1[415:400]);
TMP_DEST[407:400] := SaturateSignedWordToSignedByte (SRC1[431:416]);
TMP_DEST[415:408] := SaturateSignedWordToSignedByte (SRC1[447:432]);
TMP_DEST[423:416] := SaturateSignedWordToSignedByte (SRC1[463:448]);
TMP_DEST[431:424] := SaturateSignedWordToSignedByte (SRC1[479:464]);
TMP_DEST[439:432] := SaturateSignedWordToSignedByte (SRC1[495:480]);
TMP_DEST[447:440] := SaturateSignedWordToSignedByte (SRC1[511:496]);
TMP_DEST[455:448] := SaturateSignedWordToSignedByte (SRC2[399:384]);
TMP_DEST[463:456] := SaturateSignedWordToSignedByte (SRC2[415:400]);
TMP_DEST[471:464] := SaturateSignedWordToSignedByte (SRC2[431:416]);
TMP_DEST[479:472] := SaturateSignedWordToSignedByte (SRC2[447:432]);
TMP_DEST[487:480] := SaturateSignedWordToSignedByte (SRC2[463:448]);
TMP_DEST[495:488] := SaturateSignedWordToSignedByte (SRC2[479:464]);
TMP_DEST[503:496] := SaturateSignedWordToSignedByte (SRC2[495:480]);
TMP_DEST[511:504] := SaturateSignedWordToSignedByte (SRC2[511:496]);
FI;
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN
DEST[i+7:i] := TMP_DEST[i+7:i]

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPACKSSDW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO ((KL/2) - 1)
i := j * 32
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE
TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;

TMP_DEST[15:0] := SaturateSignedDwordToSignedWord (SRC1[31:0]);
TMP_DEST[31:16] := SaturateSignedDwordToSignedWord (SRC1[63:32]);
TMP_DEST[47:32] := SaturateSignedDwordToSignedWord (SRC1[95:64]);
TMP_DEST[63:48] := SaturateSignedDwordToSignedWord (SRC1[127:96]);
TMP_DEST[79:64] := SaturateSignedDwordToSignedWord (TMP_SRC2[31:0]);
TMP_DEST[95:80] := SaturateSignedDwordToSignedWord (TMP_SRC2[63:32]);
TMP_DEST[111:96] := SaturateSignedDwordToSignedWord (TMP_SRC2[95:64]);
TMP_DEST[127:112] := SaturateSignedDwordToSignedWord (TMP_SRC2[127:96]);
IF VL >= 256
TMP_DEST[143:128] := SaturateSignedDwordToSignedWord (SRC1[159:128]);
TMP_DEST[159:144] := SaturateSignedDwordToSignedWord (SRC1[191:160]);
TMP_DEST[175:160] := SaturateSignedDwordToSignedWord (SRC1[223:192]);
TMP_DEST[191:176] := SaturateSignedDwordToSignedWord (SRC1[255:224]);
TMP_DEST[207:192] := SaturateSignedDwordToSignedWord (TMP_SRC2[159:128]);
TMP_DEST[223:208] := SaturateSignedDwordToSignedWord (TMP_SRC2[191:160]);
TMP_DEST[239:224] := SaturateSignedDwordToSignedWord (TMP_SRC2[223:192]);
TMP_DEST[255:240] := SaturateSignedDwordToSignedWord (TMP_SRC2[255:224]);
FI;
IF VL >= 512
TMP_DEST[271:256] := SaturateSignedDwordToSignedWord (SRC1[287:256]);
TMP_DEST[287:272] := SaturateSignedDwordToSignedWord (SRC1[319:288]);
TMP_DEST[303:288] := SaturateSignedDwordToSignedWord (SRC1[351:320]);
TMP_DEST[319:304] := SaturateSignedDwordToSignedWord (SRC1[383:352]);
TMP_DEST[335:320] := SaturateSignedDwordToSignedWord (TMP_SRC2[287:256]);
TMP_DEST[351:336] := SaturateSignedDwordToSignedWord (TMP_SRC2[319:288]);
TMP_DEST[367:352] := SaturateSignedDwordToSignedWord (TMP_SRC2[351:320]);
TMP_DEST[383:368] := SaturateSignedDwordToSignedWord (TMP_SRC2[383:352]);
TMP_DEST[399:384] := SaturateSignedDwordToSignedWord (SRC1[415:384]);
TMP_DEST[415:400] := SaturateSignedDwordToSignedWord (SRC1[447:416]);
TMP_DEST[431:416] := SaturateSignedDwordToSignedWord (SRC1[479:448]);

TMP_DEST[447:432] := SaturateSignedDwordToSignedWord (SRC1[511:480]);
TMP_DEST[463:448] := SaturateSignedDwordToSignedWord (TMP_SRC2[415:384]);
TMP_DEST[479:464] := SaturateSignedDwordToSignedWord (TMP_SRC2[447:416]);
TMP_DEST[495:480] := SaturateSignedDwordToSignedWord (TMP_SRC2[479:448]);
TMP_DEST[511:496] := SaturateSignedDwordToSignedWord (TMP_SRC2[511:480]);
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPACKSSDW __m512i _mm512_packs_epi32(__m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_mask_packs_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_maskz_packs_epi32( __mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m256i _mm256_mask_packs_epi32( __m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m256i _mm256_maskz_packs_epi32( __mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m128i _mm_mask_packs_epi32( __m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKSSDW __m128i _mm_maskz_packs_epi32( __mmask8 k, __m128i m1, __m128i m2);
VPACKSSWB __m512i _mm512_packs_epi16(__m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_mask_packs_epi16(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_maskz_packs_epi16( __mmask32 k, __m512i m1, __m512i m2);
VPACKSSWB __m256i _mm256_mask_packs_epi16( __m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKSSWB __m256i _mm256_maskz_packs_epi16( __mmask16 k, __m256i m1, __m256i m2);
VPACKSSWB __m128i _mm_mask_packs_epi16( __m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKSSWB __m128i _mm_maskz_packs_epi16( __mmask8 k, __m128i m1, __m128i m2);
PACKSSWB __m128i _mm_packs_epi16(__m128i m1, __m128i m2)
PACKSSDW __m128i _mm_packs_epi32(__m128i m1, __m128i m2)
VPACKSSWB __m256i _mm256_packs_epi16(__m256i m1, __m256i m2)
VPACKSSDW __m256i _mm256_packs_epi32(__m256i m1, __m256i m2)
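
Example (an illustrative sketch only; assumes a C compiler providing <immintrin.h>). Words outside the signed byte range saturate to 80H (-128) or 7FH (127):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Illustrative only; _mm_packs_epi16 maps to PACKSSWB (SSE2). */
    __m128i lo = _mm_set_epi16(700, -700, 127, -128, 1, -1, 0, 42);
    __m128i hi = _mm_setzero_si128();
    /* Low 8 result bytes come from lo, high 8 from hi. */
    signed char out[16];
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi16(lo, hi));
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);  /* 42 0 -1 1 -128 127 -128 127 */
    printf("\n");
    return 0;
}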

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPACKSSDW, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPACKSSWB, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

PACKUSDW—Pack With Unsigned Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2B /r PACKUSDW xmm1, xmm2/m128 | A | V/V | SSE4_1 | Convert 4 packed signed doubleword integers from xmm1 and 4 packed signed doubleword integers from xmm2/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.128.66.0F38 2B /r VPACKUSDW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Convert 4 packed signed doubleword integers from xmm2 and 4 packed signed doubleword integers from xmm3/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.256.66.0F38 2B /r VPACKUSDW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Convert 8 packed signed doubleword integers from ymm2 and 8 packed signed doubleword integers from ymm3/m256 into 16 packed unsigned word integers in ymm1 using unsigned saturation.
EVEX.128.66.0F38.W0 2B /r VPACKUSDW xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1) | Convert packed signed doubleword integers from xmm2 and packed signed doubleword integers from xmm3/m128/m32bcst into packed unsigned word integers in xmm1 using unsigned saturation under writemask k1.
EVEX.256.66.0F38.W0 2B /r VPACKUSDW ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1) | Convert packed signed doubleword integers from ymm2 and packed signed doubleword integers from ymm3/m256/m32bcst into packed unsigned word integers in ymm1 using unsigned saturation under writemask k1.
EVEX.512.66.0F38.W0 2B /r VPACKUSDW zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512BW OR AVX10.1 (see Note 1) | Convert packed signed doubleword integers from zmm2 and packed signed doubleword integers from zmm3/m512/m32bcst into packed unsigned word integers in zmm1 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts packed signed doubleword integers in the first and second source operands into packed unsigned word
integers using unsigned saturation to handle overflow conditions. If the signed doubleword value is beyond the
range of an unsigned word (that is, greater than FFFFH or less than 0000H), the saturated unsigned word integer
value of FFFFH or 0000H, respectively, is stored in the destination.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-
bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the writemask k1.



VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
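
The per-element saturation rule above is straightforward to model in scalar C. The helper below is only an illustrative sketch of the behavior; the function name is ours, not part of any library or of the instruction specification.

#include <stdint.h>
#include <stdio.h>

/* Illustrative scalar model of the PACKUSDW per-element rule: clamp a
   signed doubleword to the unsigned word range [0000H, FFFFH]. */
static uint16_t saturate_sdword_to_uword(int32_t x) {
    if (x < 0)      return 0x0000;  /* below range: saturate to 0000H */
    if (x > 0xFFFF) return 0xFFFF;  /* above range: saturate to FFFFH */
    return (uint16_t)x;
}

int main(void) {
    printf("%d %d %d\n", saturate_sdword_to_uword(-7),
           saturate_sdword_to_uword(70000),
           saturate_sdword_to_uword(1234));  /* prints: 0 65535 1234 */
    return 0;
}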

Operation
PACKUSDW (Legacy SSE Instruction)
TMP[15:0] := (DEST[31:0] < 0) ? 0 : DEST[15:0];
DEST[15:0] := (DEST[31:0] > FFFFH) ? FFFFH : TMP[15:0] ;
TMP[31:16] := (DEST[63:32] < 0) ? 0 : DEST[47:32];
DEST[31:16] := (DEST[63:32] > FFFFH) ? FFFFH : TMP[31:16] ;
TMP[47:32] := (DEST[95:64] < 0) ? 0 : DEST[79:64];
DEST[47:32] := (DEST[95:64] > FFFFH) ? FFFFH : TMP[47:32] ;
TMP[63:48] := (DEST[127:96] < 0) ? 0 : DEST[111:96];
DEST[63:48] := (DEST[127:96] > FFFFH) ? FFFFH : TMP[63:48] ;
TMP[79:64] := (SRC[31:0] < 0) ? 0 : SRC[15:0];
DEST[79:64] := (SRC[31:0] > FFFFH) ? FFFFH : TMP[79:64] ;
TMP[95:80] := (SRC[63:32] < 0) ? 0 : SRC[47:32];
DEST[95:80] := (SRC[63:32] > FFFFH) ? FFFFH : TMP[95:80] ;
TMP[111:96] := (SRC[95:64] < 0) ? 0 : SRC[79:64];
DEST[111:96] := (SRC[95:64] > FFFFH) ? FFFFH : TMP[111:96] ;
TMP[127:112] := (SRC[127:96] < 0) ? 0 : SRC[111:96];
DEST[127:112] := (SRC[127:96] > FFFFH) ? FFFFH : TMP[127:112] ;
DEST[MAXVL-1:128] (Unmodified)

PACKUSDW (VEX.128 Encoded Version)


TMP[15:0] := (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] := (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0] ;
TMP[31:16] := (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] := (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16] ;
TMP[47:32] := (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] := (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32] ;
TMP[63:48] := (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] := (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48] ;
TMP[79:64] := (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] := (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64] ;
TMP[95:80] := (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] := (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80] ;
TMP[111:96] := (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] := (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96] ;
TMP[127:112] := (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] := (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
DEST[MAXVL-1:128] := 0;



VPACKUSDW (VEX.256 Encoded Version)
TMP[15:0] := (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] := (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0] ;
TMP[31:16] := (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] := (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16] ;
TMP[47:32] := (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] := (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32] ;
TMP[63:48] := (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] := (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48] ;
TMP[79:64] := (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] := (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64] ;
TMP[95:80] := (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] := (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80] ;
TMP[111:96] := (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] := (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96] ;
TMP[127:112] := (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] := (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112] ;
TMP[143:128] := (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
DEST[143:128] := (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128] ;
TMP[159:144] := (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
DEST[159:144] := (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144] ;
TMP[175:160] := (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
DEST[175:160] := (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160] ;
TMP[191:176] := (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
DEST[191:176] := (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176] ;
TMP[207:192] := (SRC2[159:128] < 0) ? 0 : SRC2[143:128];
DEST[207:192] := (SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192] ;
TMP[223:208] := (SRC2[191:160] < 0) ? 0 : SRC2[175:160];
DEST[223:208] := (SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208] ;
TMP[239:224] := (SRC2[223:192] < 0) ? 0 : SRC2[207:192];
DEST[239:224] := (SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224] ;
TMP[255:240] := (SRC2[255:224] < 0) ? 0 : SRC2[239:224];
DEST[255:240] := (SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240] ;
DEST[MAXVL-1:256] := 0;

VPACKUSDW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO ((KL/2) - 1)
i := j * 32

IF (EVEX.b == 1) AND (SRC2 *is memory*)


THEN
TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE
TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;

TMP[15:0] := (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
TMP_DEST[15:0] := (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] := (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
TMP_DEST[31:16] := (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] := (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
TMP_DEST[47:32] := (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] := (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
TMP_DEST[63:48] := (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] := (TMP_SRC2[31:0] < 0) ? 0 : TMP_SRC2[15:0];
TMP_DEST[79:64] := (TMP_SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] := (TMP_SRC2[63:32] < 0) ? 0 : TMP_SRC2[47:32];
TMP_DEST[95:80] := (TMP_SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] := (TMP_SRC2[95:64] < 0) ? 0 : TMP_SRC2[79:64];
TMP_DEST[111:96] := (TMP_SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] := (TMP_SRC2[127:96] < 0) ? 0 : TMP_SRC2[111:96];
TMP_DEST[127:112] := (TMP_SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
IF VL >= 256
TMP[143:128] := (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
TMP_DEST[143:128] := (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128];
TMP[159:144] := (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
TMP_DEST[159:144] := (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144];
TMP[175:160] := (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
TMP_DEST[175:160] := (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160];
TMP[191:176] := (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
TMP_DEST[191:176] := (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176];
TMP[207:192] := (TMP_SRC2[159:128] < 0) ? 0 : TMP_SRC2[143:128];
TMP_DEST[207:192] := (TMP_SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192];
TMP[223:208] := (TMP_SRC2[191:160] < 0) ? 0 : TMP_SRC2[175:160];
TMP_DEST[223:208] := (TMP_SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208];
TMP[239:224] := (TMP_SRC2[223:192] < 0) ? 0 : TMP_SRC2[207:192];
TMP_DEST[239:224] := (TMP_SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224];
TMP[255:240] := (TMP_SRC2[255:224] < 0) ? 0 : TMP_SRC2[239:224];
TMP_DEST[255:240] := (TMP_SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240];
FI;
IF VL >= 512
TMP[271:256] := (SRC1[287:256] < 0) ? 0 : SRC1[271:256];
TMP_DEST[271:256] := (SRC1[287:256] > FFFFH) ? FFFFH : TMP[271:256];
TMP[287:272] := (SRC1[319:288] < 0) ? 0 : SRC1[303:288];
TMP_DEST[287:272] := (SRC1[319:288] > FFFFH) ? FFFFH : TMP[287:272];
TMP[303:288] := (SRC1[351:320] < 0) ? 0 : SRC1[335:320];
TMP_DEST[303:288] := (SRC1[351:320] > FFFFH) ? FFFFH : TMP[303:288];
TMP[319:304] := (SRC1[383:352] < 0) ? 0 : SRC1[367:352];
TMP_DEST[319:304] := (SRC1[383:352] > FFFFH) ? FFFFH : TMP[319:304];
TMP[335:320] := (TMP_SRC2[287:256] < 0) ? 0 : TMP_SRC2[271:256];
TMP_DEST[335:320] := (TMP_SRC2[287:256] > FFFFH) ? FFFFH : TMP[335:320];
TMP[351:336] := (TMP_SRC2[319:288] < 0) ? 0 : TMP_SRC2[303:288];
TMP_DEST[351:336] := (TMP_SRC2[319:288] > FFFFH) ? FFFFH : TMP[351:336];
TMP[367:352] := (TMP_SRC2[351:320] < 0) ? 0 : TMP_SRC2[335:320];
TMP_DEST[367:352] := (TMP_SRC2[351:320] > FFFFH) ? FFFFH : TMP[367:352];
TMP[383:368] := (TMP_SRC2[383:352] < 0) ? 0 : TMP_SRC2[367:352];
TMP_DEST[383:368] := (TMP_SRC2[383:352] > FFFFH) ? FFFFH : TMP[383:368];
TMP[399:384] := (SRC1[415:384] < 0) ? 0 : SRC1[399:384];
TMP_DEST[399:384] := (SRC1[415:384] > FFFFH) ? FFFFH : TMP[399:384];
TMP[415:400] := (SRC1[447:416] < 0) ? 0 : SRC1[431:416];
TMP_DEST[415:400] := (SRC1[447:416] > FFFFH) ? FFFFH : TMP[415:400];
TMP[431:416] := (SRC1[479:448] < 0) ? 0 : SRC1[463:448];
TMP_DEST[431:416] := (SRC1[479:448] > FFFFH) ? FFFFH : TMP[431:416];
TMP[447:432] := (SRC1[511:480] < 0) ? 0 : SRC1[495:480];
TMP_DEST[447:432] := (SRC1[511:480] > FFFFH) ? FFFFH : TMP[447:432];
TMP[463:448] := (TMP_SRC2[415:384] < 0) ? 0 : TMP_SRC2[399:384];
TMP_DEST[463:448] := (TMP_SRC2[415:384] > FFFFH) ? FFFFH : TMP[463:448];
TMP[479:464] := (TMP_SRC2[447:416] < 0) ? 0 : TMP_SRC2[431:416];
TMP_DEST[479:464] := (TMP_SRC2[447:416] > FFFFH) ? FFFFH : TMP[479:464];
TMP[495:480] := (TMP_SRC2[479:448] < 0) ? 0 : TMP_SRC2[463:448];
TMP_DEST[495:480] := (TMP_SRC2[479:448] > FFFFH) ? FFFFH : TMP[495:480];
TMP[511:496] := (TMP_SRC2[511:480] < 0) ? 0 : TMP_SRC2[495:480];
TMP_DEST[511:496] := (TMP_SRC2[511:480] > FFFFH) ? FFFFH : TMP[511:496];
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPACKUSDW __m512i _mm512_packus_epi32(__m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_mask_packus_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_maskz_packus_epi32(__mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m256i _mm256_mask_packus_epi32(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m256i _mm256_maskz_packus_epi32(__mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m128i _mm_mask_packus_epi32(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKUSDW __m128i _mm_maskz_packus_epi32(__mmask8 k, __m128i m1, __m128i m2);
PACKUSDW __m128i _mm_packus_epi32(__m128i m1, __m128i m2);
VPACKUSDW __m256i _mm256_packus_epi32(__m256i m1, __m256i m2);
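
The legacy intrinsic makes the unsigned-saturation behavior easy to observe. A minimal sketch, assuming an SSE4.1 target (e.g., gcc -msse4.1); the input values are arbitrary.

#include <smmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi32(-5, 70000, 65535, 0);
    __m128i r = _mm_packus_epi32(a, a);  /* PACKUSDW with both sources equal */
    unsigned short out[8];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);  /* 0 65535 65535 0 */
    return 0;
}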

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”



PACKUSWB—Pack With Unsigned Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 67 /r¹ PACKUSWB mm, mm/m64 | A | V/V | MMX | Converts 4 signed word integers from mm and 4 signed word integers from mm/m64 into 8 unsigned byte integers in mm using unsigned saturation.
66 0F 67 /r PACKUSWB xmm1, xmm2/m128 | A | V/V | SSE2 | Converts 8 signed word integers from xmm1 and 8 signed word integers from xmm2/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.
VEX.128.66.0F.WIG 67 /r VPACKUSWB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Converts 8 signed word integers from xmm2 and 8 signed word integers from xmm3/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.
VEX.256.66.0F.WIG 67 /r VPACKUSWB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Converts 16 signed word integers from ymm2 and 16 signed word integers from ymm3/m256 into 32 unsigned byte integers in ymm1 using unsigned saturation.
EVEX.128.66.0F.WIG 67 /r VPACKUSWB xmm1{k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Converts signed word integers from xmm2 and signed word integers from xmm3/m128 into unsigned byte integers in xmm1 using unsigned saturation under writemask k1.
EVEX.256.66.0F.WIG 67 /r VPACKUSWB ymm1{k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Converts signed word integers from ymm2 and signed word integers from ymm3/m256 into unsigned byte integers in ymm1 using unsigned saturation under writemask k1.
EVEX.512.66.0F.WIG 67 /r VPACKUSWB zmm1{k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Converts signed word integers from zmm2 and signed word integers from zmm3/m512 into unsigned byte integers in zmm1 using unsigned saturation under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts 4, 8, 16, or 32 signed word integers from the destination operand (first operand) and 4, 8, 16, or 32
signed word integers from the source operand (second operand) into 8, 16, 32 or 64 unsigned byte integers and
stores the result in the destination operand. (See Figure 4-6 for an example of the packing operation.) If a signed
word integer value is beyond the range of an unsigned byte integer (that is, greater than FFH or less than 00H), the
saturated unsigned byte integer value of FFH or 00H, respectively, is stored in the destination.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register or a 512-bit memory location. The destination operand is a ZMM register.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-
1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-
1:128) of the corresponding register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
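
As with PACKUSDW, the per-element rule can be modeled by a small scalar helper in C. This is an illustrative sketch only; the function name is ours, not part of any library.

#include <stdint.h>
#include <stdio.h>

/* Illustrative scalar model of the PACKUSWB per-element rule: clamp a
   signed word to the unsigned byte range [00H, FFH]. */
static uint8_t saturate_sword_to_ubyte(int16_t x) {
    if (x < 0)    return 0x00;  /* below range: saturate to 00H */
    if (x > 0xFF) return 0xFF;  /* above range: saturate to FFH */
    return (uint8_t)x;
}

int main(void) {
    printf("%d %d %d\n", saturate_sword_to_ubyte(-1),
           saturate_sword_to_ubyte(300),
           saturate_sword_to_ubyte(42));  /* prints: 0 255 42 */
    return 0;
}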

Operation
PACKUSWB (With 64-bit Operands)
DEST[7:0] := SaturateSignedWordToUnsignedByte DEST[15:0];
DEST[15:8] := SaturateSignedWordToUnsignedByte DEST[31:16];
DEST[23:16] := SaturateSignedWordToUnsignedByte DEST[47:32];
DEST[31:24] := SaturateSignedWordToUnsignedByte DEST[63:48];
DEST[39:32] := SaturateSignedWordToUnsignedByte SRC[15:0];
DEST[47:40] := SaturateSignedWordToUnsignedByte SRC[31:16];
DEST[55:48] := SaturateSignedWordToUnsignedByte SRC[47:32];
DEST[63:56] := SaturateSignedWordToUnsignedByte SRC[63:48];

PACKUSWB (Legacy SSE Instruction)


DEST[7:0] := SaturateSignedWordToUnsignedByte (DEST[15:0]);
DEST[15:8] := SaturateSignedWordToUnsignedByte (DEST[31:16]);
DEST[23:16] := SaturateSignedWordToUnsignedByte (DEST[47:32]);
DEST[31:24] := SaturateSignedWordToUnsignedByte (DEST[63:48]);
DEST[39:32] := SaturateSignedWordToUnsignedByte (DEST[79:64]);
DEST[47:40] := SaturateSignedWordToUnsignedByte (DEST[95:80]);
DEST[55:48] := SaturateSignedWordToUnsignedByte (DEST[111:96]);
DEST[63:56] := SaturateSignedWordToUnsignedByte (DEST[127:112]);
DEST[71:64] := SaturateSignedWordToUnsignedByte (SRC[15:0]);
DEST[79:72] := SaturateSignedWordToUnsignedByte (SRC[31:16]);
DEST[87:80] := SaturateSignedWordToUnsignedByte (SRC[47:32]);
DEST[95:88] := SaturateSignedWordToUnsignedByte (SRC[63:48]);
DEST[103:96] := SaturateSignedWordToUnsignedByte (SRC[79:64]);
DEST[111:104] := SaturateSignedWordToUnsignedByte (SRC[95:80]);
DEST[119:112] := SaturateSignedWordToUnsignedByte (SRC[111:96]);
DEST[127:120] := SaturateSignedWordToUnsignedByte (SRC[127:112]);

PACKUSWB (VEX.128 Encoded Version)


DEST[7:0] := SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] := SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] := SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] := SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] := SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] := SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] := SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] := SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] := SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] := SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] := SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] := SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] := SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] := SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] := SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] := SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[MAXVL-1:128] := 0;

VPACKUSWB (VEX.256 Encoded Version)


DEST[7:0] := SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] := SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] := SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] := SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] := SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] := SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] := SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] := SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] := SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] := SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] := SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] := SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] := SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] := SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] := SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] := SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[135:128] := SaturateSignedWordToUnsignedByte (SRC1[143:128]);
DEST[143:136] := SaturateSignedWordToUnsignedByte (SRC1[159:144]);
DEST[151:144] := SaturateSignedWordToUnsignedByte (SRC1[175:160]);
DEST[159:152] := SaturateSignedWordToUnsignedByte (SRC1[191:176]);
DEST[167:160] := SaturateSignedWordToUnsignedByte (SRC1[207:192]);
DEST[175:168] := SaturateSignedWordToUnsignedByte (SRC1[223:208]);
DEST[183:176] := SaturateSignedWordToUnsignedByte (SRC1[239:224]);
DEST[191:184] := SaturateSignedWordToUnsignedByte (SRC1[255:240]);
DEST[199:192] := SaturateSignedWordToUnsignedByte (SRC2[143:128]);
DEST[207:200] := SaturateSignedWordToUnsignedByte (SRC2[159:144]);
DEST[215:208] := SaturateSignedWordToUnsignedByte (SRC2[175:160]);
DEST[223:216] := SaturateSignedWordToUnsignedByte (SRC2[191:176]);
DEST[231:224] := SaturateSignedWordToUnsignedByte (SRC2[207:192]);
DEST[239:232] := SaturateSignedWordToUnsignedByte (SRC2[223:208]);
DEST[247:240] := SaturateSignedWordToUnsignedByte (SRC2[239:224]);
DEST[255:248] := SaturateSignedWordToUnsignedByte (SRC2[255:240]);

VPACKUSWB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
TMP_DEST[7:0] := SaturateSignedWordToUnsignedByte (SRC1[15:0]);
TMP_DEST[15:8] := SaturateSignedWordToUnsignedByte (SRC1[31:16]);
TMP_DEST[23:16] := SaturateSignedWordToUnsignedByte (SRC1[47:32]);
TMP_DEST[31:24] := SaturateSignedWordToUnsignedByte (SRC1[63:48]);
TMP_DEST[39:32] := SaturateSignedWordToUnsignedByte (SRC1[79:64]);
TMP_DEST[47:40] := SaturateSignedWordToUnsignedByte (SRC1[95:80]);
TMP_DEST[55:48] := SaturateSignedWordToUnsignedByte (SRC1[111:96]);
TMP_DEST[63:56] := SaturateSignedWordToUnsignedByte (SRC1[127:112]);
TMP_DEST[71:64] := SaturateSignedWordToUnsignedByte (SRC2[15:0]);
TMP_DEST[79:72] := SaturateSignedWordToUnsignedByte (SRC2[31:16]);
TMP_DEST[87:80] := SaturateSignedWordToUnsignedByte (SRC2[47:32]);
TMP_DEST[95:88] := SaturateSignedWordToUnsignedByte (SRC2[63:48]);
TMP_DEST[103:96] := SaturateSignedWordToUnsignedByte (SRC2[79:64]);
TMP_DEST[111:104] := SaturateSignedWordToUnsignedByte (SRC2[95:80]);
TMP_DEST[119:112] := SaturateSignedWordToUnsignedByte (SRC2[111:96]);
TMP_DEST[127:120] := SaturateSignedWordToUnsignedByte (SRC2[127:112]);
IF VL >= 256
TMP_DEST[135:128] := SaturateSignedWordToUnsignedByte (SRC1[143:128]);
TMP_DEST[143:136] := SaturateSignedWordToUnsignedByte (SRC1[159:144]);
TMP_DEST[151:144] := SaturateSignedWordToUnsignedByte (SRC1[175:160]);
TMP_DEST[159:152] := SaturateSignedWordToUnsignedByte (SRC1[191:176]);
TMP_DEST[167:160] := SaturateSignedWordToUnsignedByte (SRC1[207:192]);
TMP_DEST[175:168] := SaturateSignedWordToUnsignedByte (SRC1[223:208]);
TMP_DEST[183:176] := SaturateSignedWordToUnsignedByte (SRC1[239:224]);
TMP_DEST[191:184] := SaturateSignedWordToUnsignedByte (SRC1[255:240]);
TMP_DEST[199:192] := SaturateSignedWordToUnsignedByte (SRC2[143:128]);
TMP_DEST[207:200] := SaturateSignedWordToUnsignedByte (SRC2[159:144]);
TMP_DEST[215:208] := SaturateSignedWordToUnsignedByte (SRC2[175:160]);
TMP_DEST[223:216] := SaturateSignedWordToUnsignedByte (SRC2[191:176]);
TMP_DEST[231:224] := SaturateSignedWordToUnsignedByte (SRC2[207:192]);
TMP_DEST[239:232] := SaturateSignedWordToUnsignedByte (SRC2[223:208]);
TMP_DEST[247:240] := SaturateSignedWordToUnsignedByte (SRC2[239:224]);
TMP_DEST[255:248] := SaturateSignedWordToUnsignedByte (SRC2[255:240]);
FI;
IF VL >= 512
TMP_DEST[263:256] := SaturateSignedWordToUnsignedByte (SRC1[271:256]);
TMP_DEST[271:264] := SaturateSignedWordToUnsignedByte (SRC1[287:272]);
TMP_DEST[279:272] := SaturateSignedWordToUnsignedByte (SRC1[303:288]);
TMP_DEST[287:280] := SaturateSignedWordToUnsignedByte (SRC1[319:304]);
TMP_DEST[295:288] := SaturateSignedWordToUnsignedByte (SRC1[335:320]);
TMP_DEST[303:296] := SaturateSignedWordToUnsignedByte (SRC1[351:336]);
TMP_DEST[311:304] := SaturateSignedWordToUnsignedByte (SRC1[367:352]);
TMP_DEST[319:312] := SaturateSignedWordToUnsignedByte (SRC1[383:368]);

TMP_DEST[327:320] := SaturateSignedWordToUnsignedByte (SRC2[271:256]);


TMP_DEST[335:328] := SaturateSignedWordToUnsignedByte (SRC2[287:272]);
TMP_DEST[343:336] := SaturateSignedWordToUnsignedByte (SRC2[303:288]);
TMP_DEST[351:344] := SaturateSignedWordToUnsignedByte (SRC2[319:304]);
TMP_DEST[359:352] := SaturateSignedWordToUnsignedByte (SRC2[335:320]);
TMP_DEST[367:360] := SaturateSignedWordToUnsignedByte (SRC2[351:336]);
TMP_DEST[375:368] := SaturateSignedWordToUnsignedByte (SRC2[367:352]);
TMP_DEST[383:376] := SaturateSignedWordToUnsignedByte (SRC2[383:368]);

TMP_DEST[391:384] := SaturateSignedWordToUnsignedByte (SRC1[399:384]);


TMP_DEST[399:392] := SaturateSignedWordToUnsignedByte (SRC1[415:400]);
TMP_DEST[407:400] := SaturateSignedWordToUnsignedByte (SRC1[431:416]);
TMP_DEST[415:408] := SaturateSignedWordToUnsignedByte (SRC1[447:432]);
TMP_DEST[423:416] := SaturateSignedWordToUnsignedByte (SRC1[463:448]);
TMP_DEST[431:424] := SaturateSignedWordToUnsignedByte (SRC1[479:464]);
TMP_DEST[439:432] := SaturateSignedWordToUnsignedByte (SRC1[495:480]);
TMP_DEST[447:440] := SaturateSignedWordToUnsignedByte (SRC1[511:496]);

TMP_DEST[455:448] := SaturateSignedWordToUnsignedByte (SRC2[399:384]);
TMP_DEST[463:456] := SaturateSignedWordToUnsignedByte (SRC2[415:400]);
TMP_DEST[471:464] := SaturateSignedWordToUnsignedByte (SRC2[431:416]);
TMP_DEST[479:472] := SaturateSignedWordToUnsignedByte (SRC2[447:432]);
TMP_DEST[487:480] := SaturateSignedWordToUnsignedByte (SRC2[463:448]);
TMP_DEST[495:488] := SaturateSignedWordToUnsignedByte (SRC2[479:464]);
TMP_DEST[503:496] := SaturateSignedWordToUnsignedByte (SRC2[495:480]);
TMP_DEST[511:504] := SaturateSignedWordToUnsignedByte (SRC2[511:496]);
FI;
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN
DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPACKUSWB __m512i _mm512_packus_epi16(__m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_mask_packus_epi16(__m512i s, __mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_maskz_packus_epi16(__mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m256i _mm256_mask_packus_epi16(__m256i s, __mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m256i _mm256_maskz_packus_epi16(__mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m128i _mm_mask_packus_epi16(__m128i s, __mmask16 k, __m128i m1, __m128i m2);
VPACKUSWB __m128i _mm_maskz_packus_epi16(__mmask16 k, __m128i m1, __m128i m2);
PACKUSWB __m64 _mm_packs_pu16(__m64 m1, __m64 m2)
(V)PACKUSWB __m128i _mm_packus_epi16(__m128i m1, __m128i m2)
VPACKUSWB __m256i _mm256_packus_epi16(__m256i m1, __m256i m2);
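
A usage sketch of the legacy intrinsic, assuming an SSE2 target; the input words are arbitrary and chosen to exercise both saturation directions.

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi16(-1, 300, 255, 0, 128, -32768, 32767, 42);
    __m128i r = _mm_packus_epi16(a, a);  /* PACKUSWB with both sources equal */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 8; i++)
        printf("%u ", out[i]);  /* expected: 0 255 255 0 128 0 255 42 */
    printf("\n");
    return 0;
}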

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



PADDB/PADDW/PADDD/PADDQ—Add Packed Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F FC /r¹ PADDB mm, mm/m64 | A | V/V | MMX | Add packed byte integers from mm/m64 and mm.
NP 0F FD /r¹ PADDW mm, mm/m64 | A | V/V | MMX | Add packed word integers from mm/m64 and mm.
NP 0F FE /r¹ PADDD mm, mm/m64 | A | V/V | MMX | Add packed doubleword integers from mm/m64 and mm.
NP 0F D4 /r¹ PADDQ mm, mm/m64 | A | V/V | MMX | Add packed quadword integers from mm/m64 and mm.
66 0F FC /r PADDB xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed byte integers from xmm2/m128 and xmm1.
66 0F FD /r PADDW xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed word integers from xmm2/m128 and xmm1.
66 0F FE /r PADDD xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed doubleword integers from xmm2/m128 and xmm1.
66 0F D4 /r PADDQ xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed quadword integers from xmm2/m128 and xmm1.
VEX.128.66.0F.WIG FC /r VPADDB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed byte integers from xmm2 and xmm3/m128 and store in xmm1.
VEX.128.66.0F.WIG FD /r VPADDW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed word integers from xmm2 and xmm3/m128 and store in xmm1.
VEX.128.66.0F.WIG FE /r VPADDD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed doubleword integers from xmm2 and xmm3/m128 and store in xmm1.
VEX.128.66.0F.WIG D4 /r VPADDQ xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed quadword integers from xmm2 and xmm3/m128 and store in xmm1.
VEX.256.66.0F.WIG FC /r VPADDB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed byte integers from ymm2 and ymm3/m256 and store in ymm1.
VEX.256.66.0F.WIG FD /r VPADDW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed word integers from ymm2 and ymm3/m256 and store in ymm1.
VEX.256.66.0F.WIG FE /r VPADDD ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed doubleword integers from ymm2 and ymm3/m256 and store in ymm1.
VEX.256.66.0F.WIG D4 /r VPADDQ ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed quadword integers from ymm2 and ymm3/m256 and store in ymm1.
EVEX.128.66.0F.WIG FC /r VPADDB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed byte integers from xmm2 and xmm3/m128 and store in xmm1 using writemask k1.
EVEX.128.66.0F.WIG FD /r VPADDW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed word integers from xmm2 and xmm3/m128 and store in xmm1 using writemask k1.
EVEX.128.66.0F.W0 FE /r VPADDD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1² | Add packed doubleword integers from xmm2 and xmm3/m128/m32bcst and store in xmm1 using writemask k1.
EVEX.128.66.0F.W1 D4 /r VPADDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1² | Add packed quadword integers from xmm2 and xmm3/m128/m64bcst and store in xmm1 using writemask k1.
EVEX.256.66.0F.WIG FC /r VPADDB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed byte integers from ymm2 and ymm3/m256 and store in ymm1 using writemask k1.
EVEX.256.66.0F.WIG FD /r VPADDW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed word integers from ymm2 and ymm3/m256 and store in ymm1 using writemask k1.
EVEX.256.66.0F.W0 FE /r VPADDD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1² | Add packed doubleword integers from ymm2 and ymm3/m256/m32bcst and store in ymm1 using writemask k1.
EVEX.256.66.0F.W1 D4 /r VPADDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1² | Add packed quadword integers from ymm2 and ymm3/m256/m64bcst and store in ymm1 using writemask k1.
EVEX.512.66.0F.WIG FC /r VPADDB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed byte integers from zmm2 and zmm3/m512 and store in zmm1 using writemask k1.
EVEX.512.66.0F.WIG FD /r VPADDW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed word integers from zmm2 and zmm3/m512 and store in zmm1 using writemask k1.
EVEX.512.66.0F.W0 FE /r VPADDD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | D | V/V | AVX512F OR AVX10.1² | Add packed doubleword integers from zmm2 and zmm3/m512/m32bcst and store in zmm1 using writemask k1.
EVEX.512.66.0F.W1 D4 /r VPADDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | D | V/V | AVX512F OR AVX10.1² | Add packed quadword integers from zmm2 and zmm3/m512/m64bcst and store in zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD add of the packed integers from the source operand (second operand) and the destination
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with wraparound, as described in the following paragraphs.
The PADDB and VPADDB instructions add packed byte integers from the first source operand and second source
operand and store the packed integer results in the destination operand. When an individual result is too large to
be represented in 8 bits (overflow), the result is wrapped around and the low 8 bits are written to the destination
operand (that is, the carry is ignored).
The PADDW and VPADDW instructions add packed word integers from the first source operand and second source
operand and store the packed integer results in the destination operand. When an individual result is too large to
be represented in 16 bits (overflow), the result is wrapped around and the low 16 bits are written to the destination
operand (that is, the carry is ignored).
The PADDD and VPADDD instructions add packed doubleword integers from the first source operand and second
source operand and store the packed integer results in the destination operand. When an individual result is too
large to be represented in 32 bits (overflow), the result is wrapped around and the low 32 bits are written to the
destination operand (that is, the carry is ignored).
The PADDQ and VPADDQ instructions add packed quadword integers from the first source operand and second
source operand and store the packed integer results in the destination operand. When a quadword result is too
large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the
destination operand (that is, the carry is ignored).
Note that the (V)PADDB, (V)PADDW, (V)PADDD, and (V)PADDQ instructions can operate on either unsigned or
signed (two's complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate
overflow and/or a carry. To prevent undetected overflow conditions, software must control the ranges of
values operated on.
EVEX encoded VPADDD/Q: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the write-
mask.
EVEX encoded VPADDB/W: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register updated according to the writemask.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
destination are cleared.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
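
The wraparound behavior is easy to observe with the legacy intrinsic. A minimal sketch, assuming an SSE2 target; the constant 0x7F is arbitrary and chosen to force a signed-byte overflow.

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8(0x7F);     /* 0x7F in every byte */
    __m128i b = _mm_set1_epi8(0x01);
    __m128i r = _mm_add_epi8(a, b);      /* PADDB: each byte wraps to 0x80 */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("0x%02X\n", out[0]);          /* prints 0x80: the carry was discarded */
    return 0;
}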

Operation
PADDB (With 64-bit Operands)
DEST[7:0] := DEST[7:0] + SRC[7:0];
(* Repeat add operation for 2nd through 7th byte *)
DEST[63:56] := DEST[63:56] + SRC[63:56];

PADDW (With 64-bit Operands)


DEST[15:0] := DEST[15:0] + SRC[15:0];
(* Repeat add operation for 2nd and 3rd word *)
DEST[63:48] := DEST[63:48] + SRC[63:48];

PADDD (With 64-bit Operands)


DEST[31:0] := DEST[31:0] + SRC[31:0];
DEST[63:32] := DEST[63:32] + SRC[63:32];

PADDQ (With 64-Bit Operands)


DEST[63:0] := DEST[63:0] + SRC[63:0];



PADDB (Legacy SSE Instruction)
DEST[7:0] := DEST[7:0] + SRC[7:0];
(* Repeat add operation for 2nd through 15th byte *)
DEST[127:120] := DEST[127:120] + SRC[127:120];
DEST[MAXVL-1:128] (Unmodified)

PADDW (Legacy SSE Instruction)


DEST[15:0] := DEST[15:0] + SRC[15:0];
(* Repeat add operation for 2nd through 7th word *)
DEST[127:112] := DEST[127:112] + SRC[127:112];
DEST[MAXVL-1:128] (Unmodified)

PADDD (Legacy SSE Instruction)


DEST[31:0] := DEST[31:0] + SRC[31:0];
(* Repeat add operation for 2nd and 3rd doubleword *)
DEST[127:96] := DEST[127:96] + SRC[127:96];
DEST[MAXVL-1:128] (Unmodified)

PADDQ (Legacy SSE Instruction)


DEST[63:0] := DEST[63:0] + SRC[63:0];
DEST[127:64] := DEST[127:64] + SRC[127:64];
DEST[MAXVL-1:128] (Unmodified)

VPADDB (VEX.128 Encoded Instruction)


DEST[7:0] := SRC1[7:0] + SRC2[7:0];
(* Repeat add operation for 2nd through 15th byte *)
DEST[127:120] := SRC1[127:120] + SRC2[127:120];
DEST[MAXVL-1:128] := 0;

VPADDW (VEX.128 Encoded Instruction)


DEST[15:0] := SRC1[15:0] + SRC2[15:0];
(* Repeat add operation for 2nd through 7th word *)
DEST[127:112] := SRC1[127:112] + SRC2[127:112];
DEST[MAXVL-1:128] := 0;

VPADDD (VEX.128 Encoded Instruction)


DEST[31:0] := SRC1[31:0] + SRC2[31:0];
(* Repeat add operation for 2nd and 3rd doubleword *)
DEST[127:96] := SRC1[127:96] + SRC2[127:96];
DEST[MAXVL-1:128] := 0;

VPADDQ (VEX.128 Encoded Instruction)


DEST[63:0] := SRC1[63:0] + SRC2[63:0];
DEST[127:64] := SRC1[127:64] + SRC2[127:64];
DEST[MAXVL-1:128] := 0;

VPADDB (VEX.256 Encoded Instruction)


DEST[7:0] := SRC1[7:0] + SRC2[7:0];
(* Repeat add operation for 2nd through 31st byte *)
DEST[255:248] := SRC1[255:248] + SRC2[255:248];



VPADDW (VEX.256 Encoded Instruction)
DEST[15:0] := SRC1[15:0] + SRC2[15:0];
(* Repeat add operation for 2nd through 15th word *)
DEST[255:240] := SRC1[255:240] + SRC2[255:240];

VPADDD (VEX.256 Encoded Instruction)


DEST[31:0] := SRC1[31:0] + SRC2[31:0];
(* Repeat add operation for 2nd through 7th doubleword *)
DEST[255:224] := SRC1[255:224] + SRC2[255:224];

VPADDQ (VEX.256 Encoded Instruction)


DEST[63:0] := SRC1[63:0] + SRC2[63:0];
DEST[127:64] := SRC1[127:64] + SRC2[127:64];
DEST[191:128] := SRC1[191:128] + SRC2[191:128];
DEST[255:192] := SRC1[255:192] + SRC2[255:192];

VPADDB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC1[i+7:i] + SRC2[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPADDW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC1[i+15:i] + SRC2[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0



VPADDD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] + SRC2[31:0]
ELSE DEST[i+31:i] := SRC1[i+31:i] + SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPADDQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SRC1[i+63:i] + SRC2[63:0]
ELSE DEST[i+63:i] := SRC1[i+63:i] + SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPADDB __m512i _mm512_add_epi8 (__m512i a, __m512i b)
VPADDW __m512i _mm512_add_epi16 (__m512i a, __m512i b)
VPADDB __m512i _mm512_mask_add_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b)
VPADDW __m512i _mm512_mask_add_epi16 (__m512i s, __mmask32 m, __m512i a, __m512i b)
VPADDB __m512i _mm512_maskz_add_epi8 (__mmask64 m, __m512i a, __m512i b)
VPADDW __m512i _mm512_maskz_add_epi16 (__mmask32 m, __m512i a, __m512i b)
VPADDB __m256i _mm256_mask_add_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b)
VPADDW __m256i _mm256_mask_add_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b)
VPADDB __m256i _mm256_maskz_add_epi8 (__mmask32 m, __m256i a, __m256i b)
VPADDW __m256i _mm256_maskz_add_epi16 (__mmask16 m, __m256i a, __m256i b)
VPADDB __m128i _mm_mask_add_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b)
VPADDW __m128i _mm_mask_add_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b)
VPADDB __m128i _mm_maskz_add_epi8 (__mmask16 m, __m128i a, __m128i b)
VPADDW __m128i _mm_maskz_add_epi16 (__mmask8 m, __m128i a, __m128i b)
VPADDD __m512i _mm512_add_epi32( __m512i a, __m512i b);
VPADDD __m512i _mm512_mask_add_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPADDD __m512i _mm512_maskz_add_epi32( __mmask16 k, __m512i a, __m512i b);
VPADDD __m256i _mm256_mask_add_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPADDD __m256i _mm256_maskz_add_epi32( __mmask8 k, __m256i a, __m256i b);
VPADDD __m128i _mm_mask_add_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPADDD __m128i _mm_maskz_add_epi32( __mmask8 k, __m128i a, __m128i b);
VPADDQ __m512i _mm512_add_epi64( __m512i a, __m512i b);
VPADDQ __m512i _mm512_mask_add_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPADDQ __m512i _mm512_maskz_add_epi64( __mmask8 k, __m512i a, __m512i b);
VPADDQ __m256i _mm256_mask_add_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPADDQ __m256i _mm256_maskz_add_epi64( __mmask8 k, __m256i a, __m256i b);
VPADDQ __m128i _mm_mask_add_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPADDQ __m128i _mm_maskz_add_epi64( __mmask8 k, __m128i a, __m128i b);
PADDB __m128i _mm_add_epi8 (__m128i a,__m128i b );
PADDW __m128i _mm_add_epi16 ( __m128i a, __m128i b);
PADDD __m128i _mm_add_epi32 ( __m128i a, __m128i b);
PADDQ __m128i _mm_add_epi64 ( __m128i a, __m128i b);
VPADDB __m256i _mm256_add_epi8 (__m256i a, __m256i b);
VPADDW __m256i _mm256_add_epi16 ( __m256i a, __m256i b);
VPADDD __m256i _mm256_add_epi32 ( __m256i a, __m256i b);
VPADDQ __m256i _mm256_add_epi64 ( __m256i a, __m256i b);
PADDB __m64 _mm_add_pi8(__m64 m1, __m64 m2)
PADDW __m64 _mm_add_pi16(__m64 m1, __m64 m2)
PADDD __m64 _mm_add_pi32(__m64 m1, __m64 m2)
PADDQ __m64 _mm_add_si64(__m64 m1, __m64 m2)
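
As one usage sketch of the masked forms, the fragment below performs a zeroing-masked VPADDD through its intrinsic. It assumes an AVX512F-capable compiler and processor (e.g., gcc -mavx512f); the mask value and inputs are arbitrary.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i a = _mm512_set1_epi32(10);
    __m512i b = _mm512_set1_epi32(32);
    __mmask16 k = 0x00FF;                         /* update the low 8 lanes only */
    __m512i r = _mm512_maskz_add_epi32(k, a, b);  /* VPADDD with zeroing-masking */
    int out[16];
    _mm512_storeu_si512(out, r);
    printf("%d %d\n", out[0], out[15]);           /* 42 in masked-in lanes, 0 above */
    return 0;
}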

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPADDD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPADDB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



PADDSB/PADDSW—Add Packed Signed Integers with Signed Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F EC /r¹ PADDSB mm, mm/m64 | A | V/V | MMX | Add packed signed byte integers from mm/m64 and mm and saturate the results.
66 0F EC /r PADDSB xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed signed byte integers from xmm2/m128 and xmm1 and saturate the results.
NP 0F ED /r¹ PADDSW mm, mm/m64 | A | V/V | MMX | Add packed signed word integers from mm/m64 and mm and saturate the results.
66 0F ED /r PADDSW xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed signed word integers from xmm2/m128 and xmm1 and saturate the results.
VEX.128.66.0F.WIG EC /r VPADDSB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed signed byte integers from xmm3/m128 and xmm2 and saturate the results.
VEX.128.66.0F.WIG ED /r VPADDSW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed signed word integers from xmm3/m128 and xmm2 and saturate the results.
VEX.256.66.0F.WIG EC /r VPADDSB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
VEX.256.66.0F.WIG ED /r VPADDSW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
EVEX.128.66.0F.WIG EC /r VPADDSB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed signed byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG EC /r VPADDSB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG EC /r VPADDSB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed signed byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
EVEX.128.66.0F.WIG ED /r VPADDSW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed signed word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG ED /r VPADDSW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG ED /r VPADDSW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed signed word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD add of the packed signed integers from the source operand (second operand) and the destination
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with signed saturation, as described in the following paragraphs.
(V)PADDSB performs a SIMD add of the packed signed integers with saturation from the first source operand and
second source operand and stores the packed integer results in the destination operand. When an individual byte
result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value
of 7FH or 80H, respectively, is written to the destination operand.
(V)PADDSW performs a SIMD add of the packed signed word integers with saturation from the first source operand
and second source operand and stores the packed integer results in the destination operand. When an individual
word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the satu-
rated value of 7FFFH or 8000H, respectively, is written to the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a memory location. The destination operand is a ZMM/YMM/XMM register.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
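
The signed-saturation behavior is easy to observe with the legacy intrinsic. A minimal sketch, assuming an SSE2 target; the constants are arbitrary and chosen to force positive overflow.

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8(0x7F);     /* +127 in every byte */
    __m128i b = _mm_set1_epi8(0x01);
    __m128i r = _mm_adds_epi8(a, b);     /* PADDSB saturates at 0x7F */
    char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d\n", out[0]);              /* prints 127, not -128 */
    return 0;
}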

Operation
PADDSB (With 64-bit Operands)
DEST[7:0] := SaturateToSignedByte(DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToSignedByte(DEST[63:56] + SRC[63:56] );

PADDSB (With 128-bit Operands)


DEST[7:0] := SaturateToSignedByte (DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToSignedByte (DEST[127:120] + SRC[127:120]);

VPADDSB (VEX.128 Encoded Version)


DEST[7:0] := SaturateToSignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToSignedByte (SRC1[127:120] + SRC2[127:120]);
DEST[MAXVL-1:128] := 0

VPADDSB (VEX.256 Encoded Version)


DEST[7:0] := SaturateToSignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 31st bytes *)
DEST[255:248] := SaturateToSignedByte (SRC1[255:248] + SRC2[255:248]);



VPADDSB (EVEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToSignedByte (SRC1[i+7:i] + SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PADDSW (With 64-bit Operands)
DEST[15:0] := SaturateToSignedWord(DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd and 3rd words *)
DEST[63:48] := SaturateToSignedWord(DEST[63:48] + SRC[63:48]);

PADDSW (With 128-bit Operands)
DEST[15:0] := SaturateToSignedWord (DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] := SaturateToSignedWord (DEST[127:112] + SRC[127:112]);

VPADDSW (VEX.128 Encoded Version)


DEST[15:0] := SaturateToSignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] := SaturateToSignedWord (SRC1[127:112] + SRC2[127:112]);
DEST[MAXVL-1:128] := 0

VPADDSW (VEX.256 Encoded Version)


DEST[15:0] := SaturateToSignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 15th words *)
DEST[255:240] := SaturateToSignedWord (SRC1[255:240] + SRC2[255:240])

VPADDSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToSignedWord (SRC1[i+15:i] + SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalents
PADDSB __m64 _mm_adds_pi8(__m64 m1, __m64 m2)
(V)PADDSB __m128i _mm_adds_epi8 ( __m128i a, __m128i b)
VPADDSB __m256i _mm256_adds_epi8 ( __m256i a, __m256i b)
PADDSW __m64 _mm_adds_pi16(__m64 m1, __m64 m2)
(V)PADDSW __m128i _mm_adds_epi16 ( __m128i a, __m128i b)
VPADDSW __m256i _mm256_adds_epi16 ( __m256i a, __m256i b)
VPADDSB __m512i _mm512_adds_epi8 ( __m512i a, __m512i b)
VPADDSW __m512i _mm512_adds_epi16 ( __m512i a, __m512i b)
VPADDSB __m512i _mm512_mask_adds_epi8 ( __m512i s, __mmask64 m, __m512i a, __m512i b)
VPADDSW __m512i _mm512_mask_adds_epi16 ( __m512i s, __mmask32 m, __m512i a, __m512i b)
VPADDSB __m512i _mm512_maskz_adds_epi8 (__mmask64 m, __m512i a, __m512i b)
VPADDSW __m512i _mm512_maskz_adds_epi16 (__mmask32 m, __m512i a, __m512i b)
VPADDSB __m256i _mm256_mask_adds_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b)
VPADDSW __m256i _mm256_mask_adds_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b)
VPADDSB __m256i _mm256_maskz_adds_epi8 (__mmask32 m, __m256i a, __m256i b)
VPADDSW __m256i _mm256_maskz_adds_epi16 (__mmask16 m, __m256i a, __m256i b)
VPADDSB __m128i _mm_mask_adds_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b)
VPADDSW __m128i _mm_mask_adds_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b)
VPADDSB __m128i _mm_maskz_adds_epi8 (__mmask16 m, __m128i a, __m128i b)
VPADDSW __m128i _mm_maskz_adds_epi16 (__mmask8 m, __m128i a, __m128i b)
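
A usage sketch of one masked form, assuming an AVX512BW-capable compiler and processor (e.g., gcc -mavx512bw); the mask value and inputs are arbitrary.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i a = _mm512_set1_epi16(32767);
    __m512i b = _mm512_set1_epi16(1);
    __mmask32 k = 0x0000000F;                      /* update the low 4 words only */
    __m512i r = _mm512_maskz_adds_epi16(k, a, b);  /* VPADDSW with zeroing-masking */
    short out[32];
    _mm512_storeu_si512(out, r);
    printf("%d %d\n", out[0], out[4]);             /* 32767 (saturated), then 0 (masked off) */
    return 0;
}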

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



PADDUSB/PADDUSW—Add Packed Unsigned Integers With Unsigned Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F DC /r¹ PADDUSB mm, mm/m64 | A | V/V | MMX | Add packed unsigned byte integers from mm/m64 and mm and saturate the results.
66 0F DC /r PADDUSB xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed unsigned byte integers from xmm2/m128 and xmm1 and saturate the results.
NP 0F DD /r¹ PADDUSW mm, mm/m64 | A | V/V | MMX | Add packed unsigned word integers from mm/m64 and mm and saturate the results.
66 0F DD /r PADDUSW xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed unsigned word integers from xmm2/m128 to xmm1 and saturate the results.
VEX.128.66.0F.WIG DC /r VPADDUSB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed unsigned byte integers from xmm3/m128 to xmm2 and saturate the results.
VEX.128.66.0F.WIG DD /r VPADDUSW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed unsigned word integers from xmm3/m128 to xmm2 and saturate the results.
VEX.256.66.0F.WIG DC /r VPADDUSB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
VEX.256.66.0F.WIG DD /r VPADDUSW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
EVEX.128.66.0F.WIG DC /r VPADDUSB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DC /r VPADDUSB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DC /r VPADDUSB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed unsigned byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
EVEX.128.66.0F.WIG DD /r VPADDUSW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DD /r VPADDUSW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DD /r VPADDUSW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed unsigned word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD add of the packed unsigned integers from the source operand (second operand) and the destina-
tion operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with unsigned saturation, as described in the following paragraphs.
(V)PADDUSB performs a SIMD add of the packed unsigned integers with saturation from the first source operand
and second source operand and stores the packed integer results in the destination operand. When an individual
byte result is beyond the range of an unsigned byte integer (that is, greater than FFH), the saturated value of FFH
is written to the destination operand.
(V)PADDUSW performs a SIMD add of the packed unsigned word integers with saturation from the first source
operand and second source operand and stores the packed integer results in the destination operand. When an
individual word result is beyond the range of an unsigned word integer (that is, greater than FFFFH), the saturated
value of FFFFH is written to the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is a ZMM/YMM/XMM register.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding destination register are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
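
As an informative illustration (not part of the architectural specification), the following C fragment, compiled with SSE2 support, contrasts wraparound addition with the unsigned-saturating addition performed by (V)PADDUSB via the _mm_adds_epu8 intrinsic listed below:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8((char)0xF0);   /* sixteen copies of 240 */
    __m128i b = _mm_set1_epi8(0x20);         /* sixteen copies of 32 */
    __m128i wrap = _mm_add_epi8(a, b);       /* PADDB: 240 + 32 wraps to 16 */
    __m128i sat  = _mm_adds_epu8(a, b);      /* PADDUSB: saturates to FFH (255) */
    unsigned char w[16], s[16];
    _mm_storeu_si128((__m128i *)w, wrap);
    _mm_storeu_si128((__m128i *)s, sat);
    printf("wraparound=%u saturated=%u\n", w[0], s[0]);   /* prints 16 and 255 */
    return 0;
}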

Operation
PADDUSB (With 64-bit Operands)
DEST[7:0] := SaturateToUnsignedByte(DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToUnsignedByte(DEST[63:56] + SRC[63:56]);

PADDUSB (With 128-bit Operands)


DEST[7:0] := SaturateToUnsignedByte (DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToUnsignedByte (DEST[127:120] + SRC[127:120]);

VPADDUSB (VEX.128 Encoded Version)


DEST[7:0] := SaturateToUnsignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToUnsignedByte (SRC1[127:120] + SRC2[127:120]);
DEST[MAXVL-1:128] := 0

VPADDUSB (VEX.256 Encoded Version)
DEST[7:0] := SaturateToUnsignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 31st bytes *)
DEST[255:248] := SaturateToUnsignedByte (SRC1[255:248] + SRC2[255:248]);
DEST[MAXVL-1:256] := 0

PADDUSW (With 64-bit Operands)


DEST[15:0] := SaturateToUnsignedWord(DEST[15:0] + SRC[15:0] );
(* Repeat add operation for 2nd and 3rd words *)
DEST[63:48] := SaturateToUnsignedWord(DEST[63:48] + SRC[63:48] );

PADDUSW (With 128-bit Operands)


DEST[15:0] := SaturateToUnsignedWord (DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] := SaturateToUnsignedWord (DEST[127:112] + SRC[127:112]);

VPADDUSW (VEX.128 Encoded Version)


DEST[15:0] := SaturateToUnsignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] := SaturateToUnsignedWord (SRC1[127:112] + SRC2[127:112]);
DEST[MAXVL-1:128] := 0

VPADDUSW (VEX.256 Encoded Version)


DEST[15:0] := SaturateToUnsignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 15th words *)
DEST[255:240] := SaturateToUnsignedWord (SRC1[255:240] + SRC2[255:240]);
DEST[MAXVL-1:256] := 0

VPADDUSB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToUnsignedByte (SRC1[i+7:i] + SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPADDUSW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToUnsignedWord (SRC1[i+15:i] + SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


PADDUSB __m64 _mm_adds_pu8(__m64 m1, __m64 m2)
PADDUSW __m64 _mm_adds_pu16(__m64 m1, __m64 m2)
(V)PADDUSB __m128i _mm_adds_epu8 ( __m128i a, __m128i b)
(V)PADDUSW __m128i _mm_adds_epu16 ( __m128i a, __m128i b)
VPADDUSB __m256i _mm256_adds_epu8 ( __m256i a, __m256i b)
VPADDUSW __m256i _mm256_adds_epu16 ( __m256i a, __m256i b)
VPADDUSB __m512i _mm512_adds_epu8 ( __m512i a, __m512i b)
VPADDUSW __m512i _mm512_adds_epu16 ( __m512i a, __m512i b)
VPADDUSB __m512i _mm512_mask_adds_epu8 ( __m512i s, __mmask64 m, __m512i a, __m512i b)
VPADDUSW __m512i _mm512_mask_adds_epu16 ( __m512i s, __mmask32 m, __m512i a, __m512i b)
VPADDUSB __m512i _mm512_maskz_adds_epu8 (__mmask64 m, __m512i a, __m512i b)
VPADDUSW __m512i _mm512_maskz_adds_epu16 (__mmask32 m, __m512i a, __m512i b)
VPADDUSB __m256i _mm256_mask_adds_epu8 (__m256i s, __mmask32 m, __m256i a, __m256i b)
VPADDUSW __m256i _mm256_mask_adds_epu16 (__m256i s, __mmask16 m, __m256i a, __m256i b)
VPADDUSB __m256i _mm256_maskz_adds_epu8 (__mmask32 m, __m256i a, __m256i b)
VPADDUSW __m256i _mm256_maskz_adds_epu16 (__mmask16 m, __m256i a, __m256i b)
VPADDUSB __m128i _mm_mask_adds_epu8 (__m128i s, __mmask16 m, __m128i a, __m128i b)
VPADDUSW __m128i _mm_mask_adds_epu16 (__m128i s, __mmask8 m, __m128i a, __m128i b)
VPADDUSB __m128i _mm_maskz_adds_epu8 (__mmask16 m, __m128i a, __m128i b)
VPADDUSW __m128i _mm_maskz_adds_epu16 (__mmask8 m, __m128i a, __m128i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PALIGNR—Packed Align Right
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 3A 0F /r ib1 A V/V SSSE3 Concatenate destination and source operands,
PALIGNR mm1, mm2/m64, imm8 extract byte-aligned result shifted to the right by
constant value in imm8 into mm1.
66 0F 3A 0F /r ib A V/V SSSE3 Concatenate destination and source operands,
PALIGNR xmm1, xmm2/m128, imm8 extract byte-aligned result shifted to the right by
constant value in imm8 into xmm1.
VEX.128.66.0F3A.WIG 0F /r ib B V/V AVX Concatenate xmm2 and xmm3/m128, extract byte
VPALIGNR xmm1, xmm2, xmm3/m128, aligned result shifted to the right by constant value in
imm8 imm8 and result is stored in xmm1.

VEX.256.66.0F3A.WIG 0F /r ib B V/V AVX2 Concatenate pairs of 16 bytes in ymm2 and
VPALIGNR ymm1, ymm2, ymm3/m256, ymm3/m256 into 32-byte intermediate result,
imm8 extract byte-aligned, 16-byte result shifted to the
right by constant values in imm8 from each
intermediate result, and two 16-byte results are
stored in ymm1.
EVEX.128.66.0F3A.WIG 0F /r ib C V/V (AVX512VL AND Concatenate xmm2 and xmm3/m128 into a 32-byte
VPALIGNR xmm1 {k1}{z}, xmm2, AVX512BW) OR intermediate result, extract byte aligned result
xmm3/m128, imm8 AVX10.12 shifted to the right by constant value in imm8 and
result is stored in xmm1.
EVEX.256.66.0F3A.WIG 0F /r ib C V/V (AVX512VL AND Concatenate pairs of 16 bytes in ymm2 and
VPALIGNR ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 into 32-byte intermediate result,
ymm3/m256, imm8 AVX10.12 extract byte-aligned, 16-byte result shifted to the
right by constant values in imm8 from each
intermediate result, and two 16-byte results are
stored in ymm1.
EVEX.512.66.0F3A.WIG 0F /r ib C V/V AVX512BW Concatenate pairs of 16 bytes in zmm2 and
VPALIGNR zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into 32-byte intermediate result,
zmm3/m512, imm8 extract byte-aligned, 16-byte result shifted to the
right by constant values in imm8 from each
intermediate result, and four 16-byte results are
stored in zmm1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
(V)PALIGNR concatenates the destination operand (the first operand) and the source operand (the second
operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant imme-
diate, and extracts the right-aligned result into the destination. The first and the second operands can be an MMX,
XMM or a YMM register. The immediate value is considered unsigned. Immediate shift counts larger than 2L
(i.e., 32 for 128-bit operands, or 16 for 64-bit operands) produce a zero result. Both operands can be MMX regis-
ters, XMM registers or YMM registers. When the source operand is a 128-bit memory operand, the operand must
be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
In 64-bit mode and not encoded by VEX/EVEX prefix, use the REX prefix to access additional registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
EVEX.512 encoded version: The first source operand is a ZMM register and contains four 16-byte blocks. The
second source operand is a ZMM register or a 512-bit memory location containing four 16-byte blocks. The destina-
tion operand is a ZMM register and contains four 16-byte results. The imm8[7:0] is the common shift count
used for each of the four successive 16-byte block sources. The low 16-byte block of the two source operands
produce the low 16-byte result of the destination operand, the high 16-byte block of the two source operands
produce the high 16-byte result of the destination operand and so on for the blocks in the middle.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register and contains two 16-byte
blocks. The second source operand is a YMM register or a 256-bit memory location containing two 16-byte blocks.
The destination operand is a YMM register and contains two 16-byte results. The imm8[7:0] is the common shift
count used for the two lower 16-byte block sources and the two upper 16-byte block sources. The low 16-byte
block of the two source operands produce the low 16-byte result of the destination operand, the high 16-byte block
of the two source operands produce the high 16-byte result of the destination operand. The upper bits (MAXVL-
1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-
1:128) of the corresponding ZMM register destination are zeroed.
Concatenation is done with 128-bit data in the first and second source operand for both 128-bit and 256-bit
instructions. The high 128 bits of the intermediate composite 256-bit result come from the 128-bit data of the
first source operand; the low 128 bits of the intermediate result come from the 128-bit data of the second source
operand.
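
As an informative sketch (not part of the specification), this C fragment, assuming SSSE3 support, uses the _mm_alignr_epi8 intrinsic listed below to extract a 16-byte window that straddles two consecutive 16-byte blocks, the typical use of (V)PALIGNR in sliding-window loops:

#include <tmmintrin.h>   /* SSSE3 intrinsics */
#include <stdio.h>

int main(void)
{
    /* Two consecutive 16-byte blocks holding bytes 0..31. */
    __m128i lo = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m128i hi = _mm_setr_epi8(16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
    /* ((hi << 128) OR lo) >> (5*8): the 16-byte window starting at byte 5. */
    __m128i win = _mm_alignr_epi8(hi, lo, 5);
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, win);
    printf("%u .. %u\n", out[0], out[15]);   /* prints 5 .. 20 */
    return 0;
}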

[Figure 4-7. 256-bit VPALIGN Instruction Operation: for each 128-bit lane, SRC1 is concatenated above SRC2, the 256-bit composite is shifted right by Imm8[7:0]*8 bits, and the low 128 bits of each shifted composite form the corresponding lane of DEST.]

Operation
PALIGNR (With 64-bit Operands)
temp1[127:0] := CONCATENATE(DEST,SRC)>>(imm8*8)
DEST[63:0] := temp1[63:0]

PALIGNR (With 128-bit Operands)


temp1[255:0] := ((DEST[127:0] << 128) OR SRC[127:0])>>(imm8*8);
DEST[127:0] := temp1[127:0]
DEST[MAXVL-1:128] (Unmodified)

VPALIGNR (VEX.128 Encoded Version)


temp1[255:0] := ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8*8);
DEST[127:0] := temp1[127:0]
DEST[MAXVL-1:128] := 0

VPALIGNR (VEX.256 Encoded Version)


temp1[255:0] := ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8[7:0]*8);
DEST[127:0] := temp1[127:0]
temp1[255:0] := ((SRC1[255:128] << 128) OR SRC2[255:128])>>(imm8[7:0]*8);
DEST[255:128] := temp1[127:0]
DEST[MAXVL-1:256] := 0

VPALIGNR (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR l := 0 TO VL-1 with increments of 128


temp1[255:0] := ((SRC1[l+127:l] << 128) OR SRC2[l+127:l])>>(imm8[7:0]*8);
TMP_DEST[l+127:l] := temp1[127:0]
ENDFOR;

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents
PALIGNR __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n)
(V)PALIGNR __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n)
VPALIGNR __m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int n)
VPALIGNR __m512i _mm512_alignr_epi8 (__m512i a, __m512i b, const int n)
VPALIGNR __m512i _mm512_mask_alignr_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b, const int n)
VPALIGNR __m512i _mm512_maskz_alignr_epi8 ( __mmask64 m, __m512i a, __m512i b, const int n)
VPALIGNR __m256i _mm256_mask_alignr_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b, const int n)
VPALIGNR __m256i _mm256_maskz_alignr_epi8 (__mmask32 m, __m256i a, __m256i b, const int n)
VPALIGNR __m128i _mm_mask_alignr_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b, const int n)
VPALIGNR __m128i _mm_maskz_alignr_epi8 (__mmask16 m, __m128i a, __m128i b, const int n)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

PAND—Logical AND
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F DB /r1 A V/V MMX Bitwise AND mm/m64 and mm.
PAND mm, mm/m64
66 0F DB /r A V/V SSE2 Bitwise AND of xmm2/m128 and xmm1.
PAND xmm1, xmm2/m128
VEX.128.66.0F.WIG DB /r B V/V AVX Bitwise AND of xmm3/m128 and xmm2.
VPAND xmm1, xmm2, xmm3/m128
VEX.256.66.0F.WIG DB /r B V/V AVX2 Bitwise AND of ymm2 and ymm3/m256 and store
VPAND ymm1, ymm2, ymm3/m256 result in ymm1.
EVEX.128.66.0F.W0 DB /r C V/V (AVX512VL AND Bitwise AND of packed doubleword integers in
VPANDD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm2 and xmm3/m128/m32bcst and store result
xmm3/m128/m32bcst AVX10.12 in xmm1 using writemask k1.
EVEX.256.66.0F.W0 DB /r C V/V (AVX512VL AND Bitwise AND of packed doubleword integers in
VPANDD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm2 and ymm3/m256/m32bcst and store result
ymm3/m256/m32bcst AVX10.12 in ymm1 using writemask k1.
EVEX.512.66.0F.W0 DB /r C V/V AVX512F Bitwise AND of packed doubleword integers in
VPANDD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m32bcst and store result
zmm3/m512/m32bcst in zmm1 using writemask k1.
EVEX.128.66.0F.W1 DB /r C V/V (AVX512VL AND Bitwise AND of packed quadword integers in xmm2
VPANDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst and store result in
xmm3/m128/m64bcst AVX10.12 xmm1 using writemask k1.
EVEX.256.66.0F.W1 DB /r C V/V (AVX512VL AND Bitwise AND of packed quadword integers in ymm2
VPANDQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst and store result in
ymm3/m256/m64bcst AVX10.12 ymm1 using writemask k1.
EVEX.512.66.0F.W1 DB /r C V/V AVX512F Bitwise AND of packed quadword integers in zmm2
VPANDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512/m64bcst and store result in
zmm3/m512/m64bcst zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND operation on the first source operand and second source operand and stores the
result in the destination operand. Each bit of the result is set to 1 if the corresponding bits of the first and second
operands are 1, otherwise it is set to 0.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1 at 32/64-bit granularity.
VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
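
As an informative sketch (not architectural material), the following C fragment uses the _mm_and_si128 intrinsic listed below to isolate the low nibble of every byte, a common masking idiom; the EVEX forms additionally allow the same operation under a writemask and with an embedded broadcast of the second operand:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i v    = _mm_set1_epi8((char)0xAB);
    __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo   = _mm_and_si128(v, mask);   /* generates (V)PAND */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, lo);
    printf("%02X\n", out[0]);                /* prints 0B */
    return 0;
}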

Operation
PAND (64-bit Operand)
DEST := DEST AND SRC

PAND (128-bit Legacy SSE Version)


DEST := DEST AND SRC
DEST[MAXVL-1:128] (Unmodified)

VPAND (VEX.128 Encoded Version)


DEST := SRC1 AND SRC2
DEST[MAXVL-1:128] := 0

VPAND (VEX.256 Encoded Instruction)


DEST[255:0] := (SRC1[255:0] AND SRC2[255:0])
DEST[MAXVL-1:256] := 0

VPANDD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[31:0]
ELSE DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPANDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[63:0]
ELSE DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPANDD __m512i _mm512_and_epi32( __m512i a, __m512i b);
VPANDD __m512i _mm512_mask_and_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPANDD __m512i _mm512_maskz_and_epi32( __mmask16 k, __m512i a, __m512i b);
VPANDQ __m512i _mm512_and_epi64( __m512i a, __m512i b);
VPANDQ __m512i _mm512_mask_and_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPANDQ __m512i _mm512_maskz_and_epi64( __mmask8 k, __m512i a, __m512i b);
VPANDD __m256i _mm256_mask_and_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDD __m256i _mm256_maskz_and_epi32( __mmask8 k, __m256i a, __m256i b);
VPANDD __m128i _mm_mask_and_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDD __m128i _mm_maskz_and_epi32( __mmask8 k, __m128i a, __m128i b);
VPANDQ __m256i _mm256_mask_and_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDQ __m256i _mm256_maskz_and_epi64( __mmask8 k, __m256i a, __m256i b);
VPANDQ __m128i _mm_mask_and_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDQ __m128i _mm_maskz_and_epi64( __mmask8 k, __m128i a, __m128i b);
PAND __m64 _mm_and_si64 (__m64 m1, __m64 m2)
(V)PAND __m128i _mm_and_si128 ( __m128i a, __m128i b)
VPAND __m256i _mm256_and_si256 ( __m256i a, __m256i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PANDN—Logical AND NOT
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F DF /r1 A V/V MMX Bitwise AND NOT of mm/m64 and mm.
PANDN mm, mm/m64
66 0F DF /r A V/V SSE2 Bitwise AND NOT of xmm2/m128 and xmm1.
PANDN xmm1, xmm2/m128
VEX.128.66.0F.WIG DF /r B V/V AVX Bitwise AND NOT of xmm3/m128 and xmm2.
VPANDN xmm1, xmm2, xmm3/m128
VEX.256.66.0F.WIG DF /r B V/V AVX2 Bitwise AND NOT of ymm2 and ymm3/m256
VPANDN ymm1, ymm2, ymm3/m256 and store result in ymm1.
EVEX.128.66.0F.W0 DF /r C V/V (AVX512VL AND Bitwise AND NOT of packed doubleword
VPANDND xmm1 {k1}{z}, xmm2, AVX512F) OR integers in xmm2 and xmm3/m128/m32bcst
xmm3/m128/m32bcst AVX10.12 and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W0 DF /r C V/V (AVX512VL AND Bitwise AND NOT of packed doubleword
VPANDND ymm1 {k1}{z}, ymm2, AVX512F) OR integers in ymm2 and ymm3/m256/m32bcst
ymm3/m256/m32bcst AVX10.12 and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W0 DF /r C V/V AVX512F Bitwise AND NOT of packed doubleword
VPANDND zmm1 {k1}{z}, zmm2, OR AVX10.12 integers in zmm2 and zmm3/m512/m32bcst
zmm3/m512/m32bcst and store result in zmm1 using writemask k1.
EVEX.128.66.0F.W1 DF /r C V/V (AVX512VL AND Bitwise AND NOT of packed quadword
VPANDNQ xmm1 {k1}{z}, xmm2, AVX512F) OR integers in xmm2 and xmm3/m128/m64bcst
xmm3/m128/m64bcst AVX10.12 and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W1 DF /r C V/V (AVX512VL AND Bitwise AND NOT of packed quadword
VPANDNQ ymm1 {k1}{z}, ymm2, AVX512F) OR integers in ymm2 and ymm3/m256/m64bcst
ymm3/m256/m64bcst AVX10.12 and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W1 DF /r C V/V AVX512F Bitwise AND NOT of packed quadword
VPANDNQ zmm1 {k1}{z}, zmm2, OR AVX10.12 integers in zmm2 and zmm3/m512/m64bcst
zmm3/m512/m64bcst and store result in zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical NOT operation on the first source operand, then performs bitwise AND with second
source operand and stores the result in the destination operand. Each bit of the result is set to 1 if the corre-
sponding bit in the first operand is 0 and the corresponding bit in the second operand is 1, otherwise it is set to 0.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1 at 32/64-bit granularity.
VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
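
Because the complement is applied to the first source, the corresponding _mm_andnot_si128 intrinsic computes NOT(first) AND second; a bit-clearing idiom therefore passes the clear mask as the first argument. An informative sketch (SSE2 assumed):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i v     = _mm_set1_epi8((char)0xFF);
    __m128i clear = _mm_set1_epi8((char)0x81);      /* bits to force to 0 */
    __m128i r     = _mm_andnot_si128(clear, v);     /* (NOT clear) AND v */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%02X\n", out[0]);                       /* prints 7E */
    return 0;
}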

Operation
PANDN (64-bit Operand)
DEST := NOT(DEST) AND SRC

PANDN (128-bit Legacy SSE Version)


DEST := NOT(DEST) AND SRC
DEST[MAXVL-1:128] (Unmodified)

VPANDN (VEX.128 Encoded Version)


DEST := NOT(SRC1) AND SRC2
DEST[MAXVL-1:128] := 0

VPANDN (VEX.256 Encoded Instruction)


DEST[255:0] := ((NOT SRC1[255:0]) AND SRC2[255:0])
DEST[MAXVL-1:256] := 0

VPANDND (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := ((NOT SRC1[i+31:i]) AND SRC2[31:0])
ELSE DEST[i+31:i] := ((NOT SRC1[i+31:i]) AND SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPANDNQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := ((NOT SRC1[i+63:i]) AND SRC2[63:0])
ELSE DEST[i+63:i] := ((NOT SRC1[i+63:i]) AND SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPANDND __m512i _mm512_andnot_epi32( __m512i a, __m512i b);
VPANDND __m512i _mm512_mask_andnot_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPANDND __m512i _mm512_maskz_andnot_epi32( __mmask16 k, __m512i a, __m512i b);
VPANDND __m256i _mm256_mask_andnot_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDND __m256i _mm256_maskz_andnot_epi32( __mmask8 k, __m256i a, __m256i b);
VPANDND __m128i _mm_mask_andnot_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDND __m128i _mm_maskz_andnot_epi32( __mmask8 k, __m128i a, __m128i b);
VPANDNQ __m512i _mm512_andnot_epi64( __m512i a, __m512i b);
VPANDNQ __m512i _mm512_mask_andnot_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPANDNQ __m512i _mm512_maskz_andnot_epi64( __mmask8 k, __m512i a, __m512i b);
VPANDNQ __m256i _mm256_mask_andnot_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDNQ __m256i _mm256_maskz_andnot_epi64( __mmask8 k, __m256i a, __m256i b);
VPANDNQ __m128i _mm_mask_andnot_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDNQ __m128i _mm_maskz_andnot_epi64( __mmask8 k, __m128i a, __m128i b);
PANDN __m64 _mm_andnot_si64 (__m64 m1, __m64 m2)
(V)PANDN __m128i _mm_andnot_si128 ( __m128i a, __m128i b)
VPANDN __m256i _mm256_andnot_si256 ( __m256i a, __m256i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PAVGB/PAVGW—Average Packed Integers
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E0 /r1 A V/V SSE Average packed unsigned byte integers from
PAVGB mm1, mm2/m64 mm2/m64 and mm1 with rounding.
66 0F E0 /r A V/V SSE2 Average packed unsigned byte integers from
PAVGB xmm1, xmm2/m128 xmm2/m128 and xmm1 with rounding.
NP 0F E3 /r1 A V/V SSE Average packed unsigned word integers from
PAVGW mm1, mm2/m64 mm2/m64 and mm1 with rounding.
66 0F E3 /r A V/V SSE2 Average packed unsigned word integers from
PAVGW xmm1, xmm2/m128 xmm2/m128 and xmm1 with rounding.
VEX.128.66.0F.WIG E0 /r B V/V AVX Average packed unsigned byte integers from
VPAVGB xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 with rounding.
VEX.128.66.0F.WIG E3 /r B V/V AVX Average packed unsigned word integers from
VPAVGW xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 with rounding.
VEX.256.66.0F.WIG E0 /r B V/V AVX2 Average packed unsigned byte integers from
VPAVGB ymm1, ymm2, ymm3/m256 ymm2 and ymm3/m256 with rounding and
store to ymm1.
VEX.256.66.0F.WIG E3 /r B V/V AVX2 Average packed unsigned word integers from
VPAVGW ymm1, ymm2, ymm3/m256 ymm2, ymm3/m256 with rounding to ymm1.

EVEX.128.66.0F.WIG E0 /r C V/V (AVX512VL AND Average packed unsigned byte integers from
VPAVGB xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR xmm2, and xmm3/m128 with rounding and
AVX10.12 store to xmm1 under writemask k1.
EVEX.256.66.0F.WIG E0 /r C V/V (AVX512VL AND Average packed unsigned byte integers from
VPAVGB ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR ymm2, and ymm3/m256 with rounding and
AVX10.12 store to ymm1 under writemask k1.
EVEX.512.66.0F.WIG E0 /r C V/V AVX512BW Average packed unsigned byte integers from
VPAVGB zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 zmm2, and zmm3/m512 with rounding and
store to zmm1 under writemask k1.
EVEX.128.66.0F.WIG E3 /r C V/V (AVX512VL AND Average packed unsigned word integers from
VPAVGW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR xmm2, xmm3/m128 with rounding to xmm1
AVX10.12 under writemask k1.
EVEX.256.66.0F.WIG E3 /r C V/V (AVX512VL AND Average packed unsigned word integers from
VPAVGW ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR ymm2, ymm3/m256 with rounding to ymm1
AVX10.12 under writemask k1.
EVEX.512.66.0F.WIG E3 /r C V/V AVX512BW Average packed unsigned word integers from
VPAVGW zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 zmm2, zmm3/m512 with rounding to zmm1
under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the
destination operand (first operand), and stores the results in the destination operand. For each corresponding pair
of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary
sum, and that result is shifted right one bit position.
The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed
unsigned words.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register or a 512-bit memory location. The destination operand is a ZMM register.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-
1:128) of the corresponding register destination are zeroed.
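
As an informative sketch (not part of the specification), the following C fragment uses the _mm_avg_epu8 intrinsic listed below; note that the +1 bias makes the average round up on odd sums, e.g., avg(10, 13) = (10 + 13 + 1) >> 1 = 12:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8(10);
    __m128i b = _mm_set1_epi8(13);
    __m128i avg = _mm_avg_epu8(a, b);   /* generates (V)PAVGB */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, avg);
    printf("%u\n", out[0]);             /* prints 12 */
    return 0;
}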

Operation
PAVGB (With 64-bit Operands)
DEST[7:0] := (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 7 *)
DEST[63:56] := (SRC[63:56] + DEST[63:56] + 1) >> 1;

PAVGW (With 64-bit Operands)


DEST[15:0] := (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 and 3 *)
DEST[63:48] := (SRC[63:48] + DEST[63:48] + 1) >> 1;

PAVGB (With 128-bit Operands)


DEST[7:0] := (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 15 *)
DEST[127:120] := (SRC[127:120] + DEST[127:120] + 1) >> 1;

PAVGW (With 128-bit Operands)


DEST[15:0] := (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 through 7 *)
DEST[127:112] := (SRC[127:112] + DEST[127:112] + 1) >> 1;

VPAVGB (VEX.128 Encoded Version)
DEST[7:0] := (SRC1[7:0] + SRC2[7:0] + 1) >> 1;
(* Repeat operation performed for bytes 2 through 15 *)
DEST[127:120] := (SRC1[127:120] + SRC2[127:120] + 1) >> 1
DEST[MAXVL-1:128] := 0

VPAVGW (VEX.128 Encoded Version)


DEST[15:0] := (SRC1[15:0] + SRC2[15:0] + 1) >> 1;
(* Repeat operation performed for 16-bit words 2 through 7 *)
DEST[127:112] := (SRC1[127:112] + SRC2[127:112] + 1) >> 1
DEST[MAXVL-1:128] := 0

VPAVGB (VEX.256 Encoded Instruction)


DEST[7:0] := (SRC1[7:0] + SRC2[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 31 *)
DEST[255:248] := (SRC1[255:248] + SRC2[255:248] + 1) >> 1;
DEST[MAXVL-1:256] := 0

VPAVGW (VEX.256 Encoded Instruction)


DEST[15:0] := (SRC1[15:0] + SRC2[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 through 15 *)
DEST[255:240] := (SRC1[255:240] + SRC2[255:240] + 1) >> 1;
DEST[MAXVL-1:256] := 0

VPAVGB (EVEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := (SRC1[i+7:i] + SRC2[i+7:i] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPAVGW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := (SRC1[i+15:i] + SRC2[i+15:i] + 1) >> 1
; (* Temp sum before shifting is 17 bits *)
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents
VPAVGB __m512i _mm512_avg_epu8( __m512i a, __m512i b);
VPAVGW __m512i _mm512_avg_epu16( __m512i a, __m512i b);
VPAVGB __m512i _mm512_mask_avg_epu8(__m512i s, __mmask64 m, __m512i a, __m512i b);
VPAVGW __m512i _mm512_mask_avg_epu16(__m512i s, __mmask32 m, __m512i a, __m512i b);
VPAVGB __m512i _mm512_maskz_avg_epu8( __mmask64 m, __m512i a, __m512i b);
VPAVGW __m512i _mm512_maskz_avg_epu16( __mmask32 m, __m512i a, __m512i b);
VPAVGB __m256i _mm256_mask_avg_epu8(__m256i s, __mmask32 m, __m256i a, __m256i b);
VPAVGW __m256i _mm256_mask_avg_epu16(__m256i s, __mmask16 m, __m256i a, __m256i b);
VPAVGB __m256i _mm256_maskz_avg_epu8( __mmask32 m, __m256i a, __m256i b);
VPAVGW __m256i _mm256_maskz_avg_epu16( __mmask16 m, __m256i a, __m256i b);
VPAVGB __m128i _mm_mask_avg_epu8(__m128i s, __mmask16 m, __m128i a, __m128i b);
VPAVGW __m128i _mm_mask_avg_epu16(__m128i s, __mmask8 m, __m128i a, __m128i b);
VPAVGB __m128i _mm_maskz_avg_epu8( __mmask16 m, __m128i a, __m128i b);
VPAVGW __m128i _mm_maskz_avg_epu16( __mmask8 m, __m128i a, __m128i b);
PAVGB __m64 _mm_avg_pu8 (__m64 a, __m64 b)
PAVGW __m64 _mm_avg_pu16 (__m64 a, __m64 b)
(V)PAVGB __m128i _mm_avg_epu8 ( __m128i a, __m128i b)
(V)PAVGW __m128i _mm_avg_epu16 ( __m128i a, __m128i b)
VPAVGB __m256i _mm256_avg_epu8 ( __m256i a, __m256i b)
VPAVGW __m256i _mm256_avg_epu16 ( __m256i a, __m256i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PCLMULQDQ—Carry-Less Multiplication Quadword
Opcode/ Op/ 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
66 0F 3A 44 /r ib A V/V PCLMULQDQ Carry-less multiplication of one quadword of
PCLMULQDQ xmm1, xmm2/m128, imm8 xmm1 by one quadword of xmm2/m128,
stores the 128-bit result in xmm1. The imme-
diate is used to determine which quadwords
of xmm1 and xmm2/m128 should be used.
VEX.128.66.0F3A.WIG 44 /r ib B V/V PCLMULQDQ Carry-less multiplication of one quadword of
VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8 AVX xmm2 by one quadword of xmm3/m128,
stores the 128-bit result in xmm1. The imme-
diate is used to determine which quadwords
of xmm2 and xmm3/m128 should be used.
VEX.256.66.0F3A.WIG 44 /r ib B V/V VPCLMULQDQ Carry-less multiplication of one quadword of
VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8 AVX ymm2 by one quadword of ymm3/m256,
stores the 128-bit result in ymm1. The imme-
diate is used to determine which quadwords
of ymm2 and ymm3/m256 should be used.
EVEX.128.66.0F3A.WIG 44 /r ib C V/V VPCLMULQDQ Carry-less multiplication of one quadword of
VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8 (AVX512VL xmm2 by one quadword of xmm3/m128,
OR AVX10.11) stores the 128-bit result in xmm1. The imme-
diate is used to determine which quadwords
of xmm2 and xmm3/m128 should be used.
EVEX.256.66.0F3A.WIG 44 /r ib C V/V VPCLMULQDQ Carry-less multiplication of one quadword of
VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8 (AVX512VL ymm2 by one quadword of ymm3/m256,
OR AVX10.11) stores the 128-bit result in ymm1. The imme-
diate is used to determine which quadwords
of ymm2 and ymm3/m256 should be used.
EVEX.512.66.0F3A.WIG 44 /r ib C V/V VPCLMULQDQ Carry-less multiplication of one quadword of
VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8 (AVX512F zmm2 by one quadword of zmm3/m512,
OR AVX10.11) stores the 128-bit result in zmm1. The imme-
diate is used to determine which quadwords
of zmm2 and zmm3/m512 should be used.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand
according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to
use according to Table 4-13; the other bits of the immediate byte are ignored.
The EVEX encoded form of this instruction does not support memory fault suppression.

Table 4-13. PCLMULQDQ Quadword Selection of Immediate Byte
Imm[4] Imm[0] PCLMULQDQ Operation
0 0 CL_MUL( SRC21[63:0], SRC1[63:0] )
0 1 CL_MUL( SRC2[63:0], SRC1[127:64] )
1 0 CL_MUL( SRC2[127:64], SRC1[63:0] )
1 1 CL_MUL( SRC2[127:64], SRC1[127:64] )
NOTES:
1. SRC2 denotes the second source operand, which can be a register or memory; SRC1 denotes the first source and destination oper-
and.

The first source operand and the destination operand are the same and must be a ZMM/YMM/XMM register. The
second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. Bits (VL_MAX-
1:128) of the corresponding YMM destination register remain unchanged.
Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the
required encoding for imm8.

Table 4-14. Pseudo-Op and PCLMULQDQ Implementation


Pseudo-Op Imm8 Encoding
PCLMULLQLQDQ xmm1, xmm2 0000_0000B
PCLMULHQLQDQ xmm1, xmm2 0000_0001B
PCLMULLQHQDQ xmm1, xmm2 0001_0000B
PCLMULHQHQDQ xmm1, xmm2 0001_0001B
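
As an informative sketch (not part of the specification), the following C fragment, compiled with PCLMULQDQ support (e.g., the -mpclmul option on GCC/Clang, a toolchain assumption), multiplies the polynomials x and x + 1 over GF(2) using the _mm_clmulepi64_si128 intrinsic; imm8 = 00H selects the low quadword of each operand, matching the PCLMULLQLQDQ pseudo-op above:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <wmmintrin.h>   /* PCLMULQDQ intrinsic */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set_epi64x(0, 2);   /* polynomial x */
    __m128i b = _mm_set_epi64x(0, 3);   /* polynomial x + 1 */
    __m128i r = _mm_clmulepi64_si128(a, b, 0x00);
    /* x * (x + 1) = x^2 + x over GF(2), i.e., 6; XOR replaces carries. */
    printf("%llx\n", (unsigned long long)_mm_cvtsi128_si64(r));   /* prints 6 */
    return 0;
}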

Operation
define PCLMUL128(X,Y): // helper function
FOR i := 0 to 63:
TMP [ i ] := X[ 0 ] and Y[ i ]
FOR j := 1 to i:
TMP [ i ] := TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] := TMP[ i ]
FOR i := 64 to 126:
TMP [ i ] := 0
FOR j := i - 63 to 63:
TMP [ i ] := TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] := TMP[ i ]
DEST[127] := 0;
RETURN DEST // 128b vector
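
For readers who want to test the helper above outside of SIMD code, here is a minimal scalar C model of carry-less multiplication; the function name clmul64 is illustrative, not an Intel API:

#include <stdint.h>
#include <stdio.h>

/* Carry-less (GF(2)) multiply of two 64-bit operands into a 128-bit hi:lo
   result, mirroring PCLMUL128: partial products are combined with XOR. */
static void clmul64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++) {
        if ((y >> i) & 1) {
            l ^= x << i;
            if (i != 0)
                h ^= x >> (64 - i);   /* bits shifted past bit 63 */
        }
    }
    *hi = h;   /* bit 127 is always 0, as in the pseudocode */
    *lo = l;
}

int main(void)
{
    uint64_t hi, lo;
    clmul64(2, 3, &hi, &lo);
    printf("%llx:%llx\n", (unsigned long long)hi, (unsigned long long)lo);   /* prints 0:6 */
    return 0;
}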

PCLMULQDQ (SSE Version)
IF imm8[0] = 0:
TEMP1 := SRC1.qword[0]
ELSE:
TEMP1 := SRC1.qword[1]
IF imm8[4] = 0:
TEMP2 := SRC2.qword[0]
ELSE:
TEMP2 := SRC2.qword[1]
DEST[127:0] := PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:128] (Unmodified)

VPCLMULQDQ (128b and 256b VEX Encoded Versions)


(KL,VL) = (1,128), (2,256)
FOR i= 0 to KL-1:
IF imm8[0] = 0:
TEMP1 := SRC1.xmm[i].qword[0]
ELSE:
TEMP1 := SRC1.xmm[i].qword[1]
IF imm8[4] = 0:
TEMP2 := SRC2.xmm[i].qword[0]
ELSE:
TEMP2 := SRC2.xmm[i].qword[1]
DEST.xmm[i] := PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] := 0

VPCLMULQDQ (EVEX Encoded Version)


(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
IF imm8[0] = 0:
TEMP1 := SRC1.xmm[i].qword[0]
ELSE:
TEMP1 := SRC1.xmm[i].qword[1]
IF imm8[4] = 0:
TEMP2 := SRC2.xmm[i].qword[0]
ELSE:
TEMP2 := SRC2.xmm[i].qword[1]
DEST.xmm[i] := PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


(V)PCLMULQDQ __m128i _mm_clmulepi64_si128 (__m128i, __m128i, const int)
VPCLMULQDQ __m256i _mm256_clmulepi64_epi128(__m256i, __m256i, const int);
VPCLMULQDQ __m512i _mm512_clmulepi64_epi128(__m512i, __m512i, const int);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”

PCMPEQB/PCMPEQW/PCMPEQD— Compare Packed Data for Equal
Opcode/ Op/ En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
NP 0F 74 /r1 A V/V MMX Compare packed bytes in mm/m64 and mm for
PCMPEQB mm, mm/m64 equality.
66 0F 74 /r A V/V SSE2 Compare packed bytes in xmm2/m128 and
PCMPEQB xmm1, xmm2/m128 xmm1 for equality.
NP 0F 75 /r1 A V/V MMX Compare packed words in mm/m64 and mm
PCMPEQW mm, mm/m64 for equality.
66 0F 75 /r A V/V SSE2 Compare packed words in xmm2/m128 and
PCMPEQW xmm1, xmm2/m128 xmm1 for equality.
NP 0F 76 /r1 A V/V MMX Compare packed doublewords in mm/m64 and
PCMPEQD mm, mm/m64 mm for equality.
66 0F 76 /r A V/V SSE2 Compare packed doublewords in xmm2/m128
PCMPEQD xmm1, xmm2/m128 and xmm1 for equality.
VEX.128.66.0F.WIG 74 /r B V/V AVX Compare packed bytes in xmm3/m128 and
VPCMPEQB xmm1, xmm2, xmm3/m128 xmm2 for equality.
VEX.128.66.0F.WIG 75 /r B V/V AVX Compare packed words in xmm3/m128 and
VPCMPEQW xmm1, xmm2, xmm3/m128 xmm2 for equality.
VEX.128.66.0F.WIG 76 /r B V/V AVX Compare packed doublewords in xmm3/m128
VPCMPEQD xmm1, xmm2, xmm3/m128 and xmm2 for equality.
VEX.256.66.0F.WIG 74 /r B V/V AVX2 Compare packed bytes in ymm3/m256 and
VPCMPEQB ymm1, ymm2, ymm3/m256 ymm2 for equality.
VEX.256.66.0F.WIG 75 /r B V/V AVX2 Compare packed words in ymm3/m256 and
VPCMPEQW ymm1, ymm2, ymm3/m256 ymm2 for equality.
VEX.256.66.0F.WIG 76 /r B V/V AVX2 Compare packed doublewords in ymm3/m256
VPCMPEQD ymm1, ymm2, ymm3/m256 and ymm2 for equality.

EVEX.128.66.0F.W0 76 /r C V/V (AVX512VL AND Compare Equal between int32 vector xmm2
VPCMPEQD k1 {k2}, xmm2, AVX512F) OR and int32 vector xmm3/m128/m32bcst, and
xmm3/m128/m32bcst AVX10.12 set vector mask k1 to reflect the
zero/nonzero status of each element of the
result, under writemask.
EVEX.256.66.0F.W0 76 /r C V/V (AVX512VL AND Compare Equal between int32 vector ymm2
VPCMPEQD k1 {k2}, ymm2, AVX512F) OR and int32 vector ymm3/m256/m32bcst, and
ymm3/m256/m32bcst AVX10.12 set vector mask k1 to reflect the
zero/nonzero status of each element of the
result, under writemask.
EVEX.512.66.0F.W0 76 /r C V/V AVX512F Compare Equal between int32 vectors in
VPCMPEQD k1 {k2}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m32bcst, and set
zmm3/m512/m32bcst destination k1 according to the comparison
results under writemask k2.
EVEX.128.66.0F.WIG 74 /r D V/V (AVX512VL AND Compare packed bytes in xmm3/m128 and
VPCMPEQB k1 {k2}, xmm2, xmm3 /m128 AVX512BW) OR xmm2 for equality and set vector mask k1 to
AVX10.12 reflect the zero/nonzero status of each
element of the result, under writemask.

EVEX.256.66.0F.WIG 74 /r D V/V (AVX512VL AND Compare packed bytes in ymm3/m256 and
VPCMPEQB k1 {k2}, ymm2, ymm3 /m256 AVX512BW) OR ymm2 for equality and set vector mask k1 to
AVX10.12 reflect the zero/nonzero status of each
element of the result, under writemask.
EVEX.512.66.0F.WIG 74 /r D V/V AVX512BW Compare packed bytes in zmm3/m512 and
VPCMPEQB k1 {k2}, zmm2, zmm3 /m512 OR AVX10.12 zmm2 for equality and set vector mask k1 to
reflect the zero/nonzero status of each
element of the result, under writemask.
EVEX.128.66.0F.WIG 75 /r D V/V (AVX512VL AND Compare packed words in xmm3/m128 and
VPCMPEQW k1 {k2}, xmm2, xmm3 /m128 AVX512BW) OR xmm2 for equality and set vector mask k1 to
AVX10.12 reflect the zero/nonzero status of each
element of the result, under writemask.
EVEX.256.66.0F.WIG 75 /r D V/V (AVX512VL AND Compare packed words in ymm3/m256 and
VPCMPEQW k1 {k2}, ymm2, ymm3 /m256 AVX512BW) OR ymm2 for equality and set vector mask k1 to
AVX10.12 reflect the zero/nonzero status of each
element of the result, under writemask.
EVEX.512.66.0F.WIG 75 /r D V/V AVX512BW Compare packed words in zmm3/m512 and
VPCMPEQW k1 {k2}, zmm2, zmm3 /m512 OR AVX10.12 zmm2 for equality and set vector mask k1 to
reflect the zero/nonzero status of each
element of the result, under writemask.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare for equality of the packed bytes, words, or doublewords in the destination operand (first
operand) and the source operand (second operand). If a pair of data elements is equal, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.
The (V)PCMPEQB instruction compares the corresponding bytes in the destination and source operands; the
(V)PCMPEQW instruction compares the corresponding words in the destination and source operands; and the
(V)PCMPEQD instruction compares the corresponding doublewords in the destination and source operands.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination
register remain unchanged.
VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register
are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPEQD: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.
EVEX encoded VPCMPEQB/W: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand
(first operand) is a mask register updated according to the writemask k2.
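
As an informative sketch (not part of the specification), the following C fragment shows the classic byte-search idiom built on (V)PCMPEQB: compare against a splatted needle, then reduce the all-1s/all-0s byte lanes to a bitmask with PMOVMSKB. The __builtin_ctz used to locate the first set bit is a GCC/Clang builtin, assumed here for brevity:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    const char buf[16] = "find the x here!";
    __m128i data   = _mm_loadu_si128((const __m128i *)buf);
    __m128i needle = _mm_set1_epi8('x');
    __m128i eq     = _mm_cmpeq_epi8(data, needle);   /* FFH where bytes match */
    int mask = _mm_movemask_epi8(eq);                /* one bit per byte lane */
    if (mask != 0)
        printf("first 'x' at offset %d\n", __builtin_ctz(mask));   /* offset 9 */
    return 0;
}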

Operation
PCMPEQB (With 64-bit Operands)
IF DEST[7:0] = SRC[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] = SRC[63:56]
THEN DEST[63:56] := FFH;
ELSE DEST[63:56] := 0; FI;

COMPARE_BYTES_EQUAL (SRC1, SRC2)


IF SRC1[7:0] = SRC2[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
IF SRC1[127:120] = SRC2[127:120]
THEN DEST[127:120] := FFH;
ELSE DEST[127:120] := 0; FI;

COMPARE_WORDS_EQUAL (SRC1, SRC2)


IF SRC1[15:0] = SRC2[15:0]
THEN DEST[15:0] := FFFFH;
ELSE DEST[15:0] := 0; FI;
(* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
IF SRC1[127:112] = SRC2[127:112]
THEN DEST[127:112] := FFFFH;
ELSE DEST[127:112] := 0; FI;

COMPARE_DWORDS_EQUAL (SRC1, SRC2)


IF SRC1[31:0] = SRC2[31:0]
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 0; FI;
(* Continue comparison of 2nd and 3rd 32-bit dwords in SRC1 and SRC2 *)
IF SRC1[127:96] = SRC2[127:96]
THEN DEST[127:96] := FFFFFFFFH;
ELSE DEST[127:96] := 0; FI;

PCMPEQB (With 128-bit Operands)
DEST[127:0] := COMPARE_BYTES_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPEQB (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_BYTES_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] := 0

VPCMPEQB (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_BYTES_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_BYTES_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPEQB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+7:i] = SRC2[i+7:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

PCMPEQW (With 64-bit Operands)


IF DEST[15:0] = SRC[15:0]
THEN DEST[15:0] := FFFFH;
ELSE DEST[15:0] := 0; FI;
(* Continue comparison of 2nd and 3rd words in DEST and SRC *)
IF DEST[63:48] = SRC[63:48]
THEN DEST[63:48] := FFFFH;
ELSE DEST[63:48] := 0; FI;

PCMPEQW (With 128-bit Operands)


DEST[127:0] := COMPARE_WORDS_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPEQW (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_WORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] := 0

VPCMPEQW (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_WORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_WORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0



VPCMPEQW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+15:i] = SRC2[i+15:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

PCMPEQD (With 64-bit Operands)


IF DEST[31:0] = SRC[31:0]
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 0; FI;
IF DEST[63:32] = SRC[63:32]
THEN DEST[63:32] := FFFFFFFFH;
ELSE DEST[63:32] := 0; FI;

PCMPEQD (With 128-bit Operands)


DEST[127:0] := COMPARE_DWORDS_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPEQD (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_DWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] := 0

VPCMPEQD (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_DWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_DWORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPEQD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+31:i] = SRC2[31:0];
ELSE CMP := SRC1[i+31:i] = SRC2[i+31:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR



DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPCMPEQB __mmask64 _mm512_cmpeq_epi8_mask(__m512i a, __m512i b);
VPCMPEQB __mmask64 _mm512_mask_cmpeq_epi8_mask(__mmask64 k, __m512i a, __m512i b);
VPCMPEQB __mmask32 _mm256_cmpeq_epi8_mask(__m256i a, __m256i b);
VPCMPEQB __mmask32 _mm256_mask_cmpeq_epi8_mask(__mmask32 k, __m256i a, __m256i b);
VPCMPEQB __mmask16 _mm_cmpeq_epi8_mask(__m128i a, __m128i b);
VPCMPEQB __mmask16 _mm_mask_cmpeq_epi8_mask(__mmask16 k, __m128i a, __m128i b);
VPCMPEQW __mmask32 _mm512_cmpeq_epi16_mask(__m512i a, __m512i b);
VPCMPEQW __mmask32 _mm512_mask_cmpeq_epi16_mask(__mmask32 k, __m512i a, __m512i b);
VPCMPEQW __mmask16 _mm256_cmpeq_epi16_mask(__m256i a, __m256i b);
VPCMPEQW __mmask16 _mm256_mask_cmpeq_epi16_mask(__mmask16 k, __m256i a, __m256i b);
VPCMPEQW __mmask8 _mm_cmpeq_epi16_mask(__m128i a, __m128i b);
VPCMPEQW __mmask8 _mm_mask_cmpeq_epi16_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPEQD __mmask16 _mm512_cmpeq_epi32_mask( __m512i a, __m512i b);
VPCMPEQD __mmask16 _mm512_mask_cmpeq_epi32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPEQD __mmask8 _mm256_cmpeq_epi32_mask(__m256i a, __m256i b);
VPCMPEQD __mmask8 _mm256_mask_cmpeq_epi32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPEQD __mmask8 _mm_cmpeq_epi32_mask(__m128i a, __m128i b);
VPCMPEQD __mmask8 _mm_mask_cmpeq_epi32_mask(__mmask8 k, __m128i a, __m128i b);
PCMPEQB __m64 _mm_cmpeq_pi8 (__m64 m1, __m64 m2)
PCMPEQW __m64 _mm_cmpeq_pi16 (__m64 m1, __m64 m2)
PCMPEQD __m64 _mm_cmpeq_pi32 (__m64 m1, __m64 m2)
(V)PCMPEQB __m128i _mm_cmpeq_epi8 ( __m128i a, __m128i b)
(V)PCMPEQW __m128i _mm_cmpeq_epi16 ( __m128i a, __m128i b)
(V)PCMPEQD __m128i _mm_cmpeq_epi32 ( __m128i a, __m128i b)
VPCMPEQB __m256i _mm256_cmpeq_epi8 ( __m256i a, __m256i b)
VPCMPEQW __m256i _mm256_cmpeq_epi16 ( __m256i a, __m256i b)
VPCMPEQD __m256i _mm256_cmpeq_epi32 ( __m256i a, __m256i b)
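
A short usage sketch, assuming a C compiler providing the SSE2 intrinsics in <emmintrin.h>: the all-1s/all-0s
element masks produced by the compare are typically condensed into one bit per byte lane with PMOVMSKB.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    __m128i b = _mm_setr_epi8(1, 0, 3, 0, 5, 0, 7, 0, 9, 0, 11, 0, 13, 0, 15, 0);
    __m128i eq = _mm_cmpeq_epi8(a, b);       /* 0xFF where bytes are equal, 0x00 elsewhere */
    int mask = _mm_movemask_epi8(eq);        /* one bit per byte lane */
    printf("equal lanes: 0x%04x\n", mask);   /* prints 0x5555: even lanes match */
    return 0;
}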

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPEQD, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPCMPEQB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



PCMPEQQ—Compare Packed Qword Data for Equal
66 0F 38 29 /r | PCMPEQQ xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_1
Compare packed qwords in xmm2/m128 and xmm1 for equality.

VEX.128.66.0F38.WIG 29 /r | VPCMPEQQ xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Compare packed quadwords in xmm3/m128 and xmm2 for equality.

VEX.256.66.0F38.WIG 29 /r | VPCMPEQQ ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Compare packed quadwords in ymm3/m256 and ymm2 for equality.

EVEX.128.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, xmm2, xmm3/m128/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compare Equal between int64 vector xmm2 and int64 vector xmm3/m128/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.256.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, ymm2, ymm3/m256/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compare Equal between int64 vector ymm2 and int64 vector ymm3/m256/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.512.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, zmm2, zmm3/m512/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512F OR AVX10.1 (see note 1)
Compare Equal between int64 vector zmm2 and int64 vector zmm3/m512/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare for equality of the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If a pair of data elements is equal, the corresponding data element in the desti-
nation is set to all 1s; otherwise, it is set to all 0s.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination
register remain unchanged.
VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register
are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPEQQ: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.



Operation
PCMPEQQ (With 128-bit Operands)
IF (DEST[63:0] = SRC[63:0])
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0; FI;
IF (DEST[127:64] = SRC[127:64])
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0; FI;
DEST[MAXVL-1:128] (Unmodified)

COMPARE_QWORDS_EQUAL (SRC1, SRC2)


IF SRC1[63:0] = SRC2[63:0]
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0; FI;
IF SRC1[127:64] = SRC2[127:64]
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0; FI;

VPCMPEQQ (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_QWORDS_EQUAL(SRC1,SRC2)
DEST[MAXVL-1:128] := 0

VPCMPEQQ (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_QWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_QWORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPEQQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+63:i] = SRC2[63:0];
ELSE CMP := SRC1[i+63:i] = SRC2[i+63:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPCMPEQQ __mmask8 _mm512_cmpeq_epi64_mask( __m512i a, __m512i b);
VPCMPEQQ __mmask8 _mm512_mask_cmpeq_epi64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPEQQ __mmask8 _mm256_cmpeq_epi64_mask( __m256i a, __m256i b);
VPCMPEQQ __mmask8 _mm256_mask_cmpeq_epi64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPEQQ __mmask8 _mm_cmpeq_epi64_mask( __m128i a, __m128i b);
VPCMPEQQ __mmask8 _mm_mask_cmpeq_epi64_mask(__mmask8 k, __m128i a, __m128i b);
(V)PCMPEQQ __m128i _mm_cmpeq_epi64(__m128i a, __m128i b);
VPCMPEQQ __m256i _mm256_cmpeq_epi64( __m256i a, __m256i b);
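
A brief sketch, assuming a C compiler with AVX-512F support and <immintrin.h>, of how the EVEX form's result
lands in a mask register rather than a vector:

#include <immintrin.h>   /* AVX-512F intrinsics */
#include <stdio.h>

int main(void) {
    __m512i a = _mm512_set1_epi64(7);
    __m512i b = _mm512_setr_epi64(7, 0, 7, 0, 7, 0, 7, 0);
    __mmask8 k = _mm512_cmpeq_epi64_mask(a, b);   /* bit j set where qword lane j is equal */
    printf("k = 0x%02x\n", (unsigned)k);          /* prints 0x55 */
    return 0;
}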

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPEQQ, see Table 2-51, “Type E4 Class Exception Conditions.”



PCMPGTB/PCMPGTW/PCMPGTD—Compare Packed Signed Integers for Greater Than
NP 0F 64 /r (see note 1) | PCMPGTB mm, mm/m64
Op/En: A | 64/32-bit Mode: V/V | CPUID: MMX
Compare packed signed byte integers in mm and mm/m64 for greater than.

66 0F 64 /r | PCMPGTB xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE2
Compare packed signed byte integers in xmm1 and xmm2/m128 for greater than.

NP 0F 65 /r (see note 1) | PCMPGTW mm, mm/m64
Op/En: A | 64/32-bit Mode: V/V | CPUID: MMX
Compare packed signed word integers in mm and mm/m64 for greater than.

66 0F 65 /r | PCMPGTW xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE2
Compare packed signed word integers in xmm1 and xmm2/m128 for greater than.

NP 0F 66 /r (see note 1) | PCMPGTD mm, mm/m64
Op/En: A | 64/32-bit Mode: V/V | CPUID: MMX
Compare packed signed doubleword integers in mm and mm/m64 for greater than.

66 0F 66 /r | PCMPGTD xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE2
Compare packed signed doubleword integers in xmm1 and xmm2/m128 for greater than.

VEX.128.66.0F.WIG 64 /r | VPCMPGTB xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than.

VEX.128.66.0F.WIG 65 /r | VPCMPGTW xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Compare packed signed word integers in xmm2 and xmm3/m128 for greater than.

VEX.128.66.0F.WIG 66 /r | VPCMPGTD xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Compare packed signed doubleword integers in xmm2 and xmm3/m128 for greater than.

VEX.256.66.0F.WIG 64 /r | VPCMPGTB ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Compare packed signed byte integers in ymm2 and ymm3/m256 for greater than.

VEX.256.66.0F.WIG 65 /r | VPCMPGTW ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Compare packed signed word integers in ymm2 and ymm3/m256 for greater than.

VEX.256.66.0F.WIG 66 /r | VPCMPGTD ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Compare packed signed doubleword integers in ymm2 and ymm3/m256 for greater than.

EVEX.128.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, xmm2, xmm3/m128/m32bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 2)
Compare Greater between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.256.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, ymm2, ymm3/m256/m32bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 2)
Compare Greater between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.512.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, zmm2, zmm3/m512/m32bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512F OR AVX10.1 (see note 2)
Compare Greater between int32 elements in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to
the comparison results, under writemask k2.

EVEX.128.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, xmm2, xmm3/m128
Op/En: D | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.256.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, ymm2, ymm3/m256
Op/En: D | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Compare packed signed byte integers in ymm2 and ymm3/m256 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.512.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, zmm2, zmm3/m512
Op/En: D | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 2)
Compare packed signed byte integers in zmm2 and zmm3/m512 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.128.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, xmm2, xmm3/m128
Op/En: D | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Compare packed signed word integers in xmm2 and xmm3/m128 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.256.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, ymm2, ymm3/m256
Op/En: D | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Compare packed signed word integers in ymm2 and ymm3/m256 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.512.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, zmm2, zmm3/m512
Op/En: D | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 2)
Compare packed signed word integers in zmm2 and zmm3/m512 for greater than, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of
Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD signed compare for the greater value of the packed byte, word, or doubleword integers in the
destination operand (first operand) and the source operand (second operand). If a data element in the destination
operand is greater than the corresponding data element in the source operand, the corresponding data element in
the destination operand is set to all 1s; otherwise, it is set to all 0s.
The PCMPGTB instruction compares the corresponding signed byte integers in the destination and source oper-
ands; the PCMPGTW instruction compares the corresponding signed word integers in the destination and source
operands; and the PCMPGTD instruction compares the corresponding signed doubleword integers in the destina-
tion and source operands.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.



VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
register are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPGTD: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.
EVEX encoded VPCMPGTB/W: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand
(first operand) is a mask register updated according to the writemask k2.

Operation
PCMPGTB (With 64-bit Operands)
IF DEST[7:0] > SRC[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] > SRC[63:56]
THEN DEST[63:56] := FFH;
ELSE DEST[63:56] := 0; FI;

COMPARE_BYTES_GREATER (SRC1, SRC2)


IF SRC1[7:0] > SRC2[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
IF SRC1[127:120] > SRC2[127:120]
THEN DEST[127:120] := FFH;
ELSE DEST[127:120] := 0; FI;

COMPARE_WORDS_GREATER (SRC1, SRC2)


IF SRC1[15:0] > SRC2[15:0]
THEN DEST[15:0] := FFFFH;
ELSE DEST[15:0] := 0; FI;
(* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
IF SRC1[127:112] > SRC2[127:112]
THEN DEST[127:112] := FFFFH;
ELSE DEST[127:112] := 0; FI;

COMPARE_DWORDS_GREATER (SRC1, SRC2)


IF SRC1[31:0] > SRC2[31:0]
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 0; FI;
(* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)
IF SRC1[127:96] > SRC2[127:96]
THEN DEST[127:96] := FFFFFFFFH;
ELSE DEST[127:96] := 0; FI;

PCMPGTB (With 128-bit Operands)


DEST[127:0] := COMPARE_BYTES_GREATER(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)



VPCMPGTB (VEX.128 Encoded Version)
DEST[127:0] := COMPARE_BYTES_GREATER(SRC1,SRC2)
DEST[MAXVL-1:128] := 0

VPCMPGTB (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_BYTES_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_BYTES_GREATER(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPGTB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k2[j] OR *no writemask*
THEN
/* signed comparison */
CMP := SRC1[i+7:i] > SRC2[i+7:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

PCMPGTW (With 64-bit Operands)


IF DEST[15:0] > SRC[15:0]
THEN DEST[15:0] := FFFFH;
ELSE DEST[15:0] := 0; FI;
(* Continue comparison of 2nd and 3rd words in DEST and SRC *)
IF DEST[63:48] > SRC[63:48]
THEN DEST[63:48] := FFFFH;
ELSE DEST[63:48] := 0; FI;

PCMPGTW (With 128-bit Operands)


DEST[127:0] := COMPARE_WORDS_GREATER(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPGTW (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_WORDS_GREATER(SRC1,SRC2)
DEST[MAXVL-1:128] := 0

VPCMPGTW (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_WORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_WORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0



VPCMPGTW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k2[j] OR *no writemask*
THEN
/* signed comparison */
CMP := SRC1[i+15:i] > SRC2[i+15:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

PCMPGTD (With 64-bit Operands)


IF DEST[31:0] > SRC[31:0]
THEN DEST[31:0] := FFFFFFFFH;
ELSE DEST[31:0] := 0; FI;
IF DEST[63:32] > SRC[63:32]
THEN DEST[63:32] := FFFFFFFFH;
ELSE DEST[63:32] := 0; FI;

PCMPGTD (With 128-bit Operands)


DEST[127:0] := COMPARE_DWORDS_GREATER(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPGTD (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_DWORDS_GREATER(SRC1,SRC2)
DEST[MAXVL-1:128] := 0

VPCMPGTD (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_DWORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_DWORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPGTD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k2[j] OR *no writemask*
THEN
/* signed comparison */
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+31:i] > SRC2[31:0];
ELSE CMP := SRC1[i+31:i] > SRC2[i+31:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR



DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPCMPGTB __mmask64 _mm512_cmpgt_epi8_mask(__m512i a, __m512i b);
VPCMPGTB __mmask64 _mm512_mask_cmpgt_epi8_mask(__mmask64 k, __m512i a, __m512i b);
VPCMPGTB __mmask32 _mm256_cmpgt_epi8_mask(__m256i a, __m256i b);
VPCMPGTB __mmask32 _mm256_mask_cmpgt_epi8_mask(__mmask32 k, __m256i a, __m256i b);
VPCMPGTB __mmask16 _mm_cmpgt_epi8_mask(__m128i a, __m128i b);
VPCMPGTB __mmask16 _mm_mask_cmpgt_epi8_mask(__mmask16 k, __m128i a, __m128i b);
VPCMPGTD __mmask16 _mm512_cmpgt_epi32_mask(__m512i a, __m512i b);
VPCMPGTD __mmask16 _mm512_mask_cmpgt_epi32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPGTD __mmask8 _mm256_cmpgt_epi32_mask(__m256i a, __m256i b);
VPCMPGTD __mmask8 _mm256_mask_cmpgt_epi32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPGTD __mmask8 _mm_cmpgt_epi32_mask(__m128i a, __m128i b);
VPCMPGTD __mmask8 _mm_mask_cmpgt_epi32_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPGTW __mmask32 _mm512_cmpgt_epi16_mask(__m512i a, __m512i b);
VPCMPGTW __mmask32 _mm512_mask_cmpgt_epi16_mask(__mmask32 k, __m512i a, __m512i b);
VPCMPGTW __mmask16 _mm256_cmpgt_epi16_mask(__m256i a, __m256i b);
VPCMPGTW __mmask16 _mm256_mask_cmpgt_epi16_mask(__mmask16 k, __m256i a, __m256i b);
VPCMPGTW __mmask8 _mm_cmpgt_epi16_mask(__m128i a, __m128i b);
VPCMPGTW __mmask8 _mm_mask_cmpgt_epi16_mask(__mmask8 k, __m128i a, __m128i b);
PCMPGTB __m64 _mm_cmpgt_pi8 (__m64 m1, __m64 m2)
PCMPGTW __m64 _mm_cmpgt_pi16 (__m64 m1, __m64 m2)
PCMPGTD __m64 _mm_cmpgt_pi32 (__m64 m1, __m64 m2)
(V)PCMPGTB __m128i _mm_cmpgt_epi8 ( __m128i a, __m128i b)
(V)PCMPGTW __m128i _mm_cmpgt_epi16 ( __m128i a, __m128i b)
(V)PCMPGTD __m128i _mm_cmpgt_epi32 ( __m128i a, __m128i b)
VPCMPGTB __m256i _mm256_cmpgt_epi8 ( __m256i a, __m256i b)
VPCMPGTW __m256i _mm256_cmpgt_epi16 ( __m256i a, __m256i b)
VPCMPGTD __m256i _mm256_cmpgt_epi32 ( __m256i a, __m256i b)
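
Because the comparison is signed, bytes with the high bit set compare as negative values. A small C sketch,
assuming SSE2 support and <emmintrin.h>, that makes the point:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8((char)0x80);   /* 0x80 is -128 as a signed byte */
    __m128i b = _mm_set1_epi8(1);
    int gt = _mm_movemask_epi8(_mm_cmpgt_epi8(a, b));
    printf("0x%04x\n", gt);                  /* prints 0x0000: -128 > 1 is false in every lane */
    return 0;
}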

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPGTD, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPCMPGTB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



PCMPGTQ—Compare Packed Data for Greater Than
66 0F 38 37 /r | PCMPGTQ xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_2
Compare packed signed qwords in xmm2/m128 and xmm1 for greater than.

VEX.128.66.0F38.WIG 37 /r | VPCMPGTQ xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Compare packed signed qwords in xmm2 and xmm3/m128 for greater than.

VEX.256.66.0F38.WIG 37 /r | VPCMPGTQ ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Compare packed signed qwords in ymm2 and ymm3/m256 for greater than.

EVEX.128.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, xmm2, xmm3/m128/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compare Greater between int64 vector xmm2 and int64 vector xmm3/m128/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.256.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, ymm2, ymm3/m256/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compare Greater between int64 vector ymm2 and int64 vector ymm3/m256/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

EVEX.512.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, zmm2, zmm3/m512/m64bcst
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512F OR AVX10.1 (see note 1)
Compare Greater between int64 vector zmm2 and int64 vector zmm3/m512/m64bcst, and set vector mask k1 to
reflect the zero/nonzero status of each element of the result, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD signed compare for the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If the data element in the first (destination) operand is greater than the
corresponding element in the second (source) operand, the corresponding data element in the destination is set
to all 1s; otherwise, it is set to all 0s.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
register are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPGTQ: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.



Operation
COMPARE_QWORDS_GREATER (SRC1, SRC2)
IF SRC1[63:0] > SRC2[63:0]
THEN DEST[63:0] := FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] := 0; FI;
IF SRC1[127:64] > SRC2[127:64]
THEN DEST[127:64] := FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] := 0; FI;

VPCMPGTQ (VEX.128 Encoded Version)


DEST[127:0] := COMPARE_QWORDS_GREATER(SRC1,SRC2)
DEST[MAXVL-1:128] := 0

VPCMPGTQ (VEX.256 Encoded Version)


DEST[127:0] := COMPARE_QWORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] := COMPARE_QWORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] := 0

VPCMPGTQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k2[j] OR *no writemask*
THEN
/* signed comparison */
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+63:i] > SRC2[63:0];
ELSE CMP := SRC1[i+63:i] > SRC2[i+63:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPCMPGTQ __mmask8 _mm512_cmpgt_epi64_mask( __m512i a, __m512i b);
VPCMPGTQ __mmask8 _mm512_mask_cmpgt_epi64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPGTQ __mmask8 _mm256_cmpgt_epi64_mask( __m256i a, __m256i b);
VPCMPGTQ __mmask8 _mm256_mask_cmpgt_epi64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPGTQ __mmask8 _mm_cmpgt_epi64_mask( __m128i a, __m128i b);
VPCMPGTQ __mmask8 _mm_mask_cmpgt_epi64_mask(__mmask8 k, __m128i a, __m128i b);
(V)PCMPGTQ __m128i _mm_cmpgt_epi64(__m128i a, __m128i b)
VPCMPGTQ __m256i _mm256_cmpgt_epi64( __m256i a, __m256i b);
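
PCMPGTQ compares signed quadwords only. A common idiom, sketched here in C assuming SSE4.2 support and
<nmmintrin.h>, derives an unsigned 64-bit comparison by flipping the sign bit of both operands first:

#include <nmmintrin.h>   /* SSE4.2 intrinsics */

/* Unsigned 64-bit greater-than built on the signed PCMPGTQ (hypothetical helper). */
static __m128i cmpgt_epu64(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                           _mm_xor_si128(b, bias));
}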

Flags Affected
None.

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPGTQ, see Table 2-51, “Type E4 Class Exception Conditions.”



PEXTRB/PEXTRD/PEXTRQ—Extract Byte/Dword/Qword
66 0F 3A 14 /r ib | PEXTRB reg/m8, xmm2, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_1
Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper
bits of r32 or r64 are zeroed.

66 0F 3A 16 /r ib | PEXTRD r/m32, xmm2, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_1
Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r/m32.

66 REX.W 0F 3A 16 /r ib | PEXTRQ r/m64, xmm2, imm8
Op/En: A | 64/32-bit Mode: V/N.E. | CPUID: SSE4_1
Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r/m64.

VEX.128.66.0F3A.W0 14 /r ib | VPEXTRB reg/m8, xmm2, imm8
Op/En: A | 64/32-bit Mode: V (see note 1)/V | CPUID: AVX
Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper
bits of r64/r32 are filled with zeros.

VEX.128.66.0F3A.W0 16 /r ib | VPEXTRD r32/m32, xmm2, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: AVX
Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.

VEX.128.66.0F3A.W1 16 /r ib | VPEXTRQ r64/m64, xmm2, imm8
Op/En: A | 64/32-bit Mode: V/I (see note 2) | CPUID: AVX
Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.

EVEX.128.66.0F3A.WIG 14 /r ib | VPEXTRB reg/m8, xmm2, imm8
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 3)
Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper
bits of r64/r32 are filled with zeros.

EVEX.128.66.0F3A.W0 16 /r ib | VPEXTRD r32/m32, xmm2, imm8
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX512DQ OR AVX10.1 (see note 3)
Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.

EVEX.128.66.0F3A.W1 16 /r ib | VPEXTRQ r64/m64, xmm2, imm8
Op/En: B | 64/32-bit Mode: V/N.E. (see note 2) | CPUID: AVX512DQ OR AVX10.1 (see note 3)
Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.

NOTES:
1. In 64-bit mode, VEX.W1 is ignored for VPEXTRB (similar to legacy REX.W=1 prefix in PEXTRB).
2. VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
B Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
Extract a byte/dword/qword integer value from the source XMM register at a byte/dword/qword offset determined
from imm8[3:0]. The destination can be a register or byte/dword/qword memory location. If the destination is a
register, the upper bits of the register are zero extended.
In the legacy non-VEX encoded version, if the destination operand is a register, the default operand size in 64-bit
mode for PEXTRB/PEXTRD is 64 bits, the bits above the least significant byte/dword data are filled with zeros.
PEXTRQ is not encodable in non-64-bit modes and requires REX.W in 64-bit mode.
Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the
instruction will #UD. In EVEX.128 encoded versions, EVEX.vvvv is reserved and must be 1111b, EVEX.L’L must be
0, otherwise the instruction will #UD. If the destination operand is a register, the default operand size in 64-bit
mode for VPEXTRB/VPEXTRD is 64 bits; the bits above the least significant byte/dword data are filled with
zeros.

Operation
CASE of
PEXTRB: SEL := COUNT[3:0];
TEMP := (Src >> SEL*8) AND FFH;
IF (DEST = Mem8)
THEN
Mem8 := TEMP[7:0];
ELSE IF (64-Bit Mode and 64-bit register selected)
THEN
R64[7:0] := TEMP[7:0];
R64[63:8] := ZERO_FILL;
ELSE
R32[7:0] := TEMP[7:0];
R32[31:8] := ZERO_FILL;
FI;
PEXTRD: SEL := COUNT[1:0];
TEMP := (Src >> SEL*32) AND FFFF_FFFFH;
DEST := TEMP;
PEXTRQ: SEL := COUNT[0];
TEMP := (Src >> SEL*64);
DEST := TEMP;
ESAC;

VPEXTRD/VPEXTRQ
IF (64-Bit Mode and 64-bit dest operand)
THEN
Src_Offset := imm8[0]
r64/m64 := (Src >> Src_Offset * 64)
ELSE
Src_Offset := imm8[1:0]
r32/m32 := ((Src >> Src_Offset *32) AND 0FFFFFFFFh);
FI

VPEXTRB (dest = m8)
SRC_Offset := imm8[3:0]
Mem8 := (Src >> Src_Offset*8)

VPEXTRB (dest = reg)
IF (64-Bit Mode )
THEN
SRC_Offset := imm8[3:0]
DEST[7:0] := ((Src >> Src_Offset*8) AND 0FFh)
DEST[63:8] := ZERO_FILL;
ELSE
SRC_Offset := imm8[3:0];
DEST[7:0] := ((Src >> Src_Offset*8) AND 0FFh);
DEST[31:8] := ZERO_FILL;
FI



Intel C/C++ Compiler Intrinsic Equivalent
PEXTRB int _mm_extract_epi8 (__m128i src, const int ndx);
PEXTRD int _mm_extract_epi32 (__m128i src, const int ndx);
PEXTRQ __int64 _mm_extract_epi64 (__m128i src, const int ndx);
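
A minimal C sketch, assuming SSE4.1 support and <smmintrin.h>; note that the offset argument must be a
compile-time constant, since it is encoded in the imm8 byte:

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi32(10, 20, 30, 40);
    int x = _mm_extract_epi32(v, 2);   /* dword at offset 2 */
    printf("%d\n", x);                 /* prints 30 */
    return 0;
}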

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.
If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



PEXTRW—Extract Word
NP 0F C5 /r ib (see note 1) | PEXTRW reg, mm, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE
Extract the word specified by imm8 from mm and move it to reg, bits 15:0. The upper bits of r32 or r64 are
zeroed.

66 0F C5 /r ib | PEXTRW reg, xmm, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE2
Extract the word specified by imm8 from xmm and move it to reg, bits 15:0. The upper bits of r32 or r64 are
zeroed.

66 0F 3A 15 /r ib | PEXTRW reg/m16, xmm, imm8
Op/En: B | 64/32-bit Mode: V/V | CPUID: SSE4_1
Extract the word specified by imm8 from xmm and copy it to the lowest 16 bits of reg or m16. Zero-extend the
result in the destination, r32 or r64.

VEX.128.66.0F.W0 C5 /r ib | VPEXTRW reg, xmm1, imm8
Op/En: A | 64/32-bit Mode: V (see note 2)/V | CPUID: AVX
Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. Zero-extend the result. The upper
bits of r64/r32 are filled with zeros.

VEX.128.66.0F3A.W0 15 /r ib | VPEXTRW reg/m16, xmm2, imm8
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper
bits of r64/r32 are filled with zeros.

EVEX.128.66.0F.WIG C5 /r ib | VPEXTRW reg, xmm1, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 3)
Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. Zero-extend the result. The upper
bits of r64/r32 are filled with zeros.

EVEX.128.66.0F3A.WIG 15 /r ib | VPEXTRW reg/m16, xmm2, imm8
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 3)
Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper
bits of r64/r32 are filled with zeros.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of
Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B.
2. In 64-bit mode, VEX.W1 is ignored for VPEXTRW (similar to legacy REX.W=1 prefix in PEXTRW).
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
C Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
Copies the word in the source operand (second operand) specified by the count operand (third operand) to the
destination operand (first operand). The source operand can be an MMX technology register or an XMM register.
The destination operand can be the low word of a general-purpose register or a 16-bit memory address. The count
operand is an 8-bit immediate. When specifying a word location in an MMX technology register, the 2 least-signifi-
cant bits of the count operand specify the location; for an XMM register, the 3 least-significant bits specify the loca-
tion. The content of the destination register above bit 16 is cleared (set to all 0s).
In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers
(XMM8-XMM15, R8-15). If the destination operand is a general-purpose register, the default operand size is 64 bits
in 64-bit mode.



Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the
instruction will #UD. In EVEX.128 encoded versions, EVEX.vvvv is reserved and must be 1111b, EVEX.L must be 0,
otherwise the instruction will #UD. If the destination operand is a register, the default operand size in 64-bit mode
for VPEXTRW is 64 bits; the bits above the least significant word data are filled with zeros.

Operation
IF (DEST = Mem16)
THEN
SEL := COUNT[2:0];
TEMP := (Src >> SEL*16) AND FFFFH;
Mem16 := TEMP[15:0];
ELSE IF (64-Bit Mode and destination is a general-purpose register)
THEN
FOR (PEXTRW instruction with 64-bit source operand)
{ SEL := COUNT[1:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r64[15:0] := TEMP[15:0];
r64[63:16] := ZERO_FILL; };
FOR (PEXTRW instruction with 128-bit source operand)
{ SEL := COUNT[2:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r64[15:0] := TEMP[15:0];
r64[63:16] := ZERO_FILL; }
ELSE
FOR (PEXTRW instruction with 64-bit source operand)
{ SEL := COUNT[1:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r32[15:0] := TEMP[15:0];
r32[31:16] := ZERO_FILL; };
FOR (PEXTRW instruction with 128-bit source operand)
{ SEL := COUNT[2:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r32[15:0] := TEMP[15:0];
r32[31:16] := ZERO_FILL; };
FI;
FI;

VPEXTRW (dest = m16)
SRC_Offset := imm8[2:0]
Mem16 := (Src >> Src_Offset*16)

VPEXTRW (dest = reg)
IF (64-Bit Mode )
THEN
SRC_Offset := imm8[2:0]
DEST[15:0] := ((Src >> Src_Offset*16) AND 0FFFFh)
DEST[63:16] := ZERO_FILL;
ELSE
SRC_Offset := imm8[2:0]
DEST[15:0] := ((Src >> Src_Offset*16) AND 0FFFFh)
DEST[31:16] := ZERO_FILL;
FI



Intel C/C++ Compiler Intrinsic Equivalent
PEXTRW int _mm_extract_pi16 (__m64 a, int n)
PEXTRW int _mm_extract_epi16 ( __m128i a, int imm)
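
A minimal C sketch, assuming SSE2 support and <emmintrin.h>; the extracted word is zero-extended into the
int return value:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi16(100, 200, 300, 400, 500, 600, 700, 800);
    int w = _mm_extract_epi16(v, 5);   /* word at offset 5, zero-extended */
    printf("%d\n", w);                 /* prints 600 */
    return 0;
}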

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.
If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



PINSRB/PINSRD/PINSRQ—Insert Byte/Dword/Qword
66 0F 3A 20 /r ib | PINSRB xmm1, r32/m8, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_1
Insert a byte integer value from r32/m8 into xmm1 at the destination element in xmm1 specified by imm8.

66 0F 3A 22 /r ib | PINSRD xmm1, r/m32, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE4_1
Insert a dword integer value from r/m32 into xmm1 at the destination element specified by imm8.

66 REX.W 0F 3A 22 /r ib | PINSRQ xmm1, r/m64, imm8
Op/En: A | 64/32-bit Mode: V/N.E. | CPUID: SSE4_1
Insert a qword integer value from r/m64 into xmm1 at the destination element specified by imm8.

VEX.128.66.0F3A.W0 20 /r ib | VPINSRB xmm1, xmm2, r32/m8, imm8
Op/En: B | 64/32-bit Mode: V (see note 1)/V | CPUID: AVX
Merge a byte integer value from r32/m8 and rest from xmm2 into xmm1 at the byte offset in imm8.

VEX.128.66.0F3A.W0 22 /r ib | VPINSRD xmm1, xmm2, r/m32, imm8
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Insert a dword integer value from r32/m32 and rest from xmm2 into xmm1 at the dword offset in imm8.

VEX.128.66.0F3A.W1 22 /r ib | VPINSRQ xmm1, xmm2, r/m64, imm8
Op/En: B | 64/32-bit Mode: V/I (see note 2) | CPUID: AVX
Insert a qword integer value from r64/m64 and rest from xmm2 into xmm1 at the qword offset in imm8.

EVEX.128.66.0F3A.WIG 20 /r ib | VPINSRB xmm1, xmm2, r32/m8, imm8
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 3)
Merge a byte integer value from r32/m8 and rest from xmm2 into xmm1 at the byte offset in imm8.

EVEX.128.66.0F3A.W0 22 /r ib | VPINSRD xmm1, xmm2, r32/m32, imm8
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512DQ OR AVX10.1 (see note 3)
Insert a dword integer value from r32/m32 and rest from xmm2 into xmm1 at the dword offset in imm8.

EVEX.128.66.0F3A.W1 22 /r ib | VPINSRQ xmm1, xmm2, r64/m64, imm8
Op/En: C | 64/32-bit Mode: V/N.E. (see note 2) | CPUID: AVX512DQ OR AVX10.1 (see note 3)
Insert a qword integer value from r64/m64 and rest from xmm2 into xmm1 at the qword offset in imm8.

NOTES:
1. In 64-bit mode, VEX.W1 is ignored for VPINSRB (similar to legacy REX.W=1 prefix with PINSRB).
2. VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Copies a byte/dword/qword from the source operand (second operand) and inserts it in the destination operand
(first operand) at the location specified with the count operand (third operand). (The other elements in the desti-
nation register are left untouched.) The source operand can be a general-purpose register or a memory location.
(When the source operand is a general-purpose register, PINSRB copies the low byte of the register.) The destina-



tion operand is an XMM register. The count operand is an 8-bit immediate. When specifying a qword [dword, byte]
location in an XMM register, the [1, 2, 4] least-significant bit(s) of the count operand specify the location.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15, R8-15). Use of REX.W permits the use of 64-bit general-purpose
registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. VEX.L must be 0, otherwise
the instruction will #UD. Attempt to execute VPINSRQ in non-64-bit mode will cause #UD.
EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. EVEX.L’L must be 0, other-
wise the instruction will #UD.

Operation
CASE OF
PINSRB: SEL := COUNT[3:0];
MASK := (0FFH << (SEL * 8));
TEMP := ((SRC[7:0] << (SEL * 8)) AND MASK);
PINSRD: SEL := COUNT[1:0];
MASK := (0FFFFFFFFH << (SEL * 32));
TEMP := ((SRC << (SEL * 32)) AND MASK);
PINSRQ: SEL := COUNT[0];
MASK := (0FFFFFFFFFFFFFFFFH << (SEL * 64));
TEMP := ((SRC << (SEL * 64)) AND MASK);
ESAC;
DEST := ((DEST AND NOT MASK) OR TEMP);

VPINSRB (VEX/EVEX Encoded Version)


SEL := imm8[3:0]
DEST[127:0] := write_b_element(SEL, SRC2, SRC1)
DEST[MAXVL-1:128] := 0

VPINSRD (VEX/EVEX Encoded Version)


SEL := imm8[1:0]
DEST[127:0] := write_d_element(SEL, SRC2, SRC1)
DEST[MAXVL-1:128] := 0

VPINSRQ (VEX/EVEX Encoded Version)


SEL := imm8[0]
DEST[127:0] := write_q_element(SEL, SRC2, SRC1)
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


PINSRB __m128i _mm_insert_epi8 (__m128i s1, int s2, const int ndx);
PINSRD __m128i _mm_insert_epi32 (__m128i s2, int s, const int ndx);
PINSRQ __m128i _mm_insert_epi64(__m128i s2, __int64 s, const int ndx);
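
A minimal C sketch, assuming SSE4.1 support and <smmintrin.h> (the helper name is illustrative only); the
other elements pass through unchanged, as described above:

#include <smmintrin.h>   /* SSE4.1 intrinsics */

__m128i set_lane3(__m128i v, int x) {
    /* Replace dword element 3; the other three dwords are left untouched. */
    return _mm_insert_epi32(v, x, 3);
}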

Flags Affected
None.

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.



PINSRW—Insert Word
NP 0F C4 /r ib (see note 1) | PINSRW mm, r32/m16, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE
Insert the low word from r32 or from m16 into mm at the word position specified by imm8.

66 0F C4 /r ib | PINSRW xmm, r32/m16, imm8
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSE2
Move the low word of r32 or from m16 into xmm at the word position specified by imm8.

VEX.128.66.0F.W0 C4 /r ib | VPINSRW xmm1, xmm2, r32/m16, imm8
Op/En: B | 64/32-bit Mode: V (see note 2)/V | CPUID: AVX
Insert the word from r32/m16 at the offset indicated by imm8 into the value from xmm2 and store the result in
xmm1.

EVEX.128.66.0F.WIG C4 /r ib | VPINSRW xmm1, xmm2, r32/m16, imm8
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 3)
Insert the word from r32/m16 at the offset indicated by imm8 into the value from xmm2 and store the result in
xmm1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of
Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B.
2. In 64-bit mode, VEX.W1 is ignored for VPINSRW (similar to legacy REX.W=1 prefix in PINSRW).
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Three operand MMX and SSE instructions:
Copies a word from the source operand and inserts it in the destination operand at the location specified with the
count operand. (The other words in the destination register are left untouched.) The source operand can be a
general-purpose register or a 16-bit memory location. (When the source operand is a general-purpose register, the
low word of the register is copied.) The destination operand can be an MMX technology register or an XMM register.
The count operand is an 8-bit immediate. When specifying a word location in an MMX technology register, the 2
least-significant bits of the count operand specify the location; for an XMM register, the 3 least-significant bits
specify the location.
Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.
Four operand AVX and AVX-512 instructions:
Combines a word from the first source operand with the second source operand, and inserts it in the destination
operand at the location specified with the count operand. The second source operand can be a general-purpose
register or a 16-bit memory location. (When the source operand is a general-purpose register, the low word of the
register is copied.) The first source and destination operands are XMM registers. The count operand is an 8-bit
immediate. When specifying a word location, the 3 least-significant bits specify the location.
Bits (MAXVL-1:128) of the destination YMM register are zeroed. VEX.L/EVEX.L’L must be 0, otherwise the instruc-
tion will #UD.



Operation
PINSRW dest, src, imm8 (MMX)
SEL := imm8[1:0]
DEST.word[SEL] := src.word[0]

PINSRW dest, src, imm8 (SSE)


SEL := imm8[2:0]
DEST.word[SEL] := src.word[0]

VPINSRW dest, src1, src2, imm8 (AVX/AVX512)


SEL := imm8[2:0]
DEST := src1
DEST.word[SEL] := src2.word[0]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


PINSRW __m64 _mm_insert_pi16 (__m64 a, int d, int n)
PINSRW __m128i _mm_insert_epi16 ( __m128i a, int b, int imm)
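
A minimal C sketch, assuming SSE2 support and <emmintrin.h> (the helper name is illustrative only):

#include <emmintrin.h>   /* SSE2 intrinsics */

__m128i set_word5(__m128i v, short x) {
    /* Insert x at word offset 5; the remaining seven words are preserved. */
    return _mm_insert_epi16(v, x, 5);
}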

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.



PMADDUBSW—Multiply and Add Packed Signed and Unsigned Bytes
NP 0F 38 04 /r (see note 1) | PMADDUBSW mm1, mm2/m64
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSSE3
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to mm1.

66 0F 38 04 /r | PMADDUBSW xmm1, xmm2/m128
Op/En: A | 64/32-bit Mode: V/V | CPUID: SSSE3
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.

VEX.128.66.0F38.WIG 04 /r | VPMADDUBSW xmm1, xmm2, xmm3/m128
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.

VEX.256.66.0F38.WIG 04 /r | VPMADDUBSW ymm1, ymm2, ymm3/m256
Op/En: B | 64/32-bit Mode: V/V | CPUID: AVX2
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to ymm1.

EVEX.128.66.0F38.WIG 04 /r | VPMADDUBSW xmm1 {k1}{z}, xmm2, xmm3/m128
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1
under writemask k1.

EVEX.256.66.0F38.WIG 04 /r | VPMADDUBSW ymm1 {k1}{z}, ymm2, ymm3/m256
Op/En: C | 64/32-bit Mode: V/V | CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2)
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to ymm1
under writemask k1.

EVEX.512.66.0F38.WIG 04 /r | VPMADDUBSW zmm1 {k1}{z}, zmm2, zmm3/m512
Op/En: C | 64/32-bit Mode: V/V | CPUID: AVX512BW OR AVX10.1 (see note 2)
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to zmm1
under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of
Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector
options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf
enumerates the maximum supported vector width and as such will determine the set of instructions available to
the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
(V)PMADDUBSW multiplies vertically each unsigned byte of the destination operand (first operand) with the corre-
sponding signed byte of the source operand (second operand), producing intermediate signed 16-bit integers.
Each adjacent pair of signed words is added and the saturated result is packed to the destination operand. For
example, the lowest-order bytes (bits 7-0) in the source and destination operands are multiplied and the interme-
diate signed word result is added with the corresponding intermediate result from the 2nd lowest-order bytes (bits
15-8) of the operands; the sign-saturated result is stored in the lowest word of the destination register (15-0). The
same operation is performed on the other pairs of adjacent bytes. Both operands can be MMX registers or XMM
registers. When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte
boundary or a general-protection exception (#GP) will be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 and EVEX.128 encoded versions: The first source and destination operands are XMM registers. The
second source operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding
destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand can be a YMM register or a 256-bit memory
location. The first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding
ZMM register are zeroed.
EVEX.512 encoded version: The second source operand can be a ZMM register or a 512-bit memory location. The
first source and destination operands are ZMM registers.

Operation
PMADDUBSW (With 64-bit Operands)
DEST[15:0] := SaturateToSignedWord(SRC[15:8]*DEST[15:8] + SRC[7:0]*DEST[7:0]);
DEST[31:16] := SaturateToSignedWord(SRC[31:24]*DEST[31:24] + SRC[23:16]*DEST[23:16]);
DEST[47:32] := SaturateToSignedWord(SRC[47:40]*DEST[47:40] + SRC[39:32]*DEST[39:32]);
DEST[63:48] := SaturateToSignedWord(SRC[63:56]*DEST[63:56] + SRC[55:48]*DEST[55:48]);

PMADDUBSW (With 128-bit Operands)


DEST[15:0] := SaturateToSignedWord(SRC[15:8]*DEST[15:8] + SRC[7:0]*DEST[7:0]);
// Repeat operation for 2nd through 7th word
DEST[127:112] := SaturateToSignedWord(SRC[127:120]*DEST[127:120] + SRC[119:112]*DEST[119:112]);

VPMADDUBSW (VEX.128 Encoded Version)


DEST[15:0] := SaturateToSignedWord(SRC2[15:8]* SRC1[15:8]+SRC2[7:0]*SRC1[7:0])
// Repeat operation for 2nd through 7th word
DEST[127:112] := SaturateToSignedWord(SRC2[127:120]*SRC1[127:120]+ SRC2[119:112]* SRC1[119:112])
DEST[MAXVL-1:128] := 0

VPMADDUBSW (VEX.256 Encoded Version)


DEST[15:0] := SaturateToSignedWord(SRC2[15:8]* SRC1[15:8]+SRC2[7:0]*SRC1[7:0])
// Repeat operation for 2nd through 15th word
DEST[255:240] := SaturateToSignedWord(SRC2[255:248]*SRC1[255:248]+ SRC2[247:240]* SRC1[247:240])
DEST[MAXVL-1:256] := 0

VPMADDUBSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToSignedWord(SRC2[i+15:i+8]* SRC1[i+15:i+8] + SRC2[i+7:i]*SRC1[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents
VPMADDUBSW __m512i _mm512_maddubs_epi16( __m512i a, __m512i b);
VPMADDUBSW __m512i _mm512_mask_maddubs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMADDUBSW __m512i _mm512_maskz_maddubs_epi16( __mmask32 k, __m512i a, __m512i b);
VPMADDUBSW __m256i _mm256_mask_maddubs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMADDUBSW __m256i _mm256_maskz_maddubs_epi16( __mmask16 k, __m256i a, __m256i b);
VPMADDUBSW __m128i _mm_mask_maddubs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMADDUBSW __m128i _mm_maskz_maddubs_epi16( __mmask8 k, __m128i a, __m128i b);
PMADDUBSW __m64 _mm_maddubs_pi16 (__m64 a, __m64 b)
(V)PMADDUBSW __m128i _mm_maddubs_epi16 (__m128i a, __m128i b)
VPMADDUBSW __m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

PMADDWD—Multiply and Add Packed Integers
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F F5 /r¹ A V/V MMX Multiply the packed words in mm by the packed
PMADDWD mm, mm/m64 words in mm/m64, add adjacent doubleword
results, and store in mm.
66 0F F5 /r A V/V SSE2 Multiply the packed word integers in xmm1 by
PMADDWD xmm1, xmm2/m128 the packed word integers in xmm2/m128, add
adjacent doubleword results, and store in
xmm1.
VEX.128.66.0F.WIG F5 /r B V/V AVX Multiply the packed word integers in xmm2 by
VPMADDWD xmm1, xmm2, xmm3/m128 the packed word integers in xmm3/m128, add
adjacent doubleword results, and store in
xmm1.
VEX.256.66.0F.WIG F5 /r B V/V AVX2 Multiply the packed word integers in ymm2 by
VPMADDWD ymm1, ymm2, ymm3/m256 the packed word integers in ymm3/m256, add
adjacent doubleword results, and store in
ymm1.
EVEX.128.66.0F.WIG F5 /r C V/V (AVX512VL AND Multiply the packed word integers in xmm2 by
VPMADDWD xmm1 {k1}{z}, xmm2, AVX512BW) OR the packed word integers in xmm3/m128, add
xmm3/m128 AVX10.1² adjacent doubleword results, and store in
xmm1 under writemask k1.
EVEX.256.66.0F.WIG F5 /r C V/V (AVX512VL AND Multiply the packed word integers in ymm2 by
VPMADDWD ymm1 {k1}{z}, ymm2, AVX512BW) OR the packed word integers in ymm3/m256, add
ymm3/m256 AVX10.1² adjacent doubleword results, and store in
ymm1 under writemask k1.
EVEX.512.66.0F.WIG F5 /r C V/V AVX512BW Multiply the packed word integers in zmm2 by
VPMADDWD zmm1 {k1}{z}, zmm2, OR AVX10.1² the packed word integers in zmm3/m512, add
zmm3/m512 adjacent doubleword results, and store in
zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the individual signed words of the destination operand (first operand) by the corresponding signed words
of the source operand (second operand), producing temporary signed, doubleword results. The adjacent double-
word results are then summed and stored in the destination operand. For example, the corresponding low-order
words (15-0) and (31-16) in the source and destination operands are multiplied by one another and the double-
word results are added together and stored in the low doubleword of the destination register (31-0). The same
operation is performed on the other pairs of adjacent words. (Figure 4-11 shows this operation when using 64-bit
operands).
The (V)PMADDWD instruction wraps around only in one situation: when the 2 pairs of words being operated on in
a group are all 8000H. In this case, the result wraps around to 80000000H.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version: The first source and destination operands are MMX registers. The second source operand is an
MMX register or a 64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX.512 encoded version: The second source operand can be a ZMM register or a 512-bit memory location. The
first source and destination operands are ZMM registers.

SRC    |    X3   |    X2   |    X1   |    X0   |
DEST   |    Y3   |    Y2   |    Y1   |    Y0   |
TEMP   | X3 ∗ Y3 | X2 ∗ Y2 | X1 ∗ Y1 | X0 ∗ Y0 |
DEST   |   (X3∗Y3) + (X2∗Y2)   |   (X1∗Y1) + (X0∗Y0)   |

Figure 4-11. PMADDWD Execution Model Using 64-bit Operands
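
The following editorial C sketch (not part of the architectural definition) demonstrates the single wraparound case described above, using the _mm_madd_epi16 intrinsic listed later in this section; it assumes an SSE2-capable processor.

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_set1_epi16((short)0x8000);   /* every word pair is 8000H, 8000H */
    __m128i r = _mm_madd_epi16(a, a);
    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    /* (-32768 * -32768) + (-32768 * -32768) wraps to 80000000H (INT32_MIN). */
    printf("dword lane 0 = 0x%08X\n", (unsigned)out[0]);
    return 0;
}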

Operation
PMADDWD (With 64-bit Operands)
DEST[31:0] := (DEST[15:0] ∗ SRC[15:0]) + (DEST[31:16] ∗ SRC[31:16]);
DEST[63:32] := (DEST[47:32] ∗ SRC[47:32]) + (DEST[63:48] ∗ SRC[63:48]);

PMADDWD (With 128-bit Operands)


DEST[31:0] := (DEST[15:0] ∗ SRC[15:0]) + (DEST[31:16] ∗ SRC[31:16]);
DEST[63:32] := (DEST[47:32] ∗ SRC[47:32]) + (DEST[63:48] ∗ SRC[63:48]);
DEST[95:64] := (DEST[79:64] ∗ SRC[79:64]) + (DEST[95:80] ∗ SRC[95:80]);
DEST[127:96] := (DEST[111:96] ∗ SRC[111:96]) + (DEST[127:112] ∗ SRC[127:112]);

VPMADDWD (VEX.128 Encoded Version)


DEST[31:0] := (SRC1[15:0] * SRC2[15:0]) + (SRC1[31:16] * SRC2[31:16])
DEST[63:32] := (SRC1[47:32] * SRC2[47:32]) + (SRC1[63:48] * SRC2[63:48])
DEST[95:64] := (SRC1[79:64] * SRC2[79:64]) + (SRC1[95:80] * SRC2[95:80])
DEST[127:96] := (SRC1[111:96] * SRC2[111:96]) + (SRC1[127:112] * SRC2[127:112])
DEST[MAXVL-1:128] := 0

VPMADDWD (VEX.256 Encoded Version)
DEST[31:0] := (SRC1[15:0] * SRC2[15:0]) + (SRC1[31:16] * SRC2[31:16])
DEST[63:32] := (SRC1[47:32] * SRC2[47:32]) + (SRC1[63:48] * SRC2[63:48])
DEST[95:64] := (SRC1[79:64] * SRC2[79:64]) + (SRC1[95:80] * SRC2[95:80])
DEST[127:96] := (SRC1[111:96] * SRC2[111:96]) + (SRC1[127:112] * SRC2[127:112])
DEST[159:128] := (SRC1[143:128] * SRC2[143:128]) + (SRC1[159:144] * SRC2[159:144])
DEST[191:160] := (SRC1[175:160] * SRC2[175:160]) + (SRC1[191:176] * SRC2[191:176])
DEST[223:192] := (SRC1[207:192] * SRC2[207:192]) + (SRC1[223:208] * SRC2[223:208])
DEST[255:224] := (SRC1[239:224] * SRC2[239:224]) + (SRC1[255:240] * SRC2[255:240])
DEST[MAXVL-1:256] := 0

VPMADDWD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := (SRC2[i+31:i+16]* SRC1[i+31:i+16]) + (SRC2[i+15:i]*SRC1[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
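
The merging-masking and zeroing-masking branches in the pseudocode above can be observed from C through the masked intrinsics listed below. This is an editorial sketch; it assumes AVX512BW hardware and a compiler switch such as gcc -mavx512bw.

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m512i a   = _mm512_set1_epi16(3), b = _mm512_set1_epi16(4);
    __m512i old = _mm512_set1_epi32(-1);
    __mmask16 k = 0x00FF;                    /* only dword lanes 0..7 are active */
    __m512i merged = _mm512_mask_madd_epi16(old, k, a, b);  /* inactive lanes keep old */
    __m512i zeroed = _mm512_maskz_madd_epi16(k, a, b);      /* inactive lanes become 0 */
    int m[16], z[16];
    _mm512_storeu_si512(m, merged);
    _mm512_storeu_si512(z, zeroed);
    /* active lane: 3*4 + 3*4 = 24; inactive lane: -1 (merging) vs. 0 (zeroing) */
    printf("lane0=%d lane15(merged)=%d lane15(zeroed)=%d\n", m[0], m[15], z[15]);
    return 0;
}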

Intel C/C++ Compiler Intrinsic Equivalent


VPMADDWD __m512i _mm512_madd_epi16( __m512i a, __m512i b);
VPMADDWD __m512i _mm512_mask_madd_epi16(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMADDWD __m512i _mm512_maskz_madd_epi16( __mmask16 k, __m512i a, __m512i b);
VPMADDWD __m256i _mm256_mask_madd_epi16(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMADDWD __m256i _mm256_maskz_madd_epi16( __mmask8 k, __m256i a, __m256i b);
VPMADDWD __m128i _mm_mask_madd_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMADDWD __m128i _mm_maskz_madd_epi16( __mmask8 k, __m128i a, __m128i b);
PMADDWD __m64 _mm_madd_pi16(__m64 m1, __m64 m2)
(V)PMADDWD __m128i _mm_madd_epi16 ( __m128i a, __m128i b)
VPMADDWD __m256i _mm256_madd_epi16 ( __m256i a, __m256i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

PMAXSB/PMAXSW/PMAXSD/PMAXSQ—Maximum of Packed Signed Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F EE /r¹ A V/V SSE Compare signed word integers in mm2/m64 and
PMAXSW mm1, mm2/m64 mm1 and return maximum values.

66 0F 38 3C /r A V/V SSE4_1 Compare packed signed byte integers in xmm1
PMAXSB xmm1, xmm2/m128 and xmm2/m128 and store packed maximum
values in xmm1.
66 0F EE /r A V/V SSE2 Compare packed signed word integers in
PMAXSW xmm1, xmm2/m128 xmm2/m128 and xmm1 and stores maximum
packed values in xmm1.
66 0F 38 3D /r A V/V SSE4_1 Compare packed signed dword integers in xmm1
PMAXSD xmm1, xmm2/m128 and xmm2/m128 and store packed maximum
values in xmm1.
VEX.128.66.0F38.WIG 3C /r B V/V AVX Compare packed signed byte integers in xmm2
VPMAXSB xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed maximum
values in xmm1.
VEX.128.66.0F.WIG EE /r B V/V AVX Compare packed signed word integers in
VPMAXSW xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 and store packed
maximum values in xmm1.
VEX.128.66.0F38.WIG 3D /r B V/V AVX Compare packed signed dword integers in xmm2
VPMAXSD xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed maximum
values in xmm1.
VEX.256.66.0F38.WIG 3C /r B V/V AVX2 Compare packed signed byte integers in ymm2
VPMAXSB ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed maximum
values in ymm1.
VEX.256.66.0F.WIG EE /r B V/V AVX2 Compare packed signed word integers in
VPMAXSW ymm1, ymm2, ymm3/m256 ymm3/m256 and ymm2 and store packed
maximum values in ymm1.
VEX.256.66.0F38.WIG 3D /r B V/V AVX2 Compare packed signed dword integers in ymm2
VPMAXSD ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed maximum
values in ymm1.
EVEX.128.66.0F38.WIG 3C /r C V/V (AVX512VL AND Compare packed signed byte integers in xmm2
VPMAXSB xmm1{k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed maximum
xmm3/m128 AVX10.1¹ values in xmm1 under writemask k1.
EVEX.256.66.0F38.WIG 3C /r C V/V (AVX512VL AND Compare packed signed byte integers in ymm2
VPMAXSB ymm1{k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed maximum
ymm3/m256 AVX10.1¹ values in ymm1 under writemask k1.
EVEX.512.66.0F38.WIG 3C /r C V/V AVX512BW OR Compare packed signed byte integers in zmm2
VPMAXSB zmm1{k1}{z}, zmm2, AVX10.1¹ and zmm3/m512 and store packed maximum
zmm3/m512 values in zmm1 under writemask k1.
EVEX.128.66.0F.WIG EE /r C V/V (AVX512VL AND Compare packed signed word integers in xmm2
VPMAXSW xmm1{k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed maximum
xmm3/m128 AVX10.1¹ values in xmm1 under writemask k1.
EVEX.256.66.0F.WIG EE /r C V/V (AVX512VL AND Compare packed signed word integers in ymm2
VPMAXSW ymm1{k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed maximum
ymm3/m256 AVX10.1¹ values in ymm1 under writemask k1.
EVEX.512.66.0F.WIG EE /r C V/V AVX512BW OR Compare packed signed word integers in zmm2
VPMAXSW zmm1{k1}{z}, zmm2, AVX10.1¹ and zmm3/m512 and store packed maximum
zmm3/m512 values in zmm1 under writemask k1.

EVEX.128.66.0F38.W0 3D /r D V/V (AVX512VL AND Compare packed signed dword integers in xmm2
VPMAXSD xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m32bcst and store packed
xmm3/m128/m32bcst AVX10.1¹ maximum values in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 3D /r D V/V (AVX512VL AND Compare packed signed dword integers in ymm2
VPMAXSD ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m32bcst and store packed
ymm3/m256/m32bcst AVX10.1¹ maximum values in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 3D /r D V/V AVX512F Compare packed signed dword integers in zmm2
VPMAXSD zmm1 {k1}{z}, zmm2, OR AVX10.1¹ and zmm3/m512/m32bcst and store packed
zmm3/m512/m32bcst maximum values in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 3D /r D V/V (AVX512VL AND Compare packed signed qword integers in xmm2
VPMAXSQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst and store packed
xmm3/m128/m64bcst AVX10.1¹ maximum values in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 3D /r D V/V (AVX512VL AND Compare packed signed qword integers in ymm2
VPMAXSQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst and store packed
ymm3/m256/m64bcst AVX10.1¹ maximum values in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 3D /r D V/V AVX512F Compare packed signed qword integers in zmm2
VPMAXSQ zmm1 {k1}{z}, zmm2, OR AVX10.1¹ and zmm3/m512/m64bcst and store packed
zmm3/m512/m64bcst maximum values in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed signed byte, word, dword or qword integers in the second source operand
and the first source operand and returns the maximum value for each pair of integers to the destination operand.
Legacy SSE version PMAXSW: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding destination
register are zeroed.

EVEX encoded VPMAXSD/Q: The first source operand is a ZMM/YMM/XMM register; The second source operand is
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
EVEX encoded VPMAXSB/W: The first source operand is a ZMM/YMM/XMM register; The second source operand is
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
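
As an editorial illustration of the signed compare (not part of the architectural definition), the following C fragment uses the _mm_max_epi32 intrinsic listed later in this section; it assumes SSE4.1 support (e.g., gcc -msse4.1).

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_setr_epi32(-1, 5, -7, 0);
    __m128i b = _mm_setr_epi32( 1, 3, -9, 2);
    __m128i r = _mm_max_epi32(a, b);         /* per-dword signed maximum */
    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 1 5 -7 2 */
    return 0;
}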

Operation
PMAXSW (64-bit Operands)
IF DEST[15:0] > SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd and 3rd words in source and destination operands *)
IF DEST[63:48] > SRC[63:48] THEN
DEST[63:48] := DEST[63:48];
ELSE
DEST[63:48] := SRC[63:48]; FI;

PMAXSB (128-bit Legacy SSE Version)


IF DEST[7:0] > SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] >SRC[127:120] THEN
DEST[127:120] := DEST[127:120];
ELSE
DEST[127:120] := SRC[127:120]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXSB (VEX.128 Encoded Version)


IF SRC1[7:0] > SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] >SRC2[127:120] THEN
DEST[127:120] := SRC1[127:120];
ELSE
DEST[127:120] := SRC2[127:120]; FI;
DEST[MAXVL-1:128] := 0

VPMAXSB (VEX.256 Encoded Version)


IF SRC1[7:0] > SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] >SRC2[255:248] THEN
DEST[255:248] := SRC1[255:248];
ELSE
DEST[255:248] := SRC2[255:248]; FI;
DEST[MAXVL-1:256] := 0

VPMAXSB (EVEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask* THEN
IF SRC1[i+7:i] > SRC2[i+7:i]
THEN DEST[i+7:i] := SRC1[i+7:i];
ELSE DEST[i+7:i] := SRC2[i+7:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PMAXSW (128-bit Legacy SSE Version)


IF DEST[15:0] >SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] >SRC[127:112] THEN
DEST[127:112] := DEST[127:112];
ELSE
DEST[127:112] := SRC[127:112]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXSW (VEX.128 Encoded Version)


IF SRC1[15:0] > SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] >SRC2[127:112] THEN
DEST[127:112] := SRC1[127:112];
ELSE
DEST[127:112] := SRC2[127:112]; FI;
DEST[MAXVL-1:128] := 0

VPMAXSW (VEX.256 Encoded Version)


IF SRC1[15:0] > SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] >SRC2[255:240] THEN
DEST[255:240] := SRC1[255:240];
ELSE
DEST[255:240] := SRC2[255:240]; FI;
DEST[MAXVL-1:256] := 0

VPMAXSW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask* THEN
IF SRC1[i+15:i] > SRC2[i+15:i]
THEN DEST[i+15:i] := SRC1[i+15:i];
ELSE DEST[i+15:i] := SRC2[i+15:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PMAXSD (128-bit Legacy SSE Version)


IF DEST[31:0] >SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] >SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXSD (VEX.128 Encoded Version)


IF SRC1[31:0] > SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] > SRC2[127:96] THEN
DEST[127:96] := SRC1[127:96];
ELSE
DEST[127:96] := SRC2[127:96]; FI;
DEST[MAXVL-1:128] := 0

VPMAXSD (VEX.256 Encoded Version)


IF SRC1[31:0] > SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] > SRC2[255:224] THEN
DEST[255:224] := SRC1[255:224];
ELSE
DEST[255:224] := SRC2[255:224]; FI;
DEST[MAXVL-1:256] := 0

VPMAXSD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+31:i] > SRC2[31:0]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[31:0];
FI;
ELSE
IF SRC1[i+31:i] > SRC2[i+31:i]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[i+31:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMAXSQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+63:i] > SRC2[63:0]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[63:0];
FI;
ELSE
IF SRC1[i+63:i] > SRC2[i+63:i]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[i+63:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
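
No separate intrinsic selects the m32bcst/m64bcst forms; a compiler targeting AVX512F may fold a broadcast scalar load into the EVEX.b memory operand of VPMAXSD (and analogously VPMAXSQ with {1to8}). The following editorial sketch uses a hypothetical helper name, clamp_to_floor, and the commented encoding is one possibility, not guaranteed compiler output.

#include <immintrin.h>

/* Assumes AVX512F (e.g., gcc -mavx512f -O2). */
__m512i clamp_to_floor(__m512i v, const int *floor_val)
{
    /* An optimizing compiler may encode the broadcast as:
       vpmaxsd zmm0, zmm0, dword ptr [rdi]{1to16}           */
    return _mm512_max_epi32(v, _mm512_set1_epi32(*floor_val));
}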

Intel C/C++ Compiler Intrinsic Equivalent
VPMAXSB __m512i _mm512_max_epi8( __m512i a, __m512i b);
VPMAXSB __m512i _mm512_mask_max_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMAXSB __m512i _mm512_maskz_max_epi8( __mmask64 k, __m512i a, __m512i b);
VPMAXSW __m512i _mm512_max_epi16( __m512i a, __m512i b);
VPMAXSW __m512i _mm512_mask_max_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMAXSW __m512i _mm512_maskz_max_epi16( __mmask32 k, __m512i a, __m512i b);
VPMAXSB __m256i _mm256_mask_max_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMAXSB __m256i _mm256_maskz_max_epi8( __mmask32 k, __m256i a, __m256i b);
VPMAXSW __m256i _mm256_mask_max_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMAXSW __m256i _mm256_maskz_max_epi16( __mmask16 k, __m256i a, __m256i b);
VPMAXSB __m128i _mm_mask_max_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMAXSB __m128i _mm_maskz_max_epi8( __mmask16 k, __m128i a, __m128i b);
VPMAXSW __m128i _mm_mask_max_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXSW __m128i _mm_maskz_max_epi16( __mmask8 k, __m128i a, __m128i b);
VPMAXSD __m256i _mm256_mask_max_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXSD __m256i _mm256_maskz_max_epi32( __mmask8 k, __m256i a, __m256i b);
VPMAXSQ __m256i _mm256_mask_max_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXSQ __m256i _mm256_maskz_max_epi64( __mmask8 k, __m256i a, __m256i b);
VPMAXSD __m128i _mm_mask_max_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXSD __m128i _mm_maskz_max_epi32( __mmask8 k, __m128i a, __m128i b);
VPMAXSQ __m128i _mm_mask_max_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXSQ __m128i _mm_maskz_max_epi64( __mmask8 k, __m128i a, __m128i b);
VPMAXSD __m512i _mm512_max_epi32( __m512i a, __m512i b);
VPMAXSD __m512i _mm512_mask_max_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMAXSD __m512i _mm512_maskz_max_epi32( __mmask16 k, __m512i a, __m512i b);
VPMAXSQ __m512i _mm512_max_epi64( __m512i a, __m512i b);
VPMAXSQ __m512i _mm512_mask_max_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMAXSQ __m512i _mm512_maskz_max_epi64( __mmask8 k, __m512i a, __m512i b);
(V)PMAXSB __m128i _mm_max_epi8 ( __m128i a, __m128i b);
(V)PMAXSW __m128i _mm_max_epi16 ( __m128i a, __m128i b)
(V)PMAXSD __m128i _mm_max_epi32 ( __m128i a, __m128i b);
VPMAXSB __m256i _mm256_max_epi8 ( __m256i a, __m256i b);
VPMAXSW __m256i _mm256_max_epi16 ( __m256i a, __m256i b)
VPMAXSD __m256i _mm256_max_epi32 ( __m256i a, __m256i b);
PMAXSW __m64 _mm_max_pi16(__m64 a, __m64 b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPMAXSD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPMAXSB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMAXUB/PMAXUW—Maximum of Packed Unsigned Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F DE /r¹ A V/V SSE Compare unsigned byte integers in mm2/m64 and
PMAXUB mm1, mm2/m64 mm1 and returns maximum values.

66 0F DE /r A V/V SSE2 Compare packed unsigned byte integers in xmm1
PMAXUB xmm1, xmm2/m128 and xmm2/m128 and store packed maximum
values in xmm1.
66 0F 38 3E/r A V/V SSE4_1 Compare packed unsigned word integers in
PMAXUW xmm1, xmm2/m128 xmm2/m128 and xmm1 and stores maximum
packed values in xmm1.
VEX.128.66.0F DE /r B V/V AVX Compare packed unsigned byte integers in xmm2
VPMAXUB xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed maximum
values in xmm1.
VEX.128.66.0F38 3E/r B V/V AVX Compare packed unsigned word integers in
VPMAXUW xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 and store maximum
packed values in xmm1.
VEX.256.66.0F DE /r B V/V AVX2 Compare packed unsigned byte integers in ymm2
VPMAXUB ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed maximum
values in ymm1.
VEX.256.66.0F38 3E/r B V/V AVX2 Compare packed unsigned word integers in
VPMAXUW ymm1, ymm2, ymm3/m256 ymm3/m256 and ymm2 and store maximum
packed values in ymm1.
EVEX.128.66.0F.WIG DE /r C V/V (AVX512VL AND Compare packed unsigned byte integers in xmm2
VPMAXUB xmm1{k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed maximum
xmm3/m128 AVX10.1² values in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DE /r C V/V (AVX512VL AND Compare packed unsigned byte integers in ymm2
VPMAXUB ymm1{k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed maximum
ymm3/m256 AVX10.1² values in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DE /r C V/V AVX512BW Compare packed unsigned byte integers in zmm2
VPMAXUB zmm1{k1}{z}, zmm2, OR AVX10.1² and zmm3/m512 and store packed maximum
zmm3/m512 values in zmm1 under writemask k1.
EVEX.128.66.0F38.WIG 3E /r C V/V (AVX512VL AND Compare packed unsigned word integers in xmm2
VPMAXUW xmm1{k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed maximum
xmm3/m128 AVX10.1² values in xmm1 under writemask k1.
EVEX.256.66.0F38.WIG 3E /r C V/V (AVX512VL AND Compare packed unsigned word integers in ymm2
VPMAXUW ymm1{k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed maximum
ymm3/m256 AVX10.1² values in ymm1 under writemask k1.
EVEX.512.66.0F38.WIG 3E /r C V/V AVX512BW Compare packed unsigned word integers in zmm2
VPMAXUW zmm1{k1}{z}, zmm2, OR AVX10.12 and zmm3/m512 and store packed maximum
zmm3/m512 values in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first
source operand and returns the maximum value for each pair of integers to the destination operand.
Legacy SSE version PMAXUB: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
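
The unsigned compare differs from PMAXSB/PMAXSW exactly at values with the sign bit set. This editorial C sketch (not part of the architectural definition) uses the _mm_max_epu8 intrinsic listed later in this section; SSE2 support is assumed.

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_set1_epi8((char)0xFF);   /* 255 as unsigned, but -1 as signed */
    __m128i b = _mm_set1_epi8(1);
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, _mm_max_epu8(a, b));
    printf("unsigned max = %u\n", out[0]);   /* 255; a signed maximum would pick 1 */
    return 0;
}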

Operation
PMAXUB (64-bit Operands)
IF DEST[7:0] > SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 7th bytes in source and destination operands *)
IF DEST[63:56] > SRC[63:56] THEN
DEST[63:56] := DEST[63:56];
ELSE
DEST[63:56] := SRC[63:56]; FI;

PMAXUB (128-bit Legacy SSE Version)


IF DEST[7:0] >SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] >SRC[127:120] THEN
DEST[127:120] := DEST[127:120];
ELSE
DEST[127:120] := SRC[127:120]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXUB (VEX.128 Encoded Version)
IF SRC1[7:0] >SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] >SRC2[127:120] THEN
DEST[127:120] := SRC1[127:120];
ELSE
DEST[127:120] := SRC2[127:120]; FI;
DEST[MAXVL-1:128] := 0

VPMAXUB (VEX.256 Encoded Version)


IF SRC1[7:0] >SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] >SRC2[255:248] THEN
DEST[255:248] := SRC1[255:248];
ELSE
DEST[255:248] := SRC2[255:248]; FI;
DEST[MAXVL-1:256] := 0

VPMAXUB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask* THEN
IF SRC1[i+7:i] > SRC2[i+7:i]
THEN DEST[i+7:i] := SRC1[i+7:i];
ELSE DEST[i+7:i] := SRC2[i+7:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PMAXUW (128-bit Legacy SSE Version)


IF DEST[15:0] >SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] >SRC[127:112] THEN
DEST[127:112] := DEST[127:112];
ELSE
DEST[127:112] := SRC[127:112]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXUW (VEX.128 Encoded Version)
IF SRC1[15:0] > SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] >SRC2[127:112] THEN
DEST[127:112] := SRC1[127:112];
ELSE
DEST[127:112] := SRC2[127:112]; FI;
DEST[MAXVL-1:128] := 0

VPMAXUW (VEX.256 Encoded Version)


IF SRC1[15:0] > SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] >SRC2[255:240] THEN
DEST[255:240] := SRC1[255:240];
ELSE
DEST[255:240] := SRC2[255:240]; FI;
DEST[MAXVL-1:256] := 0

VPMAXUW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask* THEN
IF SRC1[i+15:i] > SRC2[i+15:i]
THEN DEST[i+15:i] := SRC1[i+15:i];
ELSE DEST[i+15:i] := SRC2[i+15:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMAXUB __m512i _mm512_max_epu8( __m512i a, __m512i b);
VPMAXUB __m512i _mm512_mask_max_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMAXUB __m512i _mm512_maskz_max_epu8( __mmask64 k, __m512i a, __m512i b);
VPMAXUW __m512i _mm512_max_epu16( __m512i a, __m512i b);
VPMAXUW __m512i _mm512_mask_max_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMAXUW __m512i _mm512_maskz_max_epu16( __mmask32 k, __m512i a, __m512i b);
VPMAXUB __m256i _mm256_mask_max_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMAXUB __m256i _mm256_maskz_max_epu8( __mmask32 k, __m256i a, __m256i b);
VPMAXUW __m256i _mm256_mask_max_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMAXUW __m256i _mm256_maskz_max_epu16( __mmask16 k, __m256i a, __m256i b);
VPMAXUB __m128i _mm_mask_max_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMAXUB __m128i _mm_maskz_max_epu8( __mmask16 k, __m128i a, __m128i b);
VPMAXUW __m128i _mm_mask_max_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUW __m128i _mm_maskz_max_epu16( __mmask8 k, __m128i a, __m128i b);
(V)PMAXUB __m128i _mm_max_epu8 ( __m128i a, __m128i b);
(V)PMAXUW __m128i _mm_max_epu16 ( __m128i a, __m128i b)
VPMAXUB __m256i _mm256_max_epu8 ( __m256i a, __m256i b);
VPMAXUW __m256i _mm256_max_epu16 ( __m256i a, __m256i b);
PMAXUB __m64 _mm_max_pu8(__m64 a, __m64 b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMAXUD/PMAXUQ—Maximum of Packed Unsigned Integers
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
66 0F 38 3F /r A V/V SSE4_1 Compare packed unsigned dword integers in xmm1
PMAXUD xmm1, xmm2/m128 and xmm2/m128 and store packed maximum
values in xmm1.
VEX.128.66.0F38.WIG 3F /r B V/V AVX Compare packed unsigned dword integers in xmm2
VPMAXUD xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed maximum
values in xmm1.
VEX.256.66.0F38.WIG 3F /r B V/V AVX2 Compare packed unsigned dword integers in ymm2
VPMAXUD ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed maximum
values in ymm1.
EVEX.128.66.0F38.W0 3F /r C V/V (AVX512VL AND Compare packed unsigned dword integers in xmm2
VPMAXUD xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m32bcst and store packed
xmm3/m128/m32bcst AVX10.1¹ maximum values in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 3F /r C V/V (AVX512VL AND Compare packed unsigned dword integers in ymm2
VPMAXUD ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m32bcst and store packed
ymm3/m256/m32bcst AVX10.1¹ maximum values in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 3F /r C V/V AVX512F Compare packed unsigned dword integers in zmm2
VPMAXUD zmm1 {k1}{z}, zmm2, OR AVX10.1¹ and zmm3/m512/m32bcst and store packed
zmm3/m512/m32bcst maximum values in zmm1 under writemask k1.
EVEX.128.66.0F38.W1 3F /r C V/V (AVX512VL AND Compare packed unsigned qword integers in xmm2
VPMAXUQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst and store packed
xmm3/m128/m64bcst AVX10.1¹ maximum values in xmm1 under writemask k1.
EVEX.256.66.0F38.W1 3F /r C V/V (AVX512VL AND Compare packed unsigned qword integers in ymm2
VPMAXUQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst and store packed
ymm3/m256/m64bcst AVX10.1¹ maximum values in ymm1 under writemask k1.
EVEX.512.66.0F38.W1 3F /r C V/V AVX512F Compare packed unsigned qword integers in zmm2
VPMAXUQ zmm1 {k1}{z}, zmm2, OR AVX10.11 and zmm3/m512/m64bcst and store packed
zmm3/m512/m64bcst maximum values in zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed unsigned dword or qword integers in the second source operand and the
first source operand and returns the maximum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The first source operand is a YMM register; The second source operand is a YMM register
or 256-bit memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
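
The editorial C sketch below (not part of the architectural definition) shows the unsigned dword semantics using the _mm_max_epu32 intrinsic listed later in this section; it assumes SSE4.1 support (e.g., gcc -msse4.1).

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_setr_epi32((int)0x80000000, 2, 3, 4);  /* lane 0: 2147483648 unsigned */
    __m128i b = _mm_setr_epi32(1, 7, 1, 9);
    unsigned out[4];
    _mm_storeu_si128((__m128i *)out, _mm_max_epu32(a, b));
    /* A signed maximum would pick 1 in lane 0; the unsigned compare keeps 80000000H. */
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]); /* 2147483648 7 3 9 */
    return 0;
}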

Operation
PMAXUD (128-bit Legacy SSE Version)
IF DEST[31:0] >SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] >SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMAXUD (VEX.128 Encoded Version)


IF SRC1[31:0] > SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] > SRC2[127:96] THEN
DEST[127:96] := SRC1[127:96];
ELSE
DEST[127:96] := SRC2[127:96]; FI;
DEST[MAXVL-1:128] := 0

VPMAXUD (VEX.256 Encoded Version)


IF SRC1[31:0] > SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] > SRC2[255:224] THEN
DEST[255:224] := SRC1[255:224];
ELSE
DEST[255:224] := SRC2[255:224]; FI;
DEST[MAXVL-1:256] := 0

VPMAXUD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+31:i] > SRC2[31:0]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[31:0];
FI;
ELSE
IF SRC1[i+31:i] > SRC2[i+31:i]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[i+31:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPMAXUQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+63:i] > SRC2[63:0]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[63:0];
FI;
ELSE
IF SRC1[i+63:i] > SRC2[i+63:i]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[i+63:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMAXUD __m512i _mm512_max_epu32( __m512i a, __m512i b);
VPMAXUD __m512i _mm512_mask_max_epu32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMAXUD __m512i _mm512_maskz_max_epu32( __mmask16 k, __m512i a, __m512i b);
VPMAXUQ __m512i _mm512_max_epu64( __m512i a, __m512i b);
VPMAXUQ __m512i _mm512_mask_max_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMAXUQ __m512i _mm512_maskz_max_epu64( __mmask8 k, __m512i a, __m512i b);
VPMAXUD __m256i _mm256_mask_max_epu32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXUD __m256i _mm256_maskz_max_epu32( __mmask8 k, __m256i a, __m256i b);
VPMAXUQ __m256i _mm256_mask_max_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXUQ __m256i _mm256_maskz_max_epu64( __mmask8 k, __m256i a, __m256i b);
VPMAXUD __m128i _mm_mask_max_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUD __m128i _mm_maskz_max_epu32( __mmask8 k, __m128i a, __m128i b);
VPMAXUQ __m128i _mm_mask_max_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUQ __m128i _mm_maskz_max_epu64( __mmask8 k, __m128i a, __m128i b);
(V)PMAXUD __m128i _mm_max_epu32 ( __m128i a, __m128i b);
VPMAXUD __m256i _mm256_max_epu32 ( __m256i a, __m256i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PMINSB/PMINSW—Minimum of Packed Signed Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F EA /r¹ A V/V SSE Compare signed word integers in mm2/m64 and
PMINSW mm1, mm2/m64 mm1 and return minimum values.

66 0F 38 38 /r A V/V SSE4_1 Compare packed signed byte integers in xmm1 and
PMINSB xmm1, xmm2/m128 xmm2/m128 and store packed minimum values in
xmm1.
66 0F EA /r A V/V SSE2 Compare packed signed word integers in
PMINSW xmm1, xmm2/m128 xmm2/m128 and xmm1 and store packed
minimum values in xmm1.
VEX.128.66.0F38 38 /r B V/V AVX Compare packed signed byte integers in xmm2 and
VPMINSB xmm1, xmm2, xmm3/m128 xmm3/m128 and store packed minimum values in
xmm1.
VEX.128.66.0F EA /r B V/V AVX Compare packed signed word integers in
VPMINSW xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 and return packed
minimum values in xmm1.
VEX.256.66.0F38 38 /r B V/V AVX2 Compare packed signed byte integers in ymm2 and
VPMINSB ymm1, ymm2, ymm3/m256 ymm3/m256 and store packed minimum values in
ymm1.
VEX.256.66.0F EA /r B V/V AVX2 Compare packed signed word integers in
VPMINSW ymm1, ymm2, ymm3/m256 ymm3/m256 and ymm2 and return packed
minimum values in ymm1.
EVEX.128.66.0F38.WIG 38 /r C V/V (AVX512VL AND Compare packed signed byte integers in xmm2 and
VPMINSB xmm1{k1}{z}, xmm2, AVX512BW) OR xmm3/m128 and store packed minimum values in
xmm3/m128 AVX10.1² xmm1 under writemask k1.
EVEX.256.66.0F38.WIG 38 /r C V/V (AVX512VL AND Compare packed signed byte integers in ymm2 and
VPMINSB ymm1{k1}{z}, ymm2, AVX512BW) OR ymm3/m256 and store packed minimum values in
ymm3/m256 AVX10.1² ymm1 under writemask k1.
EVEX.512.66.0F38.WIG 38 /r C V/V AVX512BW OR Compare packed signed byte integers in zmm2 and
VPMINSB zmm1{k1}{z}, zmm2, AVX10.1² zmm3/m512 and store packed minimum values in
zmm3/m512 zmm1 under writemask k1.
EVEX.128.66.0F.WIG EA /r C V/V (AVX512VL AND Compare packed signed word integers in xmm2
VPMINSW xmm1{k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed minimum
xmm3/m128 AVX10.1² values in xmm1 under writemask k1.
EVEX.256.66.0F.WIG EA /r C V/V (AVX512VL AND Compare packed signed word integers in ymm2
VPMINSW ymm1{k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed minimum
ymm3/m256 AVX10.1² values in ymm1 under writemask k1.
EVEX.512.66.0F.WIG EA /r C V/V AVX512BW OR Compare packed signed word integers in zmm2 and
VPMINSW zmm1{k1}{z}, zmm2, AVX10.1² zmm3/m512 and store packed minimum values in
zmm3/m512 zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed signed byte or word integers in the second source operand and
the first source operand and returns the minimum value for each pair of integers to the destination operand.
Legacy SSE version PMINSW: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
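
As an editorial illustration of the signed compare (not part of the architectural definition), the following C fragment uses the _mm_min_epi8 intrinsic listed later in this section; PMINSB requires SSE4.1 (e.g., gcc -msse4.1).

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_set1_epi8(-5);           /* -5 signed is 251 as an unsigned byte */
    __m128i b = _mm_set1_epi8(3);
    signed char out[16];
    _mm_storeu_si128((__m128i *)out, _mm_min_epi8(a, b));
    printf("signed min = %d\n", out[0]);     /* -5; an unsigned minimum would pick 3 */
    return 0;
}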

Operation
PMINSW (64-bit Operands)
IF DEST[15:0] < SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd and 3rd words in source and destination operands *)
IF DEST[63:48] < SRC[63:48] THEN
DEST[63:48] := DEST[63:48];
ELSE
DEST[63:48] := SRC[63:48]; FI;

PMINSB (128-bit Legacy SSE Version)


IF DEST[7:0] < SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] < SRC[127:120] THEN
DEST[127:120] := DEST[127:120];
ELSE
DEST[127:120] := SRC[127:120]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINSB (VEX.128 Encoded Version)
IF SRC1[7:0] < SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] < SRC2[127:120] THEN
DEST[127:120] := SRC1[127:120];
ELSE
DEST[127:120] := SRC2[127:120]; FI;
DEST[MAXVL-1:128] := 0

VPMINSB (VEX.256 Encoded Version)


IF SRC1[7:0] < SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] < SRC2[255:248] THEN
DEST[255:248] := SRC1[255:248];
ELSE
DEST[255:248] := SRC2[255:248]; FI;
DEST[MAXVL-1:256] := 0

VPMINSB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask* THEN
IF SRC1[i+7:i] < SRC2[i+7:i]
THEN DEST[i+7:i] := SRC1[i+7:i];
ELSE DEST[i+7:i] := SRC2[i+7:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PMINSW (128-bit Legacy SSE Version)


IF DEST[15:0] < SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] < SRC[127:112] THEN
DEST[127:112] := DEST[127:112];
ELSE
DEST[127:112] := SRC[127:112]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINSW (VEX.128 Encoded Version)
IF SRC1[15:0] < SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] < SRC2[127:112] THEN
DEST[127:112] := SRC1[127:112];
ELSE
DEST[127:112] := SRC2[127:112]; FI;
DEST[MAXVL-1:128] := 0

VPMINSW (VEX.256 Encoded Version)


IF SRC1[15:0] < SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] < SRC2[255:240] THEN
DEST[255:240] := SRC1[255:240];
ELSE
DEST[255:240] := SRC2[255:240]; FI;
DEST[MAXVL-1:256] := 0

VPMINSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask* THEN
IF SRC1[i+15:i] < SRC2[i+15:i]
THEN DEST[i+15:i] := SRC1[i+15:i];
ELSE DEST[i+15:i] := SRC2[i+15:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMINSB __m512i _mm512_min_epi8( __m512i a, __m512i b);
VPMINSB __m512i _mm512_mask_min_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMINSB __m512i _mm512_maskz_min_epi8( __mmask64 k, __m512i a, __m512i b);
VPMINSW __m512i _mm512_min_epi16( __m512i a, __m512i b);
VPMINSW __m512i _mm512_mask_min_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMINSW __m512i _mm512_maskz_min_epi16( __mmask32 k, __m512i a, __m512i b);
VPMINSB __m256i _mm256_mask_min_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMINSB __m256i _mm256_maskz_min_epi8( __mmask32 k, __m256i a, __m256i b);
VPMINSW __m256i _mm256_mask_min_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMINSW __m256i _mm256_maskz_min_epi16( __mmask16 k, __m256i a, __m256i b);
VPMINSB __m128i _mm_mask_min_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMINSB __m128i _mm_maskz_min_epi8( __mmask16 k, __m128i a, __m128i b);
VPMINSW __m128i _mm_mask_min_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSW __m128i _mm_maskz_min_epi16( __mmask8 k, __m128i a, __m128i b);
(V)PMINSB __m128i _mm_min_epi8 ( __m128i a, __m128i b);
(V)PMINSW __m128i _mm_min_epi16 ( __m128i a, __m128i b)
VPMINSB __m256i _mm256_min_epi8 ( __m256i a, __m256i b);
VPMINSW __m256i _mm256_min_epi16 ( __m256i a, __m256i b)
PMINSW __m64 _mm_min_pi16 (__m64 a, __m64 b)

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#MF (64-bit operations only) If there is a pending x87 FPU exception.

PMINSD/PMINSQ—Minimum of Packed Signed Integers
Opcode/ Op/E 64/32 bit CPUID Feature Description
Instruction n Mode Flag
Support
66 0F 38 39 /r A V/V SSE4_1 Compare packed signed dword integers in xmm1
PMINSD xmm1, xmm2/m128 and xmm2/m128 and store packed minimum values
in xmm1.
VEX.128.66.0F38.WIG 39 /r B V/V AVX Compare packed signed dword integers in xmm2
VPMINSD xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed minimum values
in xmm1.
VEX.256.66.0F38.WIG 39 /r B V/V AVX2 Compare packed signed dword integers in ymm2
VPMINSD ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed minimum values
in ymm1.
EVEX.128.66.0F38.W0 39 /r C V/V (AVX512VL AND Compare packed signed dword integers in xmm2
VPMINSD xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128 and store packed minimum values
xmm3/m128/m32bcst AVX10.1¹ in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 39 /r C V/V (AVX512VL AND Compare packed signed dword integers in ymm2
VPMINSD ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256 and store packed minimum values
ymm3/m256/m32bcst AVX10.1¹ in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 39 /r C V/V AVX512F Compare packed signed dword integers in zmm2
VPMINSD zmm1 {k1}{z}, zmm2, OR AVX10.1¹ and zmm3/m512/m32bcst and store packed
zmm3/m512/m32bcst minimum values in zmm1 under writemask k1.
EVEX.128.66.0F38.W1 39 /r C V/V (AVX512VL AND Compare packed signed qword integers in xmm2
VPMINSQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128 and store packed minimum values
xmm3/m128/m64bcst AVX10.1¹ in xmm1 under writemask k1.
EVEX.256.66.0F38.W1 39 /r C V/V (AVX512VL AND Compare packed signed qword integers in ymm2
VPMINSQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256 and store packed minimum values
ymm3/m256/m64bcst AVX10.1¹ in ymm1 under writemask k1.
EVEX.512.66.0F38.W1 39 /r C V/V AVX512F Compare packed signed qword integers in zmm2
VPMINSQ zmm1 {k1}{z}, zmm2, OR AVX10.11 and zmm3/m512/m64bcst and store packed
zmm3/m512/m64bcst minimum values in zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed signed dword or qword integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding destination
register are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
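
The conditional update under writemask k1 can be exercised from C with the masked intrinsics listed later in this section. This editorial sketch assumes AVX512F plus AVX512VL hardware and a compiler switch such as gcc -mavx512vl.

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128i a = _mm_setr_epi32(9, -2, 5, 7);
    __m128i b = _mm_setr_epi32(4,  3, 8, 1);
    __mmask8 k = 0x5;                          /* dword lanes 0 and 2 are active */
    __m128i r = _mm_maskz_min_epi32(k, a, b);  /* inactive lanes are zeroed      */
    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 4 0 5 0 */
    return 0;
}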

Operation
PMINSD (128-bit Legacy SSE Version)
IF DEST[31:0] < SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINSD (VEX.128 Encoded Version)


IF SRC1[31:0] < SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] < SRC2[127:96] THEN
DEST[127:96] := SRC1[127:96];
ELSE
DEST[127:96] := SRC2[127:96]; FI;
DEST[MAXVL-1:128] := 0

VPMINSD (VEX.256 Encoded Version)


IF SRC1[31:0] < SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] < SRC2[255:224] THEN
DEST[255:224] := SRC1[255:224];
ELSE
DEST[255:224] := SRC2[255:224]; FI;
DEST[MAXVL-1:256] := 0

VPMINSD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+31:i] < SRC2[31:0]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[31:0];
FI;
ELSE
IF SRC1[i+31:i] < SRC2[i+31:i]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[i+31:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
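
Read as ordinary C, the loop above reduces to the following scalar model. This is an explanatory sketch only: the function name and parameters are ours, and merging-masking is assumed for masked-off elements.

#include <stdint.h>
#include <stdbool.h>

/* Scalar model of VPMINSD: dst/src1 hold kl dword lanes; mem points at
   either kl dwords or, when bcast models EVEX.b, a single dword that is
   broadcast to every lane. k1 is the writemask. */
static void vpminsd_model(int32_t *dst, const int32_t *src1,
                          const int32_t *mem, bool bcast,
                          uint32_t k1, int kl) {
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j)) {
            int32_t s2 = bcast ? mem[0] : mem[j];   /* EVEX.b: element 0 only */
            dst[j] = (src1[j] < s2) ? src1[j] : s2; /* signed minimum */
        }
        /* else: merging-masking leaves dst[j] unchanged
           (zeroing-masking would store 0 instead) */
    }
}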

VPMINSQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+63:i] < SRC2[63:0]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[63:0];
FI;
ELSE
IF SRC1[i+63:i] < SRC2[i+63:i]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[i+63:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMINSD __m512i _mm512_min_epi32( __m512i a, __m512i b);
VPMINSD __m512i _mm512_mask_min_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMINSD __m512i _mm512_maskz_min_epi32( __mmask16 k, __m512i a, __m512i b);
VPMINSQ __m512i _mm512_min_epi64( __m512i a, __m512i b);
VPMINSQ __m512i _mm512_mask_min_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMINSQ __m512i _mm512_maskz_min_epi64( __mmask8 k, __m512i a, __m512i b);
VPMINSD __m256i _mm256_mask_min_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINSD __m256i _mm256_maskz_min_epi32( __mmask8 k, __m256i a, __m256i b);
VPMINSQ __m256i _mm256_mask_min_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINSQ __m256i _mm256_maskz_min_epi64( __mmask8 k, __m256i a, __m256i b);
VPMINSD __m128i _mm_mask_min_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSD __m128i _mm_maskz_min_epi32( __mmask8 k, __m128i a, __m128i b);
VPMINSQ __m128i _mm_mask_min_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSQ __m128i _mm_maskz_min_epi64( __mmask8 k, __m128i a, __m128i b);
(V)PMINSD __m128i _mm_min_epi32 ( __m128i a, __m128i b);
VPMINSD __m256i _mm256_min_epi32 (__m256i a, __m256i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PMINUB/PMINUW—Minimum of Packed Unsigned Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F DA /r1 A V/V SSE Compare unsigned byte integers in mm2/m64 and
PMINUB mm1, mm2/m64 mm1 and returns minimum values.

66 0F DA /r A V/V SSE2 Compare packed unsigned byte integers in xmm1


PMINUB xmm1, xmm2/m128 and xmm2/m128 and store packed minimum values
in xmm1.
66 0F 38 3A/r A V/V SSE4_1 Compare packed unsigned word integers in
PMINUW xmm1, xmm2/m128 xmm2/m128 and xmm1 and store packed minimum
values in xmm1.
VEX.128.66.0F DA /r B V/V AVX Compare packed unsigned byte integers in xmm2
VPMINUB xmm1, xmm2, xmm3/m128 and xmm3/m128 and store packed minimum values
in xmm1.
VEX.128.66.0F38 3A/r B V/V AVX Compare packed unsigned word integers in
VPMINUW xmm1, xmm2, xmm3/m128 xmm3/m128 and xmm2 and return packed
minimum values in xmm1.
VEX.256.66.0F DA /r B V/V AVX2 Compare packed unsigned byte integers in ymm2
VPMINUB ymm1, ymm2, ymm3/m256 and ymm3/m256 and store packed minimum values
in ymm1.
VEX.256.66.0F38 3A/r B V/V AVX2 Compare packed unsigned word integers in
VPMINUW ymm1, ymm2, ymm3/m256 ymm3/m256 and ymm2 and return packed
minimum values in ymm1.
EVEX.128.66.0F DA /r C V/V (AVX512VL AND Compare packed unsigned byte integers in xmm2
VPMINUB xmm1 {k1}{z}, xmm2, AVX512BW) OR and xmm3/m128 and store packed minimum values
xmm3/m128 AVX10.12 in xmm1 under writemask k1.
EVEX.256.66.0F DA /r C V/V (AVX512VL AND Compare packed unsigned byte integers in ymm2
VPMINUB ymm1 {k1}{z}, ymm2, AVX512BW) OR and ymm3/m256 and store packed minimum values
ymm3/m256 AVX10.12 in ymm1 under writemask k1.
EVEX.512.66.0F DA /r C V/V AVX512BW Compare packed unsigned byte integers in zmm2
VPMINUB zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512 and store packed minimum values
zmm3/m512 in zmm1 under writemask k1.
EVEX.128.66.0F38 3A/r C V/V (AVX512VL AND Compare packed unsigned word integers in
VPMINUW xmm1{k1}{z}, xmm2, AVX512BW) OR xmm3/m128 and xmm2 and return packed
xmm3/m128 AVX10.12 minimum values in xmm1 under writemask k1.
EVEX.256.66.0F38 3A/r C V/V (AVX512VL AND Compare packed unsigned word integers in
VPMINUW ymm1{k1}{z}, ymm2, AVX512BW) OR ymm3/m256 and ymm2 and return packed
ymm3/m256 AVX10.12 minimum values in ymm1 under writemask k1.
EVEX.512.66.0F38 3A/r C V/V AVX512BW Compare packed unsigned word integers in
VPMINUW zmm1{k1}{z}, zmm2, OR AVX10.12 zmm3/m512 and zmm2 and return packed
zmm3/m512 minimum values in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width, which in turn determines which of the instructions listed in the above opcode table are available to the programmer.

Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
Legacy SSE version PMINUB: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding destination
register are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
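
The unsigned interpretation is the entire point of these opcodes: a byte of FFH is 255 to PMINUB but -1 to the signed forms. A minimal C illustration (not SDM text; SSE4.1 is assumed for the extract intrinsic):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8((char)0xFF); /* 255 unsigned, -1 signed */
    __m128i b = _mm_set1_epi8(0x01);
    __m128i r = _mm_min_epu8(a, b);        /* (V)PMINUB selects 0x01  */
    printf("0x%02X\n", (unsigned)(_mm_extract_epi8(r, 0) & 0xFF));
    return 0;
}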

Operation
PMINUB (64-bit Operands)
IF DEST[7:0] < SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 7th bytes in source and destination operands *)
IF DEST[63:56] < SRC[63:56] THEN
DEST[63:56] := DEST[63:56];
ELSE
DEST[63:56] := SRC[63:56]; FI;

PMINUB (128-bit Operands)


IF DEST[7:0] < SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] < SRC[127:120] THEN
DEST[127:120] := DEST[127:120];
ELSE
DEST[127:120] := SRC[127:120]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINUB (VEX.128 Encoded Version)
IF SRC1[7:0] < SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] < SRC2[127:120] THEN
DEST[127:120] := SRC1[127:120];
ELSE
DEST[127:120] := SRC2[127:120]; FI;
DEST[MAXVL-1:128] := 0

VPMINUB (VEX.256 Encoded Version)


IF SRC1[7:0] < SRC2[7:0] THEN
DEST[7:0] := SRC1[7:0];
ELSE
DEST[7:0] := SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] < SRC2[255:248] THEN
DEST[255:248] := SRC1[255:248];
ELSE
DEST[255:248] := SRC2[255:248]; FI;
DEST[MAXVL-1:256] := 0

VPMINUB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask* THEN
IF SRC1[i+7:i] < SRC2[i+7:i]
THEN DEST[i+7:i] := SRC1[i+7:i];
ELSE DEST[i+7:i] := SRC2[i+7:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

PMINUW (128-bit Operands)


IF DEST[15:0] < SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] < SRC[127:112] THEN
DEST[127:112] := DEST[127:112];
ELSE
DEST[127:112] := SRC[127:112]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINUW (VEX.128 Encoded Version)
IF SRC1[15:0] < SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] < SRC2[127:112] THEN
DEST[127:112] := SRC1[127:112];
ELSE
DEST[127:112] := SRC2[127:112]; FI;
DEST[MAXVL-1:128] := 0

VPMINUW (VEX.256 Encoded Version)


IF SRC1[15:0] < SRC2[15:0] THEN
DEST[15:0] := SRC1[15:0];
ELSE
DEST[15:0] := SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] < SRC2[255:240] THEN
DEST[255:240] := SRC1[255:240];
ELSE
DEST[255:240] := SRC2[255:240]; FI;
DEST[MAXVL-1:256] := 0

VPMINUW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask* THEN
IF SRC1[i+15:i] < SRC2[i+15:i]
THEN DEST[i+15:i] := SRC1[i+15:i];
ELSE DEST[i+15:i] := SRC2[i+15:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMINUB __m512i _mm512_min_epu8( __m512i a, __m512i b);
VPMINUB __m512i _mm512_mask_min_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMINUB __m512i _mm512_maskz_min_epu8( __mmask64 k, __m512i a, __m512i b);
VPMINUW __m512i _mm512_min_epu16( __m512i a, __m512i b);
VPMINUW __m512i _mm512_mask_min_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMINUW __m512i _mm512_maskz_min_epu16( __mmask32 k, __m512i a, __m512i b);
VPMINUB __m256i _mm256_mask_min_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMINUB __m256i _mm256_maskz_min_epu8( __mmask32 k, __m256i a, __m256i b);
VPMINUW __m256i _mm256_mask_min_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMINUW __m256i _mm256_maskz_min_epu16( __mmask16 k, __m256i a, __m256i b);
VPMINUB __m128i _mm_mask_min_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMINUB __m128i _mm_maskz_min_epu8( __mmask16 k, __m128i a, __m128i b);
VPMINUW __m128i _mm_mask_min_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUW __m128i _mm_maskz_min_epu16( __mmask8 k, __m128i a, __m128i b);
(V)PMINUB __m128i _mm_min_epu8 ( __m128i a, __m128i b);
(V)PMINUW __m128i _mm_min_epu16 ( __m128i a, __m128i b);
VPMINUB __m256i _mm256_min_epu8 ( __m256i a, __m256i b);
VPMINUW __m256i _mm256_min_epu16 ( __m256i a, __m256i b);
PMINUB __m64 _m_min_pu8 (__m64 a, __m64 b);
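
Because the SSE/AVX packed compare instructions are signed-only, PMINUB is the conventional building block for an unsigned comparison: a <= b (unsigned) holds exactly when min(a, b) == a. A sketch (the helper name is ours):

#include <immintrin.h>

/* Returns 0xFF in each byte lane where a <= b, treating bytes as unsigned. */
static inline __m128i cmple_epu8(__m128i a, __m128i b) {
    return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}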

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMINUD/PMINUQ—Minimum of Packed Unsigned Integers
Opcode/ Op/E 64/32 bit CPUID Feature Description
Instruction n Mode Flag
Support
66 0F 38 3B /r A V/V SSE4_1 Compare packed unsigned dword integers in xmm1
PMINUD xmm1, xmm2/m128 and xmm2/m128 and store packed minimum values in
xmm1.
VEX.128.66.0F38.WIG 3B /r B V/V AVX Compare packed unsigned dword integers in xmm2
VPMINUD xmm1, xmm2, and xmm3/m128 and store packed minimum values in
xmm3/m128 xmm1.
VEX.256.66.0F38.WIG 3B /r B V/V AVX2 Compare packed unsigned dword integers in ymm2
VPMINUD ymm1, ymm2, and ymm3/m256 and store packed minimum values in
ymm3/m256 ymm1.
EVEX.128.66.0F38.W0 3B /r C V/V (AVX512VL AND Compare packed unsigned dword integers in xmm2
VPMINUD xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m32bcst and store packed minimum
xmm3/m128/m32bcst AVX10.11 values in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 3B /r C V/V (AVX512VL AND Compare packed unsigned dword integers in ymm2
VPMINUD ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m32bcst and store packed minimum
ymm3/m256/m32bcst AVX10.11 values in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 3B /r C V/V AVX512F Compare packed unsigned dword integers in zmm2
VPMINUD zmm1 {k1}{z}, zmm2, OR AVX10.11 and zmm3/m512/m32bcst and store packed minimum
zmm3/m512/m32bcst values in zmm1 under writemask k1.
EVEX.128.66.0F38.W1 3B /r C V/V (AVX512VL AND Compare packed unsigned qword integers in xmm2
VPMINUQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst and store packed minimum
xmm3/m128/m64bcst AVX10.11 values in xmm1 under writemask k1.
EVEX.256.66.0F38.W1 3B /r C V/V (AVX512VL AND Compare packed unsigned qword integers in ymm2
VPMINUQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst and store packed minimum
ymm3/m256/m64bcst AVX10.11 values in ymm1 under writemask k1.
EVEX.512.66.0F38.W1 3B /r C V/V AVX512F Compare packed unsigned qword integers in zmm2
VPMINUQ zmm1 {k1}{z}, zmm2, OR AVX10.11 and zmm3/m512/m64bcst and store packed minimum
zmm3/m512/m64bcst values in zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width, which in turn determines which of the instructions listed in the above opcode table are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD compare of the packed unsigned dword/qword integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding destination
register are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
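
A common use of the m32bcst form is clamping an array against a scalar ceiling; from intrinsics the scalar is materialized with _mm512_set1_epi32, which compilers can typically fold into the embedded-broadcast encoding. An illustrative sketch (the function name and loop structure are ours; n is assumed to be a multiple of 16):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp every unsigned dword in buf to at most limit, in place. */
void clamp_u32(uint32_t *buf, size_t n, uint32_t limit) {
    __m512i lim = _mm512_set1_epi32((int32_t)limit); /* m32bcst candidate */
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(buf + i);
        _mm512_storeu_si512(buf + i, _mm512_min_epu32(v, lim)); /* VPMINUD */
    }
}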

Operation
PMINUD (128-bit Legacy SSE Version)
PMINUD instruction for 128-bit operands:
IF DEST[31:0] < SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)

VPMINUD (VEX.128 Encoded Version)


VPMINUD instruction for 128-bit operands:
IF SRC1[31:0] < SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] < SRC2[127:96] THEN
DEST[127:96] := SRC1[127:96];
ELSE
DEST[127:96] := SRC2[127:96]; FI;
DEST[MAXVL-1:128] := 0

VPMINUD (VEX.256 Encoded Version)


VPMINUD instruction for 256-bit operands:
IF SRC1[31:0] < SRC2[31:0] THEN
DEST[31:0] := SRC1[31:0];
ELSE
DEST[31:0] := SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] < SRC2[255:224] THEN
DEST[255:224] := SRC1[255:224];
ELSE
DEST[255:224] := SRC2[255:224]; FI;
DEST[MAXVL-1:256] := 0

VPMINUD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+31:i] < SRC2[31:0]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[31:0];
FI;
ELSE
IF SRC1[i+31:i] < SRC2[i+31:i]
THEN DEST[i+31:i] := SRC1[i+31:i];
ELSE DEST[i+31:i] := SRC2[i+31:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPMINUQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
IF SRC1[i+63:i] < SRC2[63:0]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[63:0];
FI;
ELSE
IF SRC1[i+63:i] < SRC2[i+63:i]
THEN DEST[i+63:i] := SRC1[i+63:i];
ELSE DEST[i+63:i] := SRC2[i+63:i];
FI;
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPMINUD __m512i _mm512_min_epu32( __m512i a, __m512i b);
VPMINUD __m512i _mm512_mask_min_epu32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMINUD __m512i _mm512_maskz_min_epu32( __mmask16 k, __m512i a, __m512i b);
VPMINUQ __m512i _mm512_min_epu64( __m512i a, __m512i b);
VPMINUQ __m512i _mm512_mask_min_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMINUQ __m512i _mm512_maskz_min_epu64( __mmask8 k, __m512i a, __m512i b);
VPMINUD __m256i _mm256_mask_min_epu32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINUD __m256i _mm256_maskz_min_epu32( __mmask8 k, __m256i a, __m256i b);
VPMINUQ __m256i _mm256_mask_min_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINUQ __m256i _mm256_maskz_min_epu64( __mmask8 k, __m256i a, __m256i b);
VPMINUD __m128i _mm_mask_min_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUD __m128i _mm_maskz_min_epu32( __mmask8 k, __m128i a, __m128i b);
VPMINUQ __m128i _mm_mask_min_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUQ __m128i _mm_maskz_min_epu64( __mmask8 k, __m128i a, __m128i b);
(V)PMINUD __m128i _mm_min_epu32 ( __m128i a, __m128i b);
VPMINUD __m256i _mm256_min_epu32 ( __m256i a, __m256i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PMOVSX—Packed Move With Sign Extend
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0f 38 20 /r A V/V SSE4_1 Sign extend 8 packed 8-bit integers in the low 8 bytes
PMOVSXBW xmm1, xmm2/m64 of xmm2/m64 to 8 packed 16-bit integers in xmm1.
66 0f 38 21 /r A V/V SSE4_1 Sign extend 4 packed 8-bit integers in the low 4 bytes
PMOVSXBD xmm1, xmm2/m32 of xmm2/m32 to 4 packed 32-bit integers in xmm1.
66 0f 38 22 /r A V/V SSE4_1 Sign extend 2 packed 8-bit integers in the low 2 bytes
PMOVSXBQ xmm1, xmm2/m16 of xmm2/m16 to 2 packed 64-bit integers in xmm1.
66 0f 38 23/r A V/V SSE4_1 Sign extend 4 packed 16-bit integers in the low 8
PMOVSXWD xmm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm1.
66 0f 38 24 /r A V/V SSE4_1 Sign extend 2 packed 16-bit integers in the low 4
PMOVSXWQ xmm1, xmm2/m32 bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm1.
66 0f 38 25 /r A V/V SSE4_1 Sign extend 2 packed 32-bit integers in the low 8
PMOVSXDQ xmm1, xmm2/m64 bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm1.
VEX.128.66.0F38.WIG 20 /r A V/V AVX Sign extend 8 packed 8-bit integers in the low 8 bytes
VPMOVSXBW xmm1, xmm2/m64 of xmm2/m64 to 8 packed 16-bit integers in xmm1.
VEX.128.66.0F38.WIG 21 /r A V/V AVX Sign extend 4 packed 8-bit integers in the low 4 bytes
VPMOVSXBD xmm1, xmm2/m32 of xmm2/m32 to 4 packed 32-bit integers in xmm1.
VEX.128.66.0F38.WIG 22 /r A V/V AVX Sign extend 2 packed 8-bit integers in the low 2 bytes
VPMOVSXBQ xmm1, xmm2/m16 of xmm2/m16 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 23 /r A V/V AVX Sign extend 4 packed 16-bit integers in the low 8
VPMOVSXWD xmm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm1.
VEX.128.66.0F38.WIG 24 /r A V/V AVX Sign extend 2 packed 16-bit integers in the low 4
VPMOVSXWQ xmm1, xmm2/m32 bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm1.
VEX.128.66.0F38.WIG 25 /r A V/V AVX Sign extend 2 packed 32-bit integers in the low 8
VPMOVSXDQ xmm1, xmm2/m64 bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm1.
VEX.256.66.0F38.WIG 20 /r A V/V AVX2 Sign extend 16 packed 8-bit integers in xmm2/m128
VPMOVSXBW ymm1, xmm2/m128 to 16 packed 16-bit integers in ymm1.
VEX.256.66.0F38.WIG 21 /r A V/V AVX2 Sign extend 8 packed 8-bit integers in the low 8 bytes
VPMOVSXBD ymm1, xmm2/m64 of xmm2/m64 to 8 packed 32-bit integers in ymm1.
VEX.256.66.0F38.WIG 22 /r A V/V AVX2 Sign extend 4 packed 8-bit integers in the low 4 bytes
VPMOVSXBQ ymm1, xmm2/m32 of xmm2/m32 to 4 packed 64-bit integers in ymm1.
VEX.256.66.0F38.WIG 23 /r A V/V AVX2 Sign extend 8 packed 16-bit integers in the low 16
VPMOVSXWD ymm1, xmm2/m128 bytes of xmm2/m128 to 8 packed 32-bit integers in
ymm1.
VEX.256.66.0F38.WIG 24 /r A V/V AVX2 Sign extend 4 packed 16-bit integers in the low 8
VPMOVSXWQ ymm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 64-bit integers in
ymm1.
VEX.256.66.0F38.WIG 25 /r A V/V AVX2 Sign extend 4 packed 32-bit integers in the low 16
VPMOVSXDQ ymm1, xmm2/m128 bytes of xmm2/m128 to 4 packed 64-bit integers in
ymm1.

EVEX.128.66.0F38.WIG 20 /r B V/V (AVX512VL AND Sign extend 8 packed 8-bit integers in xmm2/m64 to
VPMOVSXBW xmm1 {k1}{z}, AVX512BW) OR 8 packed 16-bit integers in xmm1.
xmm2/m64 AVX10.11
EVEX.256.66.0F38.WIG 20 /r B V/V (AVX512VL AND Sign extend 16 packed 8-bit integers in xmm2/m128
VPMOVSXBW ymm1 {k1}{z}, AVX512BW) OR to 16 packed 16-bit integers in ymm1.
xmm2/m128 AVX10.11
EVEX.512.66.0F38.WIG 20 /r B V/V AVX512BW Sign extend 32 packed 8-bit integers in ymm2/m256
VPMOVSXBW zmm1 {k1}{z}, OR AVX10.11 to 32 packed 16-bit integers in zmm1.
ymm2/m256
EVEX.128.66.0F38.WIG 21 /r C V/V (AVX512VL AND Sign extend 4 packed 8-bit integers in the low 4 bytes
VPMOVSXBD xmm1 {k1}{z}, AVX512F) OR of xmm2/m32 to 4 packed 32-bit integers in xmm1
xmm2/m32 AVX10.11 subject to writemask k1.
EVEX.256.66.0F38.WIG 21 /r C V/V (AVX512VL AND Sign extend 8 packed 8-bit integers in the low 8 bytes
VPMOVSXBD ymm1 {k1}{z}, AVX512F) OR of xmm2/m64 to 8 packed 32-bit integers in ymm1
xmm2/m64 AVX10.11 subject to writemask k1.
EVEX.512.66.0F38.WIG 21 /r C V/V AVX512F Sign extend 16 packed 8-bit integers in the low 16
VPMOVSXBD zmm1 {k1}{z}, OR AVX10.11 bytes of xmm2/m128 to 16 packed 32-bit integers in
xmm2/m128 zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 22 /r D V/V (AVX512VL AND Sign extend 2 packed 8-bit integers in the low 2 bytes
VPMOVSXBQ xmm1 {k1}{z}, AVX512F) OR of xmm2/m16 to 2 packed 64-bit integers in xmm1
xmm2/m16 AVX10.11 subject to writemask k1.
EVEX.256.66.0F38.WIG 22 /r D V/V (AVX512VL AND Sign extend 4 packed 8-bit integers in the low 4 bytes
VPMOVSXBQ ymm1 {k1}{z}, AVX512F) OR of xmm2/m32 to 4 packed 64-bit integers in ymm1
xmm2/m32 AVX10.11 subject to writemask k1.
EVEX.512.66.0F38.WIG 22 /r D V/V AVX512F Sign extend 8 packed 8-bit integers in the low 8 bytes
VPMOVSXBQ zmm1 {k1}{z}, OR AVX10.11 of xmm2/m64 to 8 packed 64-bit integers in zmm1
xmm2/m64 subject to writemask k1.
EVEX.128.66.0F38.WIG 23 /r B V/V (AVX512VL AND Sign extend 4 packed 16-bit integers in the low 8
VPMOVSXWD xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm2/m64 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 23 /r B V/V (AVX512VL AND Sign extend 8 packed 16-bit integers in the low 16
VPMOVSXWD ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m128 to 8 packed 32-bit integers in
xmm2/m128 AVX10.11 ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 23 /r B V/V AVX512F Sign extend 16 packed 16-bit integers in the low 32
VPMOVSXWD zmm1 {k1}{z}, OR AVX10.11 bytes of ymm2/m256 to 16 packed 32-bit integers in
ymm2/m256 zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 24 /r C V/V (AVX512VL AND Sign extend 2 packed 16-bit integers in the low 4
VPMOVSXWQ xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm2/m32 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 24 /r C V/V (AVX512VL AND Sign extend 4 packed 16-bit integers in the low 8
VPMOVSXWQ ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 4 packed 64-bit integers in
xmm2/m64 AVX10.11 ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 24 /r C V/V AVX512F Sign extend 8 packed 16-bit integers in the low 16
VPMOVSXWQ zmm1 {k1}{z}, OR AVX10.11 bytes of xmm2/m128 to 8 packed 64-bit integers in
xmm2/m128 zmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 25 /r B V/V (AVX512VL AND Sign extend 2 packed 32-bit integers in the low 8
VPMOVSXDQ xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm2/m64 AVX10.11 xmm1 using writemask k1.

EVEX.256.66.0F38.W0 25 /r B V/V (AVX512VL AND Sign extend 4 packed 32-bit integers in the low 16
VPMOVSXDQ ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m128 to 4 packed 64-bit integers in
xmm2/m128 AVX10.11 ymm1 using writemask k1.
EVEX.512.66.0F38.W0 25 /r B V/V AVX512F Sign extend 8 packed 32-bit integers in the low 32
VPMOVSXDQ zmm1 {k1}{z}, OR AVX10.11 bytes of ymm2/m256 to 8 packed 64-bit integers in
ymm2/m256 zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width, which in turn determines which of the instructions listed in the above opcode table are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Half Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Quarter Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Eighth Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Legacy and VEX encoded versions: Packed byte, word, or dword integers in the low bytes of the source operand
(second operand) are sign extended to word, dword, or quadword integers and stored in the destination operand.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the corresponding destination register are
zeroed.
VEX.256 and EVEX.256 encoded versions: Bits (MAXVL-1:256) of the corresponding destination register are
zeroed.
EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the source operand
(second operand) are sign extended to word, dword, or quadword integers and stored to the destination operand
under the writemask. The destination register is an XMM, YMM, or ZMM register.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.
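
A minimal C illustration of the legacy/VEX behavior (not SDM text; SSE4.1 is assumed):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The low 8 bytes of the source are sign extended to 8 words. */
    __m128i bytes = _mm_set1_epi8(-7);        /* each byte = 0xF9 */
    __m128i words = _mm_cvtepi8_epi16(bytes); /* (V)PMOVSXBW      */
    printf("%d\n", (int16_t)_mm_extract_epi16(words, 0)); /* prints -7 */
    return 0;
}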

Operation
Packed_Sign_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] := SignExtend(SRC[7:0]);
DEST[31:16] := SignExtend(SRC[15:8]);
DEST[47:32] := SignExtend(SRC[23:16]);
DEST[63:48] := SignExtend(SRC[31:24]);
DEST[79:64] := SignExtend(SRC[39:32]);
DEST[95:80] := SignExtend(SRC[47:40]);
DEST[111:96] := SignExtend(SRC[55:48]);
DEST[127:112] := SignExtend(SRC[63:56]);

Packed_Sign_Extend_BYTE_to_DWORD(DEST, SRC)
DEST[31:0] := SignExtend(SRC[7:0]);
DEST[63:32] := SignExtend(SRC[15:8]);
DEST[95:64] := SignExtend(SRC[23:16]);
DEST[127:96] := SignExtend(SRC[31:24]);

Packed_Sign_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[7:0]);
DEST[127:64] := SignExtend(SRC[15:8]);

Packed_Sign_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] := SignExtend(SRC[15:0]);
DEST[63:32] := SignExtend(SRC[31:16]);
DEST[95:64] := SignExtend(SRC[47:32]);
DEST[127:96] := SignExtend(SRC[63:48]);

Packed_Sign_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[15:0]);
DEST[127:64] := SignExtend(SRC[31:16]);

Packed_Sign_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[31:0]);
DEST[127:64] := SignExtend(SRC[63:32]);
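
SignExtend above is ordinary two's-complement widening; per element it matches a C cast chain, as in this sketch (the helper name is ours):

#include <stdint.h>

/* Scalar equivalent of one SignExtend step: byte to word. */
static inline int16_t sign_extend_b2w(uint8_t b) {
    return (int16_t)(int8_t)b; /* reinterpret as signed, then widen */
}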

VPMOVSXBW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[383:256], SRC[191:128])
Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXBD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[383:256], SRC[95:64])
Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXBQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[127:0], SRC[15:0])
IF VL >= 256
Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[255:128], SRC[31:16])
FI;
IF VL >= 512
Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[383:256], SRC[47:32])
Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[511:384], SRC[63:48])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXWD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[383:256], SRC[191:128])
Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXWQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[383:256], SRC[95:64])
Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_DWORD_to_QWORD(TEMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Sign_Extend_DWORD_to_QWORD(TEMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Sign_Extend_DWORD_to_QWORD(TEMP_DEST[383:256], SRC[191:128])
Packed_Sign_Extend_DWORD_to_QWORD(TEMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TEMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVSXBW (VEX.256 Encoded Version)


Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_BYTE_to_WORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVSXBD (VEX.256 Encoded Version)


Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
Packed_Sign_Extend_BYTE_to_DWORD(DEST[255:128], SRC[63:32])
DEST[MAXVL-1:256] := 0

VPMOVSXBQ (VEX.256 Encoded Version)


Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
Packed_Sign_Extend_BYTE_to_QWORD(DEST[255:128], SRC[31:16])
DEST[MAXVL-1:256] := 0

VPMOVSXWD (VEX.256 Encoded Version)


Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_WORD_to_DWORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVSXWQ (VEX.256 Encoded Version)


Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
Packed_Sign_Extend_WORD_to_QWORD(DEST[255:128], SRC[63:32])
DEST[MAXVL-1:256] := 0

VPMOVSXDQ (VEX.256 Encoded Version)


Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_DWORD_to_QWORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVSXBW (VEX.128 Encoded Version)
Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

VPMOVSXBD (VEX.128 Encoded Version)


Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

VPMOVSXBQ (VEX.128 Encoded Version)


Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

VPMOVSXWD (VEX.128 Encoded Version)


Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

VPMOVSXWQ (VEX.128 Encoded Version)


Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

VPMOVSXDQ (VEX.128 Encoded Version)


Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] := 0

PMOVSXBW
Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVSXBD
Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVSXBQ
Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVSXWD
Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVSXWQ
Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVSXDQ
Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VPMOVSXBW __m512i _mm512_cvtepi8_epi16(__m256i a);
VPMOVSXBW __m512i _mm512_mask_cvtepi8_epi16(__m512i a, __mmask32 k, __m256i b);
VPMOVSXBW __m512i _mm512_maskz_cvtepi8_epi16( __mmask32 k, __m256i b);
VPMOVSXBD __m512i _mm512_cvtepi8_epi32(__m128i a);
VPMOVSXBD __m512i _mm512_mask_cvtepi8_epi32(__m512i a, __mmask16 k, __m128i b);
VPMOVSXBD __m512i _mm512_maskz_cvtepi8_epi32( __mmask16 k, __m128i b);
VPMOVSXBQ __m512i _mm512_cvtepi8_epi64(__m128i a);
VPMOVSXBQ __m512i _mm512_mask_cvtepi8_epi64(__m512i a, __mmask8 k, __m128i b);
VPMOVSXBQ __m512i _mm512_maskz_cvtepi8_epi64( __mmask8 k, __m128i a);
VPMOVSXDQ __m512i _mm512_cvtepi32_epi64(__m256i a);
VPMOVSXDQ __m512i _mm512_mask_cvtepi32_epi64(__m512i a, __mmask8 k, __m256i b);
VPMOVSXDQ __m512i _mm512_maskz_cvtepi32_epi64( __mmask8 k, __m256i a);
VPMOVSXWD __m512i _mm512_cvtepi16_epi32(__m256i a);
VPMOVSXWD __m512i _mm512_mask_cvtepi16_epi32(__m512i a, __mmask16 k, __m256i b);
VPMOVSXWD __m512i _mm512_maskz_cvtepi16_epi32(__mmask16 k, __m256i a);
VPMOVSXWQ __m512i _mm512_cvtepi16_epi64(__m128i a);
VPMOVSXWQ __m512i _mm512_mask_cvtepi16_epi64(__m512i a, __mmask8 k, __m128i b);
VPMOVSXWQ __m512i _mm512_maskz_cvtepi16_epi64( __mmask8 k, __m128i a);
VPMOVSXBW __m256i _mm256_cvtepi8_epi16(__m128i a);
VPMOVSXBW __m256i _mm256_mask_cvtepi8_epi16(__m256i a, __mmask16 k, __m128i b);
VPMOVSXBW __m256i _mm256_maskz_cvtepi8_epi16( __mmask16 k, __m128i b);
VPMOVSXBD __m256i _mm256_cvtepi8_epi32(__m128i a);
VPMOVSXBD __m256i _mm256_mask_cvtepi8_epi32(__m256i a, __mmask8 k, __m128i b);
VPMOVSXBD __m256i _mm256_maskz_cvtepi8_epi32( __mmask8 k, __m128i b);
VPMOVSXBQ __m256i _mm256_cvtepi8_epi64(__m128i a);
VPMOVSXBQ __m256i _mm256_mask_cvtepi8_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVSXBQ __m256i _mm256_maskz_cvtepi8_epi64( __mmask8 k, __m128i a);
VPMOVSXDQ __m256i _mm256_cvtepi32_epi64(__m128i a);
VPMOVSXDQ __m256i _mm256_mask_cvtepi32_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVSXDQ __m256i _mm256_maskz_cvtepi32_epi64( __mmask8 k, __m128i a);
VPMOVSXWD __m256i _mm256_cvtepi16_epi32(__m128i a);
VPMOVSXWD __m256i _mm256_mask_cvtepi16_epi32(__m256i a, __mmask8 k, __m128i b);
VPMOVSXWD __m256i _mm256_maskz_cvtepi16_epi32(__mmask8 k, __m128i a);
VPMOVSXWQ __m256i _mm256_cvtepi16_epi64(__m128i a);
VPMOVSXWQ __m256i _mm256_mask_cvtepi16_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVSXWQ __m256i _mm256_maskz_cvtepi16_epi64( __mmask8 k, __m128i a);
VPMOVSXBW __m128i _mm_mask_cvtepi8_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBW __m128i _mm_maskz_cvtepi8_epi16( __mmask8 k, __m128i b);
VPMOVSXBD __m128i _mm_mask_cvtepi8_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBD __m128i _mm_maskz_cvtepi8_epi32( __mmask8 k, __m128i b);
VPMOVSXBQ __m128i _mm_mask_cvtepi8_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBQ __m128i _mm_maskz_cvtepi8_epi64( __mmask8 k, __m128i a);
VPMOVSXDQ __m128i _mm_mask_cvtepi32_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXDQ __m128i _mm_maskz_cvtepi32_epi64( __mmask8 k, __m128i a);
VPMOVSXWD __m128i _mm_mask_cvtepi16_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVSXWD __m128i _mm_maskz_cvtepi16_epi32(__mmask8 k, __m128i a);
VPMOVSXWQ __m128i _mm_mask_cvtepi16_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXWQ __m128i _mm_maskz_cvtepi16_epi64( __mmask8 k, __m128i a);
PMOVSXBW __m128i _mm_cvtepi8_epi16( __m128i a);
PMOVSXBD __m128i _mm_cvtepi8_epi32( __m128i a);
PMOVSXBQ __m128i _mm_cvtepi8_epi64( __m128i a);
PMOVSXWD __m128i _mm_cvtepi16_epi32( __m128i a);
PMOVSXWQ __m128i _mm_cvtepi16_epi64( __m128i a);
PMOVSXDQ __m128i _mm_cvtepi32_epi64( __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.

PMOVZX—Packed Move With Zero Extend
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0f 38 30 /r A V/V SSE4_1 Zero extend 8 packed 8-bit integers in the low 8
PMOVZXBW xmm1, xmm2/m64 bytes of xmm2/m64 to 8 packed 16-bit integers in
xmm1.
66 0f 38 31 /r A V/V SSE4_1 Zero extend 4 packed 8-bit integers in the low 4
PMOVZXBD xmm1, xmm2/m32 bytes of xmm2/m32 to 4 packed 32-bit integers in
xmm1.
66 0f 38 32 /r A V/V SSE4_1 Zero extend 2 packed 8-bit integers in the low 2
PMOVZXBQ xmm1, xmm2/m16 bytes of xmm2/m16 to 2 packed 64-bit integers in
xmm1.
66 0f 38 33 /r A V/V SSE4_1 Zero extend 4 packed 16-bit integers in the low 8
PMOVZXWD xmm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm1.
66 0f 38 34 /r A V/V SSE4_1 Zero extend 2 packed 16-bit integers in the low 4
PMOVZXWQ xmm1, xmm2/m32 bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm1.
66 0f 38 35 /r A V/V SSE4_1 Zero extend 2 packed 32-bit integers in the low 8
PMOVZXDQ xmm1, xmm2/m64 bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm1.
VEX.128.66.0F38.WIG 30 /r A V/V AVX Zero extend 8 packed 8-bit integers in the low 8
VPMOVZXBW xmm1, xmm2/m64 bytes of xmm2/m64 to 8 packed 16-bit integers in
xmm1.
VEX.128.66.0F38.WIG 31 /r A V/V AVX Zero extend 4 packed 8-bit integers in the low 4
VPMOVZXBD xmm1, xmm2/m32 bytes of xmm2/m32 to 4 packed 32-bit integers in
xmm1.
VEX.128.66.0F38.WIG 32 /r A V/V AVX Zero extend 2 packed 8-bit integers in the low 2
VPMOVZXBQ xmm1, xmm2/m16 bytes of xmm2/m16 to 2 packed 64-bit integers in
xmm1.
VEX.128.66.0F38.WIG 33 /r A V/V AVX Zero extend 4 packed 16-bit integers in the low 8
VPMOVZXWD xmm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm1.
VEX.128.66.0F38.WIG 34 /r A V/V AVX Zero extend 2 packed 16-bit integers in the low 4
VPMOVZXWQ xmm1, xmm2/m32 bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm1.
VEX.128.66.0F 38.WIG 35 /r A V/V AVX Zero extend 2 packed 32-bit integers in the low 8
VPMOVZXDQ xmm1, xmm2/m64 bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm1.
VEX.256.66.0F38.WIG 30 /r A V/V AVX2 Zero extend 16 packed 8-bit integers in
VPMOVZXBW ymm1, xmm2/m128 xmm2/m128 to 16 packed 16-bit integers in ymm1.
VEX.256.66.0F38.WIG 31 /r A V/V AVX2 Zero extend 8 packed 8-bit integers in the low 8
VPMOVZXBD ymm1, xmm2/m64 bytes of xmm2/m64 to 8 packed 32-bit integers in
ymm1.
VEX.256.66.0F38.WIG 32 /r A V/V AVX2 Zero extend 4 packed 8-bit integers in the low 4
VPMOVZXBQ ymm1, xmm2/m32 bytes of xmm2/m32 to 4 packed 64-bit integers in
ymm1.
VEX.256.66.0F38.WIG 33 /r A V/V AVX2 Zero extend 8 packed 16-bit integers xmm2/m128
VPMOVZXWD ymm1, xmm2/m128 to 8 packed 32-bit integers in ymm1.

VEX.256.66.0F38.WIG 34 /r A V/V AVX2 Zero extend 4 packed 16-bit integers in the low 8
VPMOVZXWQ ymm1, xmm2/m64 bytes of xmm2/m64 to 4 packed 64-bit integers in
ymm1.
VEX.256.66.0F38.WIG 35 /r A V/V AVX2 Zero extend 4 packed 32-bit integers in
VPMOVZXDQ ymm1, xmm2/m128 xmm2/m128 to 4 packed 64-bit integers in ymm1.
EVEX.128.66.0F38.WIG 30 /r B V/V (AVX512VL AND Zero extend 8 packed 8-bit integers in the low 8
VPMOVZXBW xmm1 {k1}{z}, AVX512BW) OR bytes of xmm2/m64 to 8 packed 16-bit integers in
xmm2/m64 AVX10.11 xmm1.
EVEX.256.66.0F38.WIG 30 /r B V/V (AVX512VL AND Zero extend 16 packed 8-bit integers in
VPMOVZXBW ymm1 {k1}{z}, AVX512BW) OR xmm2/m128 to 16 packed 16-bit integers in ymm1.
xmm2/m128 AVX10.11
EVEX.512.66.0F38.WIG 30 /r B V/V AVX512BW Zero extend 32 packed 8-bit integers in
VPMOVZXBW zmm1 {k1}{z}, OR AVX10.11 ymm2/m256 to 32 packed 16-bit integers in zmm1.
ymm2/m256
EVEX.128.66.0F38.WIG 31 /r C V/V (AVX512VL AND Zero extend 4 packed 8-bit integers in the low 4
VPMOVZXBD xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m32 to 4 packed 32-bit integers in
xmm2/m32 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 31 /r C V/V (AVX512VL AND Zero extend 8 packed 8-bit integers in the low 8
VPMOVZXBD ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 8 packed 32-bit integers in
xmm2/m64 AVX10.11 ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 31 /r C V/V AVX512F Zero extend 16 packed 8-bit integers in
VPMOVZXBD zmm1 {k1}{z}, OR AVX10.11 xmm2/m128 to 16 packed 32-bit integers in zmm1
xmm2/m128 subject to writemask k1.
EVEX.128.66.0F38.WIG 32 /r D V/V (AVX512VL AND Zero extend 2 packed 8-bit integers in the low 2
VPMOVZXBQ xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m16 to 2 packed 64-bit integers in
xmm2/m16 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 32 /r D V/V (AVX512VL AND Zero extend 4 packed 8-bit integers in the low 4
VPMOVZXBQ ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m32 to 4 packed 64-bit integers in
xmm2/m32 AVX10.11 ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 32 /r D V/V AVX512F Zero extend 8 packed 8-bit integers in the low 8
VPMOVZXBQ zmm1 {k1}{z}, OR AVX10.11 bytes of xmm2/m64 to 8 packed 64-bit integers in
xmm2/m64 zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 33 /r B V/V (AVX512VL AND Zero extend 4 packed 16-bit integers in the low 8
VPMOVZXWD xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 4 packed 32-bit integers in
xmm2/m64 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 33 /r B V/V (AVX512VL AND Zero extend 8 packed 16-bit integers in
VPMOVZXWD ymm1 {k1}{z}, AVX512F) OR xmm2/m128 to 8 packed 32-bit integers in ymm1
xmm2/m128 AVX10.11 subject to writemask k1.
EVEX.512.66.0F38.WIG 33 /r B V/V AVX512F Zero extend 16 packed 16-bit integers in
VPMOVZXWD zmm1 {k1}{z}, OR AVX10.11 ymm2/m256 to 16 packed 32-bit integers in zmm1
ymm2/m256 subject to writemask k1.
EVEX.128.66.0F38.WIG 34 /r C V/V (AVX512VL AND Zero extend 2 packed 16-bit integers in the low 4
VPMOVZXWQ xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m32 to 2 packed 64-bit integers in
xmm2/m32 AVX10.11 xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 34 /r C V/V (AVX512VL AND Zero extend 4 packed 16-bit integers in the low 8
VPMOVZXWQ ymm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 4 packed 64-bit integers in
xmm2/m64 AVX10.11 ymm1 subject to writemask k1.

EVEX.512.66.0F38.WIG 34 /r C V/V AVX512F Zero extend 8 packed 16-bit integers in
VPMOVZXWQ zmm1 {k1}{z}, OR AVX10.11 xmm2/m128 to 8 packed 64-bit integers in zmm1
xmm2/m128 subject to writemask k1.
EVEX.128.66.0F38.W0 35 /r B V/V (AVX512VL AND Zero extend 2 packed 32-bit integers in the low 8
VPMOVZXDQ xmm1 {k1}{z}, AVX512F) OR bytes of xmm2/m64 to 2 packed 64-bit integers in
xmm2/m64 AVX10.11 xmm1 using writemask k1.
EVEX.256.66.0F38.W0 35 /r B V/V (AVX512VL AND Zero extend 4 packed 32-bit integers in
VPMOVZXDQ ymm1 {k1}{z}, AVX512F) OR xmm2/m128 to 4 packed 64-bit integers in ymm1
xmm2/m128 AVX10.11 using writemask k1.
EVEX.512.66.0F38.W0 35 /r B V/V AVX512F Zero extend 8 packed 32-bit integers in
VPMOVZXDQ zmm1 {k1}{z}, OR AVX10.11 ymm2/m256 to 8 packed 64-bit integers in zmm1
ymm2/m256 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width, which in turn determines which of the instructions listed in the above opcode table are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Half Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Quarter Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Eighth Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Legacy, VEX, and EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the
source operand (second operand) are zero extended to word, dword, or quadword integers and stored in the
destination operand.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the source operand
(second operand) are zero extended to word, dword, or quadword integers and stored to the destination operand
under the writemask. The destination register is an XMM, YMM, or ZMM register.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.
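
The contrast with PMOVSX is easiest to see on a byte whose sign bit is set. The following C fragment is illustrative only (SSE4.1 is assumed):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i bytes = _mm_set1_epi8((char)0xF9); /* 249 unsigned, -7 signed */
    __m128i zx = _mm_cvtepu8_epi16(bytes);     /* (V)PMOVZXBW             */
    __m128i sx = _mm_cvtepi8_epi16(bytes);     /* (V)PMOVSXBW, contrast   */
    printf("%d %d\n",
           (uint16_t)_mm_extract_epi16(zx, 0), /* 249 */
           (int16_t)_mm_extract_epi16(sx, 0)); /* -7  */
    return 0;
}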

Operation
Packed_Zero_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] := ZeroExtend(SRC[7:0]);
DEST[31:16] := ZeroExtend(SRC[15:8]);
DEST[47:32] := ZeroExtend(SRC[23:16]);
DEST[63:48] := ZeroExtend(SRC[31:24]);
DEST[79:64] := ZeroExtend(SRC[39:32]);
DEST[95:80] := ZeroExtend(SRC[47:40]);
DEST[111:96] := ZeroExtend(SRC[55:48]);
DEST[127:112] := ZeroExtend(SRC[63:56]);

Packed_Zero_Extend_BYTE_to_DWORD(DEST, SRC)
DEST[31:0] := ZeroExtend(SRC[7:0]);
DEST[63:32] := ZeroExtend(SRC[15:8]);
DEST[95:64] := ZeroExtend(SRC[23:16]);
DEST[127:96] := ZeroExtend(SRC[31:24]);

Packed_Zero_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[7:0]);
DEST[127:64] := ZeroExtend(SRC[15:8]);

Packed_Zero_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] := ZeroExtend(SRC[15:0]);
DEST[63:32] := ZeroExtend(SRC[31:16]);
DEST[95:64] := ZeroExtend(SRC[47:32]);
DEST[127:96] := ZeroExtend(SRC[63:48]);

Packed_Zero_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[15:0]);
DEST[127:64] := ZeroExtend(SRC[31:16]);

Packed_Zero_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[31:0]);
DEST[127:64] := ZeroExtend(SRC[63:32]);

VPMOVZXBW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[383:256], SRC[191:128])
Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXBD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[383:256], SRC[95:64])
Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXBQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[127:0], SRC[15:0])
IF VL >= 256
Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[255:128], SRC[31:16])
FI;
IF VL >= 512
Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[383:256], SRC[47:32])
Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[511:384], SRC[63:48])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXWD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[383:256], SRC[191:128])
Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXWQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[383:256], SRC[95:64])
Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[127:0], SRC[63:0])
IF VL >= 256
Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[383:256], SRC[191:128])
Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[511:384], SRC[255:192])
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TEMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVZXBW (VEX.256 Encoded Version)


Packed_Zero_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_BYTE_to_WORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVZXBD (VEX.256 Encoded Version)


Packed_Zero_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
Packed_Zero_Extend_BYTE_to_DWORD(DEST[255:128], SRC[63:32])
DEST[MAXVL-1:256] := 0

VPMOVZXBQ (VEX.256 Encoded Version)


Packed_Zero_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
Packed_Zero_Extend_BYTE_to_QWORD(DEST[255:128], SRC[31:16])
DEST[MAXVL-1:256] := 0

VPMOVZXWD (VEX.256 Encoded Version)


Packed_Zero_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_WORD_to_DWORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVZXWQ (VEX.256 Encoded Version)


Packed_Zero_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
Packed_Zero_Extend_WORD_to_QWORD(DEST[255:128], SRC[63:32])
DEST[MAXVL-1:256] := 0

VPMOVZXDQ (VEX.256 Encoded Version)


Packed_Zero_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_DWORD_to_QWORD(DEST[255:128], SRC[127:64])
DEST[MAXVL-1:256] := 0

VPMOVZXBW (VEX.128 Encoded Version)
Packed_Zero_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] := 0

VPMOVZXBD (VEX.128 Encoded Version)


Packed_Zero_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
DEST[MAXVL-1:128] := 0

VPMOVZXBQ (VEX.128 Encoded Version)


Packed_Zero_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
DEST[MAXVL-1:128] := 0

VPMOVZXWD (VEX.128 Encoded Version)


Packed_Zero_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] := 0

VPMOVZXWQ (VEX.128 Encoded Version)


Packed_Zero_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
DEST[MAXVL-1:128] := 0

VPMOVZXDQ (VEX.128 Encoded Version)


Packed_Zero_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] := 0

PMOVZXBW
Packed_Zero_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVZXBD
Packed_Zero_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVZXBQ
Packed_Zero_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVZXWD
Packed_Zero_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVZXWQ
Packed_Zero_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
DEST[MAXVL-1:128] (Unmodified)

PMOVZXDQ
Packed_Zero_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VPMOVZXBW __m512i _mm512_cvtepu8_epi16(__m256i a);
VPMOVZXBW __m512i _mm512_mask_cvtepu8_epi16(__m512i a, __mmask32 k, __m256i b);
VPMOVZXBW __m512i _mm512_maskz_cvtepu8_epi16( __mmask32 k, __m256i b);
VPMOVZXBD __m512i _mm512_cvtepu8_epi32(__m128i a);
VPMOVZXBD __m512i _mm512_mask_cvtepu8_epi32(__m512i a, __mmask16 k, __m128i b);
VPMOVZXBD __m512i _mm512_maskz_cvtepu8_epi32( __mmask16 k, __m128i b);
VPMOVZXBQ __m512i _mm512_cvtepu8_epi64(__m128i a);
VPMOVZXBQ __m512i _mm512_mask_cvtepu8_epi64(__m512i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m512i _mm512_maskz_cvtepu8_epi64( __mmask8 k, __m128i a);
VPMOVZXDQ __m512i _mm512_cvtepu32_epi64(__m256i a);
VPMOVZXDQ __m512i _mm512_mask_cvtepu32_epi64(__m512i a, __mmask8 k, __m256i b);
VPMOVZXDQ __m512i _mm512_maskz_cvtepu32_epi64( __mmask8 k, __m256i a);
VPMOVZXWD __m512i _mm512_cvtepu16_epi32(__m256i a);
VPMOVZXWD __m512i _mm512_mask_cvtepu16_epi32(__m512i a, __mmask16 k, __m256i b);
VPMOVZXWD __m512i _mm512_maskz_cvtepu16_epi32(__mmask16 k, __m256i a);
VPMOVZXWQ __m512i _mm512_cvtepu16_epi64(__m128i a);
VPMOVZXWQ __m512i _mm512_mask_cvtepu16_epi64(__m512i a, __mmask8 k, __m128i b);
VPMOVZXWQ __m512i _mm512_maskz_cvtepu16_epi64( __mmask8 k, __m128i a);
VPMOVZXBW __m256i _mm256_cvtepu8_epi16(__m128i a);
VPMOVZXBW __m256i _mm256_mask_cvtepu8_epi16(__m256i a, __mmask16 k, __m128i b);
VPMOVZXBW __m256i _mm256_maskz_cvtepu8_epi16( __mmask16 k, __m128i b);
VPMOVZXBD __m256i _mm256_cvtepu8_epi32(__m128i a);
VPMOVZXBD __m256i _mm256_mask_cvtepu8_epi32(__m256i a, __mmask8 k, __m128i b);
VPMOVZXBD __m256i _mm256_maskz_cvtepu8_epi32( __mmask8 k, __m128i b);
VPMOVZXBQ __m256i _mm256_cvtepu8_epi64(__m128i a);
VPMOVZXBQ __m256i _mm256_mask_cvtepu8_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m256i _mm256_maskz_cvtepu8_epi64( __mmask8 k, __m128i a);
VPMOVZXDQ __m256i _mm256_cvtepu32_epi64(__m128i a);
VPMOVZXDQ __m256i _mm256_mask_cvtepu32_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXDQ __m256i _mm256_maskz_cvtepu32_epi64( __mmask8 k, __m128i a);
VPMOVZXWD __m256i _mm256_cvtepu16_epi32(__m128i a);
VPMOVZXWD __m256i _mm256_mask_cvtepu16_epi32(__m256i a, __mmask8 k, __m128i b);
VPMOVZXWD __m256i _mm256_maskz_cvtepu16_epi32(__mmask8 k, __m128i a);
VPMOVZXWQ __m256i _mm256_cvtepu16_epi64(__m128i a);
VPMOVZXWQ __m256i _mm256_mask_cvtepu16_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXWQ __m256i _mm256_maskz_cvtepu16_epi64( __mmask8 k, __m128i a);
VPMOVZXBW __m128i _mm_mask_cvtepu8_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBW __m128i _mm_maskz_cvtepu8_epi16( __mmask8 k, __m128i b);
VPMOVZXBD __m128i _mm_mask_cvtepu8_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBD __m128i _mm_maskz_cvtepu8_epi32( __mmask8 k, __m128i b);
VPMOVZXBQ __m128i _mm_mask_cvtepu8_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m128i _mm_maskz_cvtepu8_epi64( __mmask8 k, __m128i a);
VPMOVZXDQ __m128i _mm_mask_cvtepu32_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXDQ __m128i _mm_maskz_cvtepu32_epi64( __mmask8 k, __m128i a);
VPMOVZXWD __m128i _mm_mask_cvtepu16_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVZXWD __m128i _mm_maskz_cvtepu16_epi32(__mmask8 k, __m128i a);
VPMOVZXWQ __m128i _mm_mask_cvtepu16_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXWQ __m128i _mm_maskz_cvtepu16_epi64( __mmask8 k, __m128i a);
PMOVZXBW __m128i _mm_cvtepu8_epi16(__m128i a);
PMOVZXBD __m128i _mm_cvtepu8_epi32(__m128i a);
PMOVZXBQ __m128i _mm_cvtepu8_epi64(__m128i a);
PMOVZXWD __m128i _mm_cvtepu16_epi32(__m128i a);
PMOVZXWQ __m128i _mm_cvtepu16_epi64(__m128i a);
PMOVZXDQ __m128i _mm_cvtepu32_epi64(__m128i a);
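
The following C fragment is a non-normative usage sketch, assuming SSE4.1 support and the <smmintrin.h> header; the helper name widen_u8_to_u16 is illustrative only. It widens eight unsigned bytes to 16-bit words, for which compilers generally emit PMOVZXBW:

#include <smmintrin.h> // SSE4.1 intrinsics
#include <stdint.h>

// Zero-extend the low 8 bytes of src into eight 16-bit words.
__m128i widen_u8_to_u16(const uint8_t *src)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src); // load 8 bytes
    return _mm_cvtepu8_epi16(bytes);                       // PMOVZXBW
}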

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.

PMULDQ—Multiply Packed Doubleword Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 38 28 /r A V/V SSE4_1 Multiply packed signed doubleword integers in xmm1 by
PMULDQ xmm1, xmm2/m128 packed signed doubleword integers in xmm2/m128, and
store the quadword results in xmm1.
VEX.128.66.0F38.WIG 28 /r B V/V AVX Multiply packed signed doubleword integers in xmm2 by
VPMULDQ xmm1, xmm2, packed signed doubleword integers in xmm3/m128, and
xmm3/m128 store the quadword results in xmm1.
VEX.256.66.0F38.WIG 28 /r B V/V AVX2 Multiply packed signed doubleword integers in ymm2 by
VPMULDQ ymm1, ymm2, packed signed doubleword integers in ymm3/m256, and
ymm3/m256 store the quadword results in ymm1.
EVEX.128.66.0F38.W1 28 /r C V/V (AVX512VL AND Multiply packed signed doubleword integers in xmm2 by
VPMULDQ xmm1 {k1}{z}, xmm2, AVX512F) OR packed signed doubleword integers in
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst, and store the quadword results
in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 28 /r C V/V (AVX512VL AND Multiply packed signed doubleword integers in ymm2 by
VPMULDQ ymm1 {k1}{z}, ymm2, AVX512F) OR packed signed doubleword integers in
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst, and store the quadword results
in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 28 /r C V/V AVX512F Multiply packed signed doubleword integers in zmm2 by
VPMULDQ zmm1 {k1}{z}, zmm2, OR AVX10.11 packed signed doubleword integers in
zmm3/m512/m64bcst zmm3/m512/m64bcst, and store the quadword results
in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies packed signed doubleword integers in the even-numbered (zero-based reference) elements of the first
source operand with the packed signed doubleword integers in the corresponding elements of the second source
operand and stores packed signed quadword results in the destination operand.
128-bit Legacy SSE version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword element. For 128-bit memory operands, 128 bits are
fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination XMM operand are the same. The second source operand can be an XMM register or 128-
bit memory location. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword element. For 128-bit memory operands, 128 bits are
fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination operand are XMM registers. The second source operand can be an XMM register or
128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX.256 encoded version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first, 3rd, 5th, 7th doubleword element. For 256-bit memory operands, 256 bits are
fetched from memory, but only the four even-numbered doublewords are used in the computation. The first source
operand and the destination operand are YMM registers. The second source operand can be a YMM register or 256-
bit memory location. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.
EVEX encoded version: The input signed doubleword integers are taken from the even-numbered elements of the
source operands. The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-
bit memory location. The destination is a ZMM/YMM/XMM register, updated according to the writemask at 64-
bit granularity.
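
Because only the even-numbered doublewords participate, obtaining all four 32x32-to-64-bit signed products of two XMM registers typically takes two PMULDQ operations. The following C sketch is non-normative and illustrative only (the helper name mul_epi32_all is hypothetical); it assumes SSE4.1 support and <smmintrin.h>:

#include <smmintrin.h> // SSE4.1 for _mm_mul_epi32

// Full 32x32 -> 64-bit signed products of all four dword elements:
// PMULDQ on the even elements, then on the odd elements shifted down
// into the even positions. _mm_mul_epi32 sign-extends the low dword of
// each 64-bit lane, so the zero fill from the shift is ignored.
static void mul_epi32_all(__m128i a, __m128i b, __m128i *even, __m128i *odd)
{
    *even = _mm_mul_epi32(a, b);                 // products of elements 0 and 2
    *odd  = _mm_mul_epi32(_mm_srli_epi64(a, 32), // products of elements 1 and 3
                          _mm_srli_epi64(b, 32));
}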

Operation
VPMULDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SignExtend64( SRC1[i+31:i]) * SignExtend64( SRC2[31:0])
ELSE DEST[i+63:i] := SignExtend64( SRC1[i+31:i]) * SignExtend64( SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMULDQ (VEX.256 Encoded Version)


DEST[63:0] := SignExtend64( SRC1[31:0]) * SignExtend64( SRC2[31:0])
DEST[127:64] := SignExtend64( SRC1[95:64]) * SignExtend64( SRC2[95:64])
DEST[191:128] := SignExtend64( SRC1[159:128]) * SignExtend64( SRC2[159:128])
DEST[255:192] := SignExtend64( SRC1[223:192]) * SignExtend64( SRC2[223:192])
DEST[MAXVL-1:256] := 0

VPMULDQ (VEX.128 Encoded Version)


DEST[63:0] := SignExtend64( SRC1[31:0]) * SignExtend64( SRC2[31:0])
DEST[127:64] := SignExtend64( SRC1[95:64]) * SignExtend64( SRC2[95:64])
DEST[MAXVL-1:128] := 0

PMULDQ (128-bit Legacy SSE Version)


DEST[63:0] := SignExtend64( DEST[31:0]) * SignExtend64( SRC[31:0])
DEST[127:64] := SignExtend64( DEST[95:64]) * SignExtend64( SRC[95:64])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VPMULDQ __m512i _mm512_mul_epi32(__m512i a, __m512i b);
VPMULDQ __m512i _mm512_mask_mul_epi32(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMULDQ __m512i _mm512_maskz_mul_epi32( __mmask8 k, __m512i a, __m512i b);
VPMULDQ __m256i _mm256_mask_mul_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULDQ __m256i _mm256_maskz_mul_epi32( __mmask8 k, __m256i a, __m256i b);
VPMULDQ __m128i _mm_mask_mul_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULDQ __m128i _mm_maskz_mul_epi32( __mmask8 k, __m128i a, __m128i b);
(V)PMULDQ __m128i _mm_mul_epi32( __m128i a, __m128i b);
VPMULDQ __m256i _mm256_mul_epi32( __m256i a, __m256i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PMULHRSW—Packed Multiply High With Round and Scale
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 38 0B /r1 A V/V SSSE3 Multiply 16-bit signed words, scale and
PMULHRSW mm1, mm2/m64 round signed doublewords, pack high 16
bits to mm1.
66 0F 38 0B /r A V/V SSSE3 Multiply 16-bit signed words, scale and
PMULHRSW xmm1, xmm2/m128 round signed doublewords, pack high 16
bits to xmm1.
VEX.128.66.0F38.WIG 0B /r B V/V AVX Multiply 16-bit signed words, scale and
VPMULHRSW xmm1, xmm2, xmm3/m128 round signed doublewords, pack high 16
bits to xmm1.
VEX.256.66.0F38.WIG 0B /r B V/V AVX2 Multiply 16-bit signed words, scale and
VPMULHRSW ymm1, ymm2, ymm3/m256 round signed doublewords, pack high 16
bits to ymm1.
EVEX.128.66.0F38.WIG 0B /r C V/V (AVX512VL AND Multiply 16-bit signed words, scale and
VPMULHRSW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR round signed doublewords, pack high 16
AVX10.12 bits to xmm1 under writemask k1.
EVEX.256.66.0F38.WIG 0B /r C V/V (AVX512VL AND Multiply 16-bit signed words, scale and
VPMULHRSW ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR round signed doublewords, pack high 16
AVX10.12 bits to ymm1 under writemask k1.
EVEX.512.66.0F38.WIG 0B /r C V/V AVX512BW Multiply 16-bit signed words, scale and
VPMULHRSW zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 round signed doublewords, pack high 16
bits to zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
PMULHRSW multiplies vertically each signed 16-bit integer from the destination operand (first operand) with the
corresponding signed 16-bit integer of the source operand (second operand), producing intermediate, signed 32-
bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always
performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by
selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and
packed to the destination operand.
When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a
general-protection exception (#GP) will be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers.

Legacy SSE version 64-bit operand: Both operands can be MMX registers. The second source operand is an MMX
register or a 64-bit memory location.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.
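
As a non-normative aid to the pseudocode below, the following scalar C model computes one 16-bit element of the result; the helper name mulhrs16 is illustrative only:

#include <stdint.h>

// Scalar model of one PMULHRSW element: multiply, keep the 18 most
// significant bits of the 32-bit product, round by adding 1 to the LSB
// of the 18-bit intermediate, then select bits 16:1.
static int16_t mulhrs16(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * (int32_t)b) >> 14; // 18 significant bits remain
    t += 1;                                      // round
    return (int16_t)(t >> 1);                    // select bits 16:1
}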

Operation
PMULHRSW (With 64-bit Operands)
temp0[31:0] := INT32 ((DEST[15:0] * SRC[15:0]) >>14) + 1;
temp1[31:0] := INT32 ((DEST[31:16] * SRC[31:16]) >>14) + 1;
temp2[31:0] := INT32 ((DEST[47:32] * SRC[47:32]) >>14) + 1;
temp3[31:0] := INT32 ((DEST[63:48] * SRC[63:48]) >>14) + 1;
DEST[15:0] := temp0[16:1];
DEST[31:16] := temp1[16:1];
DEST[47:32] := temp2[16:1];
DEST[63:48] := temp3[16:1];

PMULHRSW (With 128-bit Operands)


temp0[31:0] := INT32 ((DEST[15:0] * SRC[15:0]) >>14) + 1;
temp1[31:0] := INT32 ((DEST[31:16] * SRC[31:16]) >>14) + 1;
temp2[31:0] := INT32 ((DEST[47:32] * SRC[47:32]) >>14) + 1;
temp3[31:0] := INT32 ((DEST[63:48] * SRC[63:48]) >>14) + 1;
temp4[31:0] := INT32 ((DEST[79:64] * SRC[79:64]) >>14) + 1;
temp5[31:0] := INT32 ((DEST[95:80] * SRC[95:80]) >>14) + 1;
temp6[31:0] := INT32 ((DEST[111:96] * SRC[111:96]) >>14) + 1;
temp7[31:0] := INT32 ((DEST[127:112] * SRC[127:112]) >>14) + 1;
DEST[15:0] := temp0[16:1];
DEST[31:16] := temp1[16:1];
DEST[47:32] := temp2[16:1];
DEST[63:48] := temp3[16:1];
DEST[79:64] := temp4[16:1];
DEST[95:80] := temp5[16:1];
DEST[111:96] := temp6[16:1];
DEST[127:112] := temp7[16:1];

VPMULHRSW (VEX.128 Encoded Version)


temp0[31:0] := INT32 ((SRC1[15:0] * SRC2[15:0]) >>14) + 1
temp1[31:0] := INT32 ((SRC1[31:16] * SRC2[31:16]) >>14) + 1
temp2[31:0] := INT32 ((SRC1[47:32] * SRC2[47:32]) >>14) + 1
temp3[31:0] := INT32 ((SRC1[63:48] * SRC2[63:48]) >>14) + 1
temp4[31:0] := INT32 ((SRC1[79:64] * SRC2[79:64]) >>14) + 1
temp5[31:0] := INT32 ((SRC1[95:80] * SRC2[95:80]) >>14) + 1
temp6[31:0] := INT32 ((SRC1[111:96] * SRC2[111:96]) >>14) + 1
temp7[31:0] := INT32 ((SRC1[127:112] * SRC2[127:112]) >>14) + 1
DEST[15:0] := temp0[16:1]
DEST[31:16] := temp1[16:1]
DEST[47:32] := temp2[16:1]
DEST[63:48] := temp3[16:1]
DEST[79:64] := temp4[16:1]
DEST[95:80] := temp5[16:1]
DEST[111:96] := temp6[16:1]
DEST[127:112] := temp7[16:1]
DEST[MAXVL-1:128] := 0

VPMULHRSW (VEX.256 Encoded Version)


temp0[31:0] := INT32 ((SRC1[15:0] * SRC2[15:0]) >>14) + 1
temp1[31:0] := INT32 ((SRC1[31:16] * SRC2[31:16]) >>14) + 1
temp2[31:0] := INT32 ((SRC1[47:32] * SRC2[47:32]) >>14) + 1
temp3[31:0] := INT32 ((SRC1[63:48] * SRC2[63:48]) >>14) + 1
temp4[31:0] := INT32 ((SRC1[79:64] * SRC2[79:64]) >>14) + 1
temp5[31:0] := INT32 ((SRC1[95:80] * SRC2[95:80]) >>14) + 1
temp6[31:0] := INT32 ((SRC1[111:96] * SRC2[111:96]) >>14) + 1
temp7[31:0] := INT32 ((SRC1[127:112] * SRC2[127:112]) >>14) + 1
temp8[31:0] := INT32 ((SRC1[143:128] * SRC2[143:128]) >>14) + 1
temp9[31:0] := INT32 ((SRC1[159:144] * SRC2[159:144]) >>14) + 1
temp10[31:0] := INT32 ((SRC1[175:160] * SRC2[175:160]) >>14) + 1
temp11[31:0] := INT32 ((SRC1[191:176] * SRC2[191:176]) >>14) + 1
temp12[31:0] := INT32 ((SRC1[207:192] * SRC2[207:192]) >>14) + 1
temp13[31:0] := INT32 ((SRC1[223:208] * SRC2[223:208]) >>14) + 1
temp14[31:0] := INT32 ((SRC1[239:224] * SRC2[239:224]) >>14) + 1
temp15[31:0] := INT32 ((SRC1[255:240] * SRC2[255:240]) >>14) + 1

DEST[15:0] := temp0[16:1]
DEST[31:16] := temp1[16:1]
DEST[47:32] := temp2[16:1]
DEST[63:48] := temp3[16:1]
DEST[79:64] := temp4[16:1]
DEST[95:80] := temp5[16:1]
DEST[111:96] := temp6[16:1]
DEST[127:112] := temp7[16:1]
DEST[143:128] := temp8[16:1]
DEST[159:144] := temp9[16:1]
DEST[175:160] := temp10[16:1]
DEST[191:176] := temp11[16:1]
DEST[207:192] := temp12[16:1]
DEST[223:208] := temp13[16:1]
DEST[239:224] := temp14[16:1]
DEST[255:240] := temp15[16:1]
DEST[MAXVL-1:256] := 0

VPMULHRSW (EVEX Encoded Version)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
temp[31:0] := ((SRC1[i+15:i] * SRC2[i+15:i]) >>14) + 1
DEST[i+15:i] := temp[16:1]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPMULHRSW __m512i _mm512_mulhrs_epi16(__m512i a, __m512i b);
VPMULHRSW __m512i _mm512_mask_mulhrs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMULHRSW __m512i _mm512_maskz_mulhrs_epi16( __mmask32 k, __m512i a, __m512i b);
VPMULHRSW __m256i _mm256_mask_mulhrs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMULHRSW __m256i _mm256_maskz_mulhrs_epi16( __mmask16 k, __m256i a, __m256i b);
VPMULHRSW __m128i _mm_mask_mulhrs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULHRSW __m128i _mm_maskz_mulhrs_epi16( __mmask8 k, __m128i a, __m128i b);
PMULHRSW __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b)
(V)PMULHRSW __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b)
VPMULHRSW __m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)
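
A non-normative usage sketch, assuming SSSE3 support and <tmmintrin.h> (the helper name q15_mul is illustrative): PMULHRSW implements a rounded Q15 fixed-point multiply, e.g., 0x4000 (0.5) times 0x2000 (0.25) yields 0x1000 (0.125):

#include <tmmintrin.h> // SSSE3 intrinsics

// Q15 fixed-point multiply with rounding: each 16-bit lane holds a
// fraction in [-1.0, 1.0).
__m128i q15_mul(__m128i x, __m128i y)
{
    return _mm_mulhrs_epi16(x, y); // PMULHRSW
}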

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMULHUW—Multiply Packed Unsigned Integers and Store High Result
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E4 /r1 A V/V SSE Multiply the packed unsigned word integers
PMULHUW mm1, mm2/m64 in mm1 register and mm2/m64, and store the
high 16 bits of the results in mm1.
66 0F E4 /r A V/V SSE2 Multiply the packed unsigned word integers
PMULHUW xmm1, xmm2/m128 in xmm1 and xmm2/m128, and store the
high 16 bits of the results in xmm1.
VEX.128.66.0F.WIG E4 /r B V/V AVX Multiply the packed unsigned word integers
VPMULHUW xmm1, xmm2, xmm3/m128 in xmm2 and xmm3/m128, and store the
high 16 bits of the results in xmm1.
VEX.256.66.0F.WIG E4 /r B V/V AVX2 Multiply the packed unsigned word integers
VPMULHUW ymm1, ymm2, ymm3/m256 in ymm2 and ymm3/m256, and store the
high 16 bits of the results in ymm1.
EVEX.128.66.0F.WIG E4 /r C V/V (AVX512VL AND Multiply the packed unsigned word integers
VPMULHUW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR in xmm2 and xmm3/m128, and store the
AVX10.12 high 16 bits of the results in xmm1 under
writemask k1.
EVEX.256.66.0F.WIG E4 /r C V/V (AVX512VL AND Multiply the packed unsigned word integers
VPMULHUW ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR in ymm2 and ymm3/m256, and store the
AVX10.12 high 16 bits of the results in ymm1 under
writemask k1.
EVEX.512.66.0F.WIG E4 /r C V/V AVX512BW Multiply the packed unsigned word integers
VPMULHUW zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 in zmm2 and zmm3/m512, and store the
high 16 bits of the results in zmm1 under
writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD unsigned multiply of the packed unsigned word integers in the destination operand (first operand)
and the source operand (second operand), and stores the high 16 bits of each 32-bit intermediate result in the
destination operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.

SRC X3 X2 X1 X0

DEST Y3 Y2 Y1 Y0

TEMP Z3 = X3 ∗ Y3 Z2 = X2 ∗ Y2 Z1 = X1 ∗ Y1 Z0 = X0 ∗ Y0

DEST Z3[31:16] Z2[31:16] Z1[31:16] Z0[31:16]

Figure 4-12. PMULHUW and PMULHW Instruction Operation Using 64-bit Operands

Operation
PMULHUW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Unsigned multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];

PMULHUW (With 128-bit Operands)


TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Unsigned multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
TEMP4[31:0] := DEST[79:64] ∗ SRC[79:64];
TEMP5[31:0] := DEST[95:80] ∗ SRC[95:80];
TEMP6[31:0] := DEST[111:96] ∗ SRC[111:96];
TEMP7[31:0] := DEST[127:112] ∗ SRC[127:112];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];
DEST[79:64] := TEMP4[31:16];
DEST[95:80] := TEMP5[31:16];
DEST[111:96] := TEMP6[31:16];
DEST[127:112] := TEMP7[31:16];

VPMULHUW (VEX.128 Encoded Version)


TEMP0[31:0] := SRC1[15:0] * SRC2[15:0]
TEMP1[31:0] := SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] := SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] := SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] := SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] := SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] := SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] := SRC1[127:112] * SRC2[127:112]
DEST[15:0] := TEMP0[31:16]
DEST[31:16] := TEMP1[31:16]
DEST[47:32] := TEMP2[31:16]
DEST[63:48] := TEMP3[31:16]
DEST[79:64] := TEMP4[31:16]
DEST[95:80] := TEMP5[31:16]
DEST[111:96] := TEMP6[31:16]
DEST[127:112] := TEMP7[31:16]
DEST[MAXVL-1:128] := 0

VPMULHUW (VEX.256 Encoded Version)


TEMP0[31:0] := SRC1[15:0] * SRC2[15:0]
TEMP1[31:0] := SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] := SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] := SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] := SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] := SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] := SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] := SRC1[127:112] * SRC2[127:112]
TEMP8[31:0] := SRC1[143:128] * SRC2[143:128]
TEMP9[31:0] := SRC1[159:144] * SRC2[159:144]
TEMP10[31:0] := SRC1[175:160] * SRC2[175:160]
TEMP11[31:0] := SRC1[191:176] * SRC2[191:176]
TEMP12[31:0] := SRC1[207:192] * SRC2[207:192]
TEMP13[31:0] := SRC1[223:208] * SRC2[223:208]
TEMP14[31:0] := SRC1[239:224] * SRC2[239:224]
TEMP15[31:0] := SRC1[255:240] * SRC2[255:240]
DEST[15:0] := TEMP0[31:16]
DEST[31:16] := TEMP1[31:16]
DEST[47:32] := TEMP2[31:16]
DEST[63:48] := TEMP3[31:16]
DEST[79:64] := TEMP4[31:16]
DEST[95:80] := TEMP5[31:16]
DEST[111:96] := TEMP6[31:16]
DEST[127:112] := TEMP7[31:16]
DEST[143:128] := TEMP8[31:16]
DEST[159:144] := TEMP9[31:16]
DEST[175:160] := TEMP10[31:16]
DEST[191:176] := TEMP11[31:16]
DEST[207:192] := TEMP12[31:16]
DEST[223:208] := TEMP13[31:16]
DEST[239:224] := TEMP14[31:16]
DEST[255:240] := TEMP15[31:16]
DEST[MAXVL-1:256] := 0

VPMULHUW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
temp[31:0] := SRC1[i+15:i] * SRC2[i+15:i]
DEST[i+15:i] := temp[31:16]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPMULHUW __m512i _mm512_mulhi_epu16(__m512i a, __m512i b);
VPMULHUW __m512i _mm512_mask_mulhi_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMULHUW __m512i _mm512_maskz_mulhi_epu16( __mmask32 k, __m512i a, __m512i b);
VPMULHUW __m256i _mm256_mask_mulhi_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMULHUW __m256i _mm256_maskz_mulhi_epu16( __mmask16 k, __m256i a, __m256i b);
VPMULHUW __m128i _mm_mask_mulhi_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULHUW __m128i _mm_maskz_mulhi_epu16( __mmask8 k, __m128i a, __m128i b);
PMULHUW __m64 _mm_mulhi_pu16(__m64 a, __m64 b)
(V)PMULHUW __m128i _mm_mulhi_epu16 ( __m128i a, __m128i b)
VPMULHUW __m256i _mm256_mulhi_epu16 ( __m256i a, __m256i b)
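
A non-normative usage sketch, assuming SSE2 support and <emmintrin.h> (the helper name is illustrative): taking the high 16 bits of an unsigned product computes (x * f) >> 16, i.e., scaling by a 0.16 fixed-point fraction:

#include <emmintrin.h> // SSE2 intrinsics

// Scale eight unsigned 16-bit values by the fraction 0x4000/65536 = 0.25;
// each lane becomes x >> 2.
__m128i scale_u16_by_quarter(__m128i x)
{
    return _mm_mulhi_epu16(x, _mm_set1_epi16(0x4000)); // PMULHUW
}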

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMULHW—Multiply Packed Signed Integers and Store High Result
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E5 /r1 A V/V MMX Multiply the packed signed word integers in mm1
PMULHW mm, mm/m64 register and mm2/m64, and store the high 16
bits of the results in mm1.
66 0F E5 /r A V/V SSE2 Multiply the packed signed word integers in
PMULHW xmm1, xmm2/m128 xmm1 and xmm2/m128, and store the high 16
bits of the results in xmm1.
VEX.128.66.0F.WIG E5 /r B V/V AVX Multiply the packed signed word integers in
VPMULHW xmm1, xmm2, xmm3/m128 xmm2 and xmm3/m128, and store the high 16
bits of the results in xmm1.
VEX.256.66.0F.WIG E5 /r B V/V AVX2 Multiply the packed signed word integers in
VPMULHW ymm1, ymm2, ymm3/m256 ymm2 and ymm3/m256, and store the high 16
bits of the results in ymm1.
EVEX.128.66.0F.WIG E5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULHW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm2 and xmm3/m128, and store the high 16
xmm3/m128 AVX10.12 bits of the results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG E5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULHW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm2 and ymm3/m256, and store the high 16
ymm3/m256 AVX10.12 bits of the results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG E5 /r C V/V AVX512BW Multiply the packed signed word integers in
VPMULHW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512, and store the high 16
zmm3/m512 bits of the results in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the high 16 bits of each intermediate 32-bit result in the destina-
tion operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.

Operation
PMULHW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];

PMULHW (With 128-bit Operands)


TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
TEMP4[31:0] := DEST[79:64] ∗ SRC[79:64];
TEMP5[31:0] := DEST[95:80] ∗ SRC[95:80];
TEMP6[31:0] := DEST[111:96] ∗ SRC[111:96];
TEMP7[31:0] := DEST[127:112] ∗ SRC[127:112];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];
DEST[79:64] := TEMP4[31:16];
DEST[95:80] := TEMP5[31:16];
DEST[111:96] := TEMP6[31:16];
DEST[127:112] := TEMP7[31:16];

VPMULHW (VEX.128 Encoded Version)


TEMP0[31:0] := SRC1[15:0] * SRC2[15:0] (*Signed Multiplication*)
TEMP1[31:0] := SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] := SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] := SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] := SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] := SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] := SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] := SRC1[127:112] * SRC2[127:112]
DEST[15:0] := TEMP0[31:16]
DEST[31:16] := TEMP1[31:16]
DEST[47:32] := TEMP2[31:16]
DEST[63:48] := TEMP3[31:16]
DEST[79:64] := TEMP4[31:16]
DEST[95:80] := TEMP5[31:16]
DEST[111:96] := TEMP6[31:16]
DEST[127:112] := TEMP7[31:16]
DEST[MAXVL-1:128] := 0

VPMULHW (VEX.256 Encoded Version)


TEMP0[31:0] := SRC1[15:0] * SRC2[15:0] (*Signed Multiplication*)
TEMP1[31:0] := SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] := SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] := SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] := SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] := SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] := SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] := SRC1[127:112] * SRC2[127:112]
TEMP8[31:0] := SRC1[143:128] * SRC2[143:128]
TEMP9[31:0] := SRC1[159:144] * SRC2[159:144]
TEMP10[31:0] := SRC1[175:160] * SRC2[175:160]
TEMP11[31:0] := SRC1[191:176] * SRC2[191:176]
TEMP12[31:0] := SRC1[207:192] * SRC2[207:192]
TEMP13[31:0] := SRC1[223:208] * SRC2[223:208]
TEMP14[31:0] := SRC1[239:224] * SRC2[239:224]
TEMP15[31:0] := SRC1[255:240] * SRC2[255:240]
DEST[15:0] := TEMP0[31:16]
DEST[31:16] := TEMP1[31:16]
DEST[47:32] := TEMP2[31:16]
DEST[63:48] := TEMP3[31:16]
DEST[79:64] := TEMP4[31:16]
DEST[95:80] := TEMP5[31:16]
DEST[111:96] := TEMP6[31:16]
DEST[127:112] := TEMP7[31:16]
DEST[143:128] := TEMP8[31:16]
DEST[159:144] := TEMP9[31:16]
DEST[175:160] := TEMP10[31:16]
DEST[191:176] := TEMP11[31:16]
DEST[207:192] := TEMP12[31:16]
DEST[223:208] := TEMP13[31:16]
DEST[239:224] := TEMP14[31:16]
DEST[255:240] := TEMP15[31:16]
DEST[MAXVL-1:256] := 0

VPMULHW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
temp[31:0] := SRC1[i+15:i] * SRC2[i+15:i]
DEST[i+15:i] := temp[31:16]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPMULHW __m512i _mm512_mulhi_epi16(__m512i a, __m512i b);
VPMULHW __m512i _mm512_mask_mulhi_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMULHW __m512i _mm512_maskz_mulhi_epi16( __mmask32 k, __m512i a, __m512i b);
VPMULHW __m256i _mm256_mask_mulhi_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMULHW __m256i _mm256_maskz_mulhi_epi16( __mmask16 k, __m256i a, __m256i b);
VPMULHW __m128i _mm_mask_mulhi_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULHW __m128i _mm_maskz_mulhi_epi16( __mmask8 k, __m128i a, __m128i b);
PMULHW __m64 _mm_mulhi_pi16 (__m64 m1, __m64 m2)
(V)PMULHW __m128i _mm_mulhi_epi16 ( __m128i a, __m128i b)
VPMULHW __m256i _mm256_mulhi_epi16 ( __m256i a, __m256i b)
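
A non-normative usage sketch, assuming SSE2 support and <emmintrin.h> (the helper name is illustrative): the full 32-bit signed products of the low four word lanes can be reconstructed by interleaving PMULLW and PMULHW results:

#include <emmintrin.h> // SSE2 intrinsics

// Recover four full signed 32-bit products from the low four word lanes.
__m128i mul_lo4_epi16_to_epi32(__m128i a, __m128i b)
{
    __m128i lo = _mm_mullo_epi16(a, b); // low 16 bits of each product
    __m128i hi = _mm_mulhi_epi16(a, b); // high 16 bits of each product
    return _mm_unpacklo_epi16(lo, hi);  // interleave into 32-bit products
}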

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMULLD/PMULLQ—Multiply Packed Integers and Store Low Result
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
66 0F 38 40 /r A V/V SSE4_1 Multiply the packed dword signed integers in xmm1 and
PMULLD xmm1, xmm2/m128 xmm2/m128 and store the low 32 bits of each product
in xmm1.
VEX.128.66.0F38.WIG 40 /r B V/V AVX Multiply the packed dword signed integers in xmm2 and
VPMULLD xmm1, xmm2, xmm3/m128 and store the low 32 bits of each product
xmm3/m128 in xmm1.
VEX.256.66.0F38.WIG 40 /r B V/V AVX2 Multiply the packed dword signed integers in ymm2 and
VPMULLD ymm1, ymm2, ymm3/m256 and store the low 32 bits of each product
ymm3/m256 in ymm1.
EVEX.128.66.0F38.W0 40 /r C V/V (AVX512VL AND Multiply the packed dword signed integers in xmm2 and
VPMULLD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst and store the low 32 bits of
xmm3/m128/m32bcst AVX10.11 each product in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 40 /r C V/V (AVX512VL AND Multiply the packed dword signed integers in ymm2 and
VPMULLD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst and store the low 32 bits of
ymm3/m256/m32bcst AVX10.11 each product in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 40 /r C V/V AVX512F Multiply the packed dword signed integers in zmm2 and
VPMULLD zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst and store the low 32 bits of
zmm3/m512/m32bcst each product in zmm1 under writemask k1.
EVEX.128.66.0F38.W1 40 /r C V/V (AVX512VL AND Multiply the packed qword signed integers in xmm2 and
VPMULLQ xmm1 {k1}{z}, xmm2, AVX512DQ) OR xmm3/m128/m64bcst and store the low 64 bits of
xmm3/m128/m64bcst AVX10.11 each product in xmm1 under writemask k1.
EVEX.256.66.0F38.W1 40 /r C V/V (AVX512VL AND Multiply the packed qword signed integers in ymm2 and
VPMULLQ ymm1 {k1}{z}, ymm2, AVX512DQ) OR ymm3/m256/m64bcst and store the low 64 bits of
ymm3/m256/m64bcst AVX10.11 each product in ymm1 under writemask k1.
EVEX.512.66.0F38.W1 40 /r C V/V AVX512DQ Multiply the packed qword signed integers in zmm2 and
VPMULLQ zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m64bcst and store the low 64 bits of
zmm3/m512/m64bcst each product in zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD signed multiply of the packed signed dword/qword integers from each element of the first source
operand with the corresponding element in the second source operand. The low 32/64 bits of each 64/128-bit
intermediate result are stored to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding ZMM destina-
tion register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding ZMM register
are zeroed.
VEX.256 encoded version: The first source operand is a YMM register; The second source operand is a YMM register
or 256-bit memory location. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.

Operation
VPMULLQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN Temp[127:0] := SRC1[i+63:i] * SRC2[63:0]
ELSE Temp[127:0] := SRC1[i+63:i] * SRC2[i+63:i]
FI;
DEST[i+63:i] := Temp[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMULLD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN Temp[63:0] := SRC1[i+31:i] * SRC2[31:0]
ELSE Temp[63:0] := SRC1[i+31:i] * SRC2[i+31:i]
FI;
DEST[i+31:i] := Temp[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMULLD (VEX.256 Encoded Version)
Temp0[63:0] := SRC1[31:0] * SRC2[31:0]
Temp1[63:0] := SRC1[63:32] * SRC2[63:32]
Temp2[63:0] := SRC1[95:64] * SRC2[95:64]
Temp3[63:0] := SRC1[127:96] * SRC2[127:96]
Temp4[63:0] := SRC1[159:128] * SRC2[159:128]
Temp5[63:0] := SRC1[191:160] * SRC2[191:160]
Temp6[63:0] := SRC1[223:192] * SRC2[223:192]
Temp7[63:0] := SRC1[255:224] * SRC2[255:224]

DEST[31:0] := Temp0[31:0]
DEST[63:32] := Temp1[31:0]
DEST[95:64] := Temp2[31:0]
DEST[127:96] := Temp3[31:0]
DEST[159:128] := Temp4[31:0]
DEST[191:160] := Temp5[31:0]
DEST[223:192] := Temp6[31:0]
DEST[255:224] := Temp7[31:0]
DEST[MAXVL-1:256] := 0

VPMULLD (VEX.128 Encoded Version)


Temp0[63:0] := SRC1[31:0] * SRC2[31:0]
Temp1[63:0] := SRC1[63:32] * SRC2[63:32]
Temp2[63:0] := SRC1[95:64] * SRC2[95:64]
Temp3[63:0] := SRC1[127:96] * SRC2[127:96]
DEST[31:0] := Temp0[31:0]
DEST[63:32] := Temp1[31:0]
DEST[95:64] := Temp2[31:0]
DEST[127:96] := Temp3[31:0]
DEST[MAXVL-1:128] := 0

PMULLD (128-bit Legacy SSE Version)


Temp0[63:0] := DEST[31:0] * SRC[31:0]
Temp1[63:0] := DEST[63:32] * SRC[63:32]
Temp2[63:0] := DEST[95:64] * SRC[95:64]
Temp3[63:0] := DEST[127:96] * SRC[127:96]
DEST[31:0] := Temp0[31:0]
DEST[63:32] := Temp1[31:0]
DEST[95:64] := Temp2[31:0]
DEST[127:96] := Temp3[31:0]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VPMULLD __m512i _mm512_mullo_epi32(__m512i a, __m512i b);
VPMULLD __m512i _mm512_mask_mullo_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMULLD __m512i _mm512_maskz_mullo_epi32( __mmask16 k, __m512i a, __m512i b);
VPMULLD __m256i _mm256_mask_mullo_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULLD __m256i _mm256_maskz_mullo_epi32( __mmask8 k, __m256i a, __m256i b);
VPMULLD __m128i _mm_mask_mullo_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULLD __m128i _mm_maskz_mullo_epi32( __mmask8 k, __m128i a, __m128i b);
VPMULLD __m256i _mm256_mullo_epi32(__m256i a, __m256i b);
PMULLD __m128i _mm_mullo_epi32(__m128i a, __m128i b);
VPMULLQ __m512i _mm512_mullo_epi64(__m512i a, __m512i b);
VPMULLQ __m512i _mm512_mask_mullo_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMULLQ __m512i _mm512_maskz_mullo_epi64( __mmask8 k, __m512i a, __m512i b);
VPMULLQ __m256i _mm256_mullo_epi64(__m256i a, __m256i b);
VPMULLQ __m256i _mm256_mask_mullo_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULLQ __m256i _mm256_maskz_mullo_epi64( __mmask8 k, __m256i a, __m256i b);
VPMULLQ __m128i _mm_mullo_epi64(__m128i a, __m128i b);
VPMULLQ __m128i _mm_mask_mullo_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULLQ __m128i _mm_maskz_mullo_epi64( __mmask8 k, __m128i a, __m128i b);
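
A non-normative usage sketch, assuming SSE4.1 support and <smmintrin.h> (the helper name is illustrative): PMULLD is convenient for scaling packed 32-bit indices when only the low 32 bits of each product are wanted:

#include <smmintrin.h> // SSE4.1 for _mm_mullo_epi32

// Convert four element indices to byte offsets for records of
// `stride` bytes; the low 32 bits of each product are kept.
__m128i indices_to_offsets(__m128i idx, int stride)
{
    return _mm_mullo_epi32(idx, _mm_set1_epi32(stride)); // PMULLD
}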

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PMULLW—Multiply Packed Signed Integers and Store Low Result
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F D5 /r1 A V/V MMX Multiply the packed signed word integers in
PMULLW mm, mm/m64 mm1 register and mm2/m64, and store the low
16 bits of the results in mm1.
66 0F D5 /r A V/V SSE2 Multiply the packed signed word integers in
PMULLW xmm1, xmm2/m128 xmm1 and xmm2/m128, and store the low 16
bits of the results in xmm1.
VEX.128.66.0F.WIG D5 /r B V/V AVX Multiply the packed signed word integers in
VPMULLW xmm1, xmm2, xmm3/m128 xmm2 and xmm3/m128, and store the low 16
bits of the results in xmm1.
VEX.256.66.0F.WIG D5 /r B V/V AVX2 Multiply the packed signed word integers in
VPMULLW ymm1, ymm2, ymm3/m256 ymm2 and ymm3/m256, and store the low 16
bits of the results in ymm1.
EVEX.128.66.0F.WIG D5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULLW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm2 and xmm3/m128, and store the low 16
xmm3/m128 AVX10.12 bits of the results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG D5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULLW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm2 and ymm3/m256, and store the low 16
ymm3/m256 AVX10.12 bits of the results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG D5 /r C V/V AVX512BW Multiply the packed signed word integers in
VPMULLW zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 zmm2 and zmm3/m512, and store the low 16
bits of the results in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the low 16 bits of each intermediate 32-bit result in the destina-
tion operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.

SRC X3 X2 X1 X0

DEST Y3 Y2 Y1 Y0

TEMP Z3 = X3 ∗ Y3 Z2 = X2 ∗ Y2 Z1 = X1 ∗ Y1 Z0 = X0 ∗ Y0

DEST Z3[15:0] Z2[15:0] Z1[15:0] Z0[15:0]

Figure 4-13. PMULLW Instruction Operation Using 64-bit Operands

Operation
PMULLW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[15:0];
DEST[31:16] := TEMP1[15:0];
DEST[47:32] := TEMP2[15:0];
DEST[63:48] := TEMP3[15:0];

PMULLW (With 128-bit Operands)


TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
TEMP4[31:0] := DEST[79:64] ∗ SRC[79:64];
TEMP5[31:0] := DEST[95:80] ∗ SRC[95:80];
TEMP6[31:0] := DEST[111:96] ∗ SRC[111:96];
TEMP7[31:0] := DEST[127:112] ∗ SRC[127:112];
DEST[15:0] := TEMP0[15:0];
DEST[31:16] := TEMP1[15:0];
DEST[47:32] := TEMP2[15:0];
DEST[63:48] := TEMP3[15:0];
DEST[79:64] := TEMP4[15:0];
DEST[95:80] := TEMP5[15:0];
DEST[111:96] := TEMP6[15:0];
DEST[127:112] := TEMP7[15:0];
DEST[MAXVL-1:128] (Unmodified)

VPMULLW (VEX.128 Encoded Version)
Temp0[31:0] := SRC1[15:0] * SRC2[15:0]
Temp1[31:0] := SRC1[31:16] * SRC2[31:16]
Temp2[31:0] := SRC1[47:32] * SRC2[47:32]
Temp3[31:0] := SRC1[63:48] * SRC2[63:48]
Temp4[31:0] := SRC1[79:64] * SRC2[79:64]
Temp5[31:0] := SRC1[95:80] * SRC2[95:80]
Temp6[31:0] := SRC1[111:96] * SRC2[111:96]
Temp7[31:0] := SRC1[127:112] * SRC2[127:112]
DEST[15:0] := Temp0[15:0]
DEST[31:16] := Temp1[15:0]
DEST[47:32] := Temp2[15:0]
DEST[63:48] := Temp3[15:0]
DEST[79:64] := Temp4[15:0]
DEST[95:80] := Temp5[15:0]
DEST[111:96] := Temp6[15:0]
DEST[127:112] := Temp7[15:0]
DEST[MAXVL-1:128] := 0

VPMULLW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
temp[31:0] := SRC1[i+15:i] * SRC2[i+15:i]
DEST[i+15:i] := temp[15:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPMULLW __m512i _mm512_mullo_epi16(__m512i a, __m512i b);
VPMULLW __m512i _mm512_mask_mullo_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMULLW __m512i _mm512_maskz_mullo_epi16( __mmask32 k, __m512i a, __m512i b);
VPMULLW __m256i _mm256_mask_mullo_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMULLW __m256i _mm256_maskz_mullo_epi16( __mmask16 k, __m256i a, __m256i b);
VPMULLW __m128i _mm_mask_mullo_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULLW __m128i _mm_maskz_mullo_epi16( __mmask8 k, __m128i a, __m128i b);
PMULLW __m64 _mm_mullo_pi16(__m64 m1, __m64 m2)
(V)PMULLW __m128i _mm_mullo_epi16 ( __m128i a, __m128i b)
VPMULLW __m256i _mm256_mullo_epi16 ( __m256i a, __m256i b);
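
A non-normative usage sketch, assuming SSE2 support and <emmintrin.h> (the helper name is illustrative): because PMULLW truncates to the low 16 bits of each product, the result is identical whether the operands are interpreted as signed or unsigned:

#include <emmintrin.h> // SSE2 intrinsics

// Multiply eight 16-bit lanes, keeping the low 16 bits of each product.
__m128i mul_lo_epi16(__m128i a, __m128i b)
{
    return _mm_mullo_epi16(a, b); // PMULLW
}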

Flags Affected
None.

SIMD Floating-Point Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PMULUDQ—Multiply Packed Unsigned Doubleword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F F4 /r (1) PMULUDQ mm1, mm2/m64 | A | V/V | SSE2 | Multiply unsigned doubleword integer in mm1 by unsigned doubleword integer in mm2/m64, and store the quadword result in mm1.
66 0F F4 /r PMULUDQ xmm1, xmm2/m128 | A | V/V | SSE2 | Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1.
VEX.128.66.0F.WIG F4 /r VPMULUDQ xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Multiply packed unsigned doubleword integers in xmm2 by packed unsigned doubleword integers in xmm3/m128, and store the quadword results in xmm1.
VEX.256.66.0F.WIG F4 /r VPMULUDQ ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Multiply packed unsigned doubleword integers in ymm2 by packed unsigned doubleword integers in ymm3/m256, and store the quadword results in ymm1.
EVEX.128.66.0F.W1 F4 /r VPMULUDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Multiply packed unsigned doubleword integers in xmm2 by packed unsigned doubleword integers in xmm3/m128/m64bcst, and store the quadword results in xmm1 under writemask k1.
EVEX.256.66.0F.W1 F4 /r VPMULUDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Multiply packed unsigned doubleword integers in ymm2 by packed unsigned doubleword integers in ymm3/m256/m64bcst, and store the quadword results in ymm1 under writemask k1.
EVEX.512.66.0F.W1 F4 /r VPMULUDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 (2) | Multiply packed unsigned doubleword integers in zmm2 by packed unsigned doubleword integers in zmm3/m512/m64bcst, and store the quadword results in zmm1 under writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the first operand (destination operand) by the second operand (source operand) and stores the result in
the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).



Legacy SSE version 64-bit operand: The source operand can be an unsigned doubleword integer stored in the low
doubleword of an MMX technology register or a 64-bit memory location. The destination operand can be an
unsigned doubleword integer stored in the low doubleword of an MMX technology register. The result is an unsigned
quadword integer stored in the destination MMX technology register. When a quadword result is too large to be
represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination
element (that is, the carry is ignored).
For 64-bit memory operands, 64 bits are fetched from memory, but only the low doubleword is used in the compu-
tation.
128-bit Legacy SSE version: The second source operand is two packed unsigned doubleword integers stored in the
first (low) and third doublewords of an XMM register or a 128-bit memory location. For 128-bit memory operands,
128 bits are fetched from memory, but only the first and third doublewords are used in the computation. The first
source operand is two packed unsigned doubleword integers stored in the first and third doublewords of an XMM
register. The destination contains two packed unsigned quadword integers stored in an XMM register. Bits (MAXVL-
1:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: The second source operand is two packed unsigned doubleword integers stored in the
first (low) and third doublewords of an XMM register or a 128-bit memory location. For 128-bit memory operands,
128 bits are fetched from memory, but only the first and third doublewords are used in the computation. The first
source operand is two packed unsigned doubleword integers stored in the first and third doublewords of an XMM
register. The destination contains two packed unsigned quadword integers stored in an XMM register. Bits (MAXVL-
1:128) of the destination YMM register are zeroed.
VEX.256 encoded version: The second source operand is four packed unsigned doubleword integers stored in the
first (low), third, fifth, and seventh doublewords of a YMM register or a 256-bit memory location. For 256-bit
memory operands, 256 bits are fetched from memory, but only the first, third, fifth, and seventh doublewords are
used in the computation. The first source operand is four packed unsigned doubleword integers stored in the first,
third, fifth, and seventh doublewords of an YMM register. The destination contains four packed unsigned quadword
integers stored in an YMM register.
EVEX encoded version: The input unsigned doubleword integers are taken from the even-numbered elements of
the source operands. The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location. The destination is a ZMM/YMM/XMM register, and updated according to the writemask at 64-
bit granularity.
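
The following example is a minimal usage sketch and is not part of the original manual text; it uses the _mm_mul_epu32 intrinsic listed in the intrinsic equivalents below to show that only the first and third doublewords participate in the multiply:

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Doublewords 1 and 3 (here 0xCAFEBABE/0xDEADBEEF and 0x87654321/0x12345678) are ignored. */
    __m128i a = _mm_set_epi32((int)0xDEADBEEF, 3, (int)0xCAFEBABE, (int)0xFFFFFFFF);
    __m128i b = _mm_set_epi32((int)0x12345678, 5, (int)0x87654321, 2);
    __m128i p = _mm_mul_epu32(a, b);   /* PMULUDQ */
    uint64_t r[2];
    _mm_storeu_si128((__m128i *)r, p);
    printf("%llu %llu\n", (unsigned long long)r[0], (unsigned long long)r[1]);
    /* prints: 8589934590 15 (0xFFFFFFFF*2 and 3*5) */
    return 0;
}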

Operation
PMULUDQ (With 64-Bit Operands)
DEST[63:0] := DEST[31:0] ∗ SRC[31:0];

PMULUDQ (With 128-Bit Operands)


DEST[63:0] := DEST[31:0] ∗ SRC[31:0];
DEST[127:64] := DEST[95:64] ∗ SRC[95:64];

VPMULUDQ (VEX.128 Encoded Version)


DEST[63:0] := SRC1[31:0] * SRC2[31:0]
DEST[127:64] := SRC1[95:64] * SRC2[95:64]
DEST[MAXVL-1:128] := 0

VPMULUDQ (VEX.256 Encoded Version)


DEST[63:0] := SRC1[31:0] * SRC2[31:0]
DEST[127:64] := SRC1[95:64] * SRC2[95:64]
DEST[191:128] := SRC1[159:128] * SRC2[159:128]
DEST[255:192] := SRC1[223:192] * SRC2[223:192]
DEST[MAXVL-1:256] := 0



VPMULUDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := ZeroExtend64( SRC1[i+31:i]) * ZeroExtend64( SRC2[31:0] )
ELSE DEST[i+63:i] := ZeroExtend64( SRC1[i+31:i]) * ZeroExtend64( SRC2[i+31:i] )
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPMULUDQ __m512i _mm512_mul_epu32(__m512i a, __m512i b);
VPMULUDQ __m512i _mm512_mask_mul_epu32(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMULUDQ __m512i _mm512_maskz_mul_epu32( __mmask8 k, __m512i a, __m512i b);
VPMULUDQ __m256i _mm256_mask_mul_epu32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULUDQ __m256i _mm256_maskz_mul_epu32( __mmask8 k, __m256i a, __m256i b);
VPMULUDQ __m128i _mm_mask_mul_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULUDQ __m128i _mm_maskz_mul_epu32( __mmask8 k, __m128i a, __m128i b);
PMULUDQ __m64 _mm_mul_su32 (__m64 a, __m64 b)
(V)PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
VPMULUDQ __m256i _mm256_mul_epu32( __m256i a, __m256i b);

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



POR—Bitwise Logical OR
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F EB /r (1) POR mm, mm/m64 | A | V/V | MMX | Bitwise OR of mm/m64 and mm.
66 0F EB /r POR xmm1, xmm2/m128 | A | V/V | SSE2 | Bitwise OR of xmm2/m128 and xmm1.
VEX.128.66.0F.WIG EB /r VPOR xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Bitwise OR of xmm2 and xmm3/m128.
VEX.256.66.0F.WIG EB /r VPOR ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Bitwise OR of ymm2 and ymm3/m256.
EVEX.128.66.0F.W0 EB /r VPORD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Bitwise OR of packed doubleword integers in xmm2 and xmm3/m128/m32bcst using writemask k1.
EVEX.256.66.0F.W0 EB /r VPORD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Bitwise OR of packed doubleword integers in ymm2 and ymm3/m256/m32bcst using writemask k1.
EVEX.512.66.0F.W0 EB /r VPORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F OR AVX10.1 (2) | Bitwise OR of packed doubleword integers in zmm2 and zmm3/m512/m32bcst using writemask k1.
EVEX.128.66.0F.W1 EB /r VPORQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Bitwise OR of packed quadword integers in xmm2 and xmm3/m128/m64bcst using writemask k1.
EVEX.256.66.0F.W1 EB /r VPORQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (2) | Bitwise OR of packed quadword integers in ymm2 and ymm3/m256/m64bcst using writemask k1.
EVEX.512.66.0F.W1 EB /r VPORQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 (2) | Bitwise OR of packed quadword integers in zmm2 and zmm3/m512/m64bcst using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical OR operation on the source operand (second operand) and the destination operand (first
operand) and stores the result in the destination operand. Each bit of the result is set to 1 if either or both of the
corresponding bits of the first and second operands are 1; otherwise, it is set to 0.



In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand is an MMX technology register.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source and destination operands can be XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination
register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source and destination operands can be XMM registers. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: The second source operand is an YMM register or a 256-bit memory location. The first
source and destination operands can be YMM registers.
EVEX encoded version: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1 at 32/64-bit granularity.
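
The following example is a minimal usage sketch and is not part of the original manual text; it applies the _mm_or_si128 intrinsic listed in the intrinsic equivalents below to set the low nibble of every byte:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i data = _mm_setr_epi8(0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, (char)0x80,
                                 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08);
    __m128i mask = _mm_set1_epi8(0x0F);
    __m128i r = _mm_or_si128(data, mask);   /* POR */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; i++)
        printf("%02X ", out[i]);   /* 1F 2F 3F 4F 5F 6F 7F 8F then 0F for each remaining byte */
    printf("\n");
    return 0;
}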

Operation
POR (64-bit Operand)
DEST := DEST OR SRC

POR (128-bit Legacy SSE Version)


DEST := DEST OR SRC
DEST[MAXVL-1:128] (Unmodified)

VPOR (VEX.128 Encoded Version)


DEST := SRC1 OR SRC2
DEST[MAXVL-1:128] := 0

VPOR (VEX.256 Encoded Version)


DEST := SRC1 OR SRC2
DEST[MAXVL-1:256] := 0

VPORD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] BITWISE OR SRC2[31:0]
ELSE DEST[i+31:i] := SRC1[i+31:i] BITWISE OR SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
*DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPORD __m512i _mm512_or_epi32(__m512i a, __m512i b);
VPORD __m512i _mm512_mask_or_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPORD __m512i _mm512_maskz_or_epi32( __mmask16 k, __m512i a, __m512i b);
VPORD __m256i _mm256_or_epi32(__m256i a, __m256i b);
VPORD __m256i _mm256_mask_or_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPORD __m256i _mm256_maskz_or_epi32( __mmask8 k, __m256i a, __m256i b);
VPORD __m128i _mm_or_epi32(__m128i a, __m128i b);
VPORD __m128i _mm_mask_or_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPORD __m128i _mm_maskz_or_epi32( __mmask8 k, __m128i a, __m128i b);
VPORQ __m512i _mm512_or_epi64(__m512i a, __m512i b);
VPORQ __m512i _mm512_mask_or_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPORQ __m512i _mm512_maskz_or_epi64(__mmask8 k, __m512i a, __m512i b);
VPORQ __m256i _mm256_or_epi64(__m256i a, __m256i b);
VPORQ __m256i _mm256_mask_or_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPORQ __m256i _mm256_maskz_or_epi64( __mmask8 k, __m256i a, __m256i b);
VPORQ __m128i _mm_or_epi64(__m128i a, __m128i b);
VPORQ __m128i _mm_mask_or_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPORQ __m128i _mm_maskz_or_epi64( __mmask8 k, __m128i a, __m128i b);
POR __m64 _mm_or_si64(__m64 m1, __m64 m2)
(V)POR __m128i _mm_or_si128(__m128i m1, __m128i m2)
VPOR __m256i _mm256_or_si256 ( __m256i a, __m256i b)

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



PSADBW—Compute Sum of Absolute Differences
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F F6 /r (1) PSADBW mm1, mm2/m64 | A | V/V | SSE | Computes the absolute differences of the packed unsigned byte integers from mm2/m64 and mm1; differences are then summed to produce an unsigned word integer result.
66 0F F6 /r PSADBW xmm1, xmm2/m128 | A | V/V | SSE2 | Computes the absolute differences of the packed unsigned byte integers from xmm2/m128 and xmm1; the 8 low differences and 8 high differences are then summed separately to produce two unsigned word integer results.
VEX.128.66.0F.WIG F6 /r VPSADBW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Computes the absolute differences of the packed unsigned byte integers from xmm3/m128 and xmm2; the 8 low differences and 8 high differences are then summed separately to produce two unsigned word integer results.
VEX.256.66.0F.WIG F6 /r VPSADBW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Computes the absolute differences of the packed unsigned byte integers from ymm3/m256 and ymm2; then each consecutive 8 differences are summed separately to produce four unsigned word integer results.
EVEX.128.66.0F.WIG F6 /r VPSADBW xmm1, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (2) | Computes the absolute differences of the packed unsigned byte integers from xmm3/m128 and xmm2; then each consecutive 8 differences are summed separately to produce two unsigned word integer results.
EVEX.256.66.0F.WIG F6 /r VPSADBW ymm1, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (2) | Computes the absolute differences of the packed unsigned byte integers from ymm3/m256 and ymm2; then each consecutive 8 differences are summed separately to produce four unsigned word integer results.
EVEX.512.66.0F.WIG F6 /r VPSADBW zmm1, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (2) | Computes the absolute differences of the packed unsigned byte integers from zmm3/m512 and zmm2; then each consecutive 8 differences are summed separately to produce eight unsigned word integer results.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A



Description
Computes the absolute value of the difference of 8 unsigned byte integers from the source operand (second
operand) and from the destination operand (first operand). These 8 differences are then summed to produce an
unsigned word integer result that is stored in the destination operand. Figure 4-14 shows the operation of the
PSADBW instruction when using 64-bit operands.
When operating on 64-bit operands, the word integer result is stored in the low word of the destination operand,
and the remaining bytes in the destination operand are cleared to all 0s.
When operating on 128-bit operands, two packed results are computed. Here, the 8 low-order bytes of the source
and destination operands are operated on to produce a word result that is stored in the low word of the destination
operand, and the 8 high-order bytes are operated on to produce a word result that is stored in bits 64 through 79
of the destination operand. The remaining bytes of the destination operand are cleared.
For 256-bit version, the third group of 8 differences are summed to produce an unsigned word in bits[143:128] of
the destination register and the fourth group of 8 differences are summed to produce an unsigned word in
bits[207:192] of the destination register. The remaining words of the destination are set to 0.
For the 512-bit version, the fifth group result is stored in bits [271:256] of the destination. The result from the sixth
group is stored in bits [335:320]. The results for the seventh and eighth groups are stored in bits [399:384] and
bits [463:448], respectively. The remaining bits in the destination are set to 0.
In 64-bit mode and not encoded by VEX/EVEX prefix, using a REX prefix in the form of REX.R permits this instruc-
tion to access additional registers (XMM8-XMM15).
Legacy SSE version: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand is an MMX technology register.
128-bit Legacy SSE version: The first source operand and destination register are XMM registers. The second
source operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding ZMM
destination register remain unchanged.
VEX.128 and EVEX.128 encoded versions: The first source operand and destination register are XMM registers. The
second source operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding
ZMM register are zeroed.
VEX.256 and EVEX.256 encoded versions: The first source operand and destination register are YMM registers. The
second source operand is an YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding
ZMM register are zeroed.
EVEX.512 encoded version: The first source operand and destination register are ZMM registers. The second
source operand is a ZMM register or a 512-bit memory location.

[Figure: SRC holds bytes X7..X0 and DEST holds bytes Y7..Y0; TEMP holds ABS(X7:Y7) through ABS(X0:Y0); the final DEST holds SUM(TEMP7...TEMP0) in its low word and 00H in each of the six remaining bytes.]

Figure 4-14. PSADBW Instruction Operation Using 64-bit Operands
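
The following example is a minimal usage sketch and is not part of the original manual text; it computes the two word results of the 128-bit form with the _mm_sad_epu8 intrinsic listed in the intrinsic equivalents below:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi8(10, 20, 30, 40, 50, 60, 70, 80,
                              1, 2, 3, 4, 5, 6, 7, 8);
    __m128i b = _mm_setr_epi8(11, 18, 33, 40, 45, 66, 70, 90,
                              8, 7, 6, 5, 4, 3, 2, 1);
    __m128i sad = _mm_sad_epu8(a, b);       /* PSADBW */
    int low  = _mm_cvtsi128_si32(sad);      /* sum of the 8 low absolute differences (bits 15:0) */
    int high = _mm_extract_epi16(sad, 4);   /* sum of the 8 high absolute differences (bits 79:64) */
    printf("low SAD = %d, high SAD = %d\n", low, high);   /* prints 27 and 32 */
    return 0;
}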



Operation
VPSADBW (EVEX Encoded Versions)
VL = 128, 256, 512
TEMP0 := ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 15 *)
TEMP15 := ABS(SRC1[127:120] - SRC2[127:120])
DEST[15:0] := SUM(TEMP0:TEMP7)
DEST[63:16] := 000000000000H
DEST[79:64] := SUM(TEMP8:TEMP15)
DEST[127:80] := 000000000000H

IF VL >= 256
(* Repeat operation for bytes 16 through 31*)
TEMP31 := ABS(SRC1[255:248] - SRC2[255:248])
DEST[143:128] := SUM(TEMP16:TEMP23)
DEST[191:144] := 000000000000H
DEST[207:192] := SUM(TEMP24:TEMP31)
DEST[255:208] := 000000000000H
FI;
IF VL >= 512
(* Repeat operation for bytes 32 through 63*)
TEMP63 := ABS(SRC1[511:504] - SRC2[511:504])
DEST[271:256] := SUM(TEMP32:TEMP39)
DEST[319:272] := 000000000000H
DEST[335:320] := SUM(TEMP40:TEMP47)
DEST[383:336] := 000000000000H
DEST[399:384] := SUM(TEMP48:TEMP55)
DEST[447:400] := 000000000000H
DEST[463:448] := SUM(TEMP56:TEMP63)
DEST[511:464] := 000000000000H
FI;
DEST[MAXVL-1:VL] := 0

VPSADBW (VEX.256 Encoded Version)


TEMP0 := ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 30 *)
TEMP31 := ABS(SRC1[255:248] - SRC2[255:248])
DEST[15:0] := SUM(TEMP0:TEMP7)
DEST[63:16] := 000000000000H
DEST[79:64] := SUM(TEMP8:TEMP15)
DEST[127:80] := 000000000000H
DEST[143:128] := SUM(TEMP16:TEMP23)
DEST[191:144] := 000000000000H
DEST[207:192] := SUM(TEMP24:TEMP31)
DEST[255:208] := 000000000000H
DEST[MAXVL-1:256] := 0



VPSADBW (VEX.128 Encoded Version)
TEMP0 := ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 := ABS(SRC1[127:120] - SRC2[127:120])
DEST[15:0] := SUM(TEMP0:TEMP7)
DEST[63:16] := 000000000000H
DEST[79:64] := SUM(TEMP8:TEMP15)
DEST[127:80] := 000000000000H
DEST[MAXVL-1:128] := 0

PSADBW (128-bit Legacy SSE Version)


TEMP0 := ABS(DEST[7:0] - SRC[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 := ABS(DEST[127:120] - SRC[127:120])
DEST[15:0] := SUM(TEMP0:TEMP7)
DEST[63:16] := 000000000000H
DEST[79:64] := SUM(TEMP8:TEMP15)
DEST[127:80] := 000000000000H
DEST[MAXVL-1:128] (Unmodified)

PSADBW (64-bit Operand)


TEMP0 := ABS(DEST[7:0] - SRC[7:0])
(* Repeat operation for bytes 1 through 6 *)
TEMP7 := ABS(DEST[63:56] - SRC[63:56])
DEST[15:0] := SUM(TEMP0:TEMP7)
DEST[63:16] := 000000000000H

Intel C/C++ Compiler Intrinsic Equivalent


VPSADBW __m512i _mm512_sad_epu8( __m512i a, __m512i b)
PSADBW __m64 _mm_sad_pu8(__m64 a,__m64 b)
(V)PSADBW __m128i _mm_sad_epu8(__m128i a, __m128i b)
VPSADBW __m256i _mm256_sad_epu8( __m256i a, __m256i b)

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



PSHUFB—Packed Shuffle Bytes
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 38 00 /r (1) PSHUFB mm1, mm2/m64 | A | V/V | SSSE3 | Shuffle bytes in mm1 according to contents of mm2/m64.
66 0F 38 00 /r PSHUFB xmm1, xmm2/m128 | A | V/V | SSSE3 | Shuffle bytes in xmm1 according to contents of xmm2/m128.
VEX.128.66.0F38.WIG 00 /r VPSHUFB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Shuffle bytes in xmm2 according to contents of xmm3/m128.
VEX.256.66.0F38.WIG 00 /r VPSHUFB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Shuffle bytes in ymm2 according to contents of ymm3/m256.
EVEX.128.66.0F38.WIG 00 /r VPSHUFB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (2) | Shuffle bytes in xmm2 according to contents of xmm3/m128 under write mask k1.
EVEX.256.66.0F38.WIG 00 /r VPSHUFB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (2) | Shuffle bytes in ymm2 according to contents of ymm3/m256 under write mask k1.
EVEX.512.66.0F38.WIG 00 /r VPSHUFB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (2) | Shuffle bytes in zmm2 according to contents of zmm3/m512 under write mask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
PSHUFB performs in-place shuffles of bytes in the destination operand (the first operand) according to the shuffle
control mask in the source operand (the second operand). The instruction permutes the data in the destination
operand, leaving the shuffle mask unaffected. If the most significant bit (bit[7]) of each byte of the shuffle control
mask is set, then constant zero is written in the result byte. Each byte in the shuffle control mask forms an index
to permute the corresponding byte in the destination operand. The value of each index is the least significant 4 bits
(128-bit operation) or 3 bits (64-bit operation) of the shuffle control byte. When the source operand is a 128-bit
memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will
be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers.
Legacy SSE version 64-bit operand: Both operands can be MMX registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:128) of the corresponding YMM destination register remain unchanged.



VEX.128 encoded version: The destination operand is the first operand, the first source operand is the second
operand, the second source operand is the third operand. Bits (MAXVL-1:128) of the destination YMM register are
zeroed.
VEX.256 encoded version: Bits (255:128) of the destination YMM register store the 16-byte shuffle result of the
upper 16 bytes of the first source operand, using the upper 16 bytes of the second source operand as control
mask. The value of each index for the high 128-bit lane is the least significant 4 bits of the respective shuffle
control byte. The index value selects a source data element within each 128-bit lane.
EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operand are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.
EVEX and VEX encoded version: Four/two in-lane 128-bit shuffles.
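
The following example is a minimal usage sketch and is not part of the original manual text; it reverses the 16 bytes of a register with the _mm_shuffle_epi8 intrinsic listed in the intrinsic equivalents below, and uses one control byte with bit 7 set to zero a result byte:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v    = _mm_setr_epi8('A','B','C','D','E','F','G','H',
                                 'I','J','K','L','M','N','O','P');
    /* The low 4 bits of each control byte select a source byte; bit 7 forces a zero. */
    __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                 7, 6, 5, 4, 3, 2, 1, (char)0x80);
    __m128i r = _mm_shuffle_epi8(v, ctrl);   /* PSHUFB */
    char out[17] = {0};
    _mm_storeu_si128((__m128i *)out, r);
    printf("%s\n", out);   /* prints PONMLKJIHGFEDCB; the last byte is zeroed */
    return 0;
}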

Operation
PSHUFB (With 64-bit Operands)
TEMP := DEST
for i = 0 to 7 {
if (SRC[(i * 8)+7] = 1 ) then
DEST[(i*8)+7...(i*8)+0] := 0;
else
index[2..0] := SRC[(i*8)+2 .. (i*8)+0];
DEST[(i*8)+7...(i*8)+0] := TEMP[(index*8+7)..(index*8+0)];
endif;
}
PSHUFB (With 128-bit Operands)
TEMP := DEST
for i = 0 to 15 {
if (SRC[(i * 8)+7] = 1 ) then
DEST[(i*8)+7..(i*8)+0] := 0;
else
index[3..0] := SRC[(i*8)+3 .. (i*8)+0];
DEST[(i*8)+7..(i*8)+0] := TEMP[(index*8+7)..(index*8+0)];
endif
}

VPSHUFB (VEX.128 Encoded Version)


for i = 0 to 15 {
if (SRC2[(i * 8)+7] = 1) then
DEST[(i*8)+7..(i*8)+0] := 0;
else
index[3..0] := SRC2[(i*8)+3 .. (i*8)+0];
DEST[(i*8)+7..(i*8)+0] := SRC1[(index*8+7)..(index*8+0)];
endif
}
DEST[MAXVL-1:128] := 0

VPSHUFB (VEX.256 Encoded Version)


for i = 0 to 15 {
if (SRC2[(i * 8)+7] == 1 ) then
DEST[(i*8)+7..(i*8)+0] := 0;
else
index[3..0] := SRC2[(i*8)+3 .. (i*8)+0];
DEST[(i*8)+7..(i*8)+0] := SRC1[(index*8+7)..(index*8+0)];
endif
if (SRC2[128 + (i * 8)+7] == 1 ) then
DEST[128 + (i*8)+7..128 + (i*8)+0] := 0;
else
index[3..0] := SRC2[128 + (i*8)+3 .. 128 + (i*8)+0];
DEST[128 + (i*8)+7..128 + (i*8)+0] := SRC1[128 + (index*8+7)..128 + (index*8+0)];
endif
}

VPSHUFB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
jmask := (KL-1) & ~0xF // 0x00, 0x10, 0x30 depending on the VL
FOR j = 0 TO KL-1 // dest
IF k1[ j ] or no_masking
index := SRC2.byte[ j ];
IF index & 0x80
DEST.byte[ j ] := 0;
ELSE
index := (index & 0xF) + (j & jmask); // 16-element in-lane lookup
DEST.byte[ j ] := SRC1.byte[ index ];
FI
ELSE IF zeroing
DEST.byte[ j ] := 0;
FI
ENDFOR
DEST[MAXVL-1:VL] := 0;

[Figure: shuffle control MM2 = 07H 07H FFH 80H 01H 00H 00H 00H applied to source MM1 = 04H 01H 07H 03H 02H 02H FFH 01H produces MM1 = 04H 04H 00H 00H FFH 01H 01H 01H.]

Figure 4-15. PSHUFB with 64-Bit Operands

Intel C/C++ Compiler Intrinsic Equivalent


VPSHUFB __m512i _mm512_shuffle_epi8(__m512i a, __m512i b);
VPSHUFB __m512i _mm512_mask_shuffle_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPSHUFB __m512i _mm512_maskz_shuffle_epi8( __mmask64 k, __m512i a, __m512i b);
VPSHUFB __m256i _mm256_mask_shuffle_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPSHUFB __m256i _mm256_maskz_shuffle_epi8( __mmask32 k, __m256i a, __m256i b);
VPSHUFB __m128i _mm_mask_shuffle_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPSHUFB __m128i _mm_maskz_shuffle_epi8( __mmask16 k, __m128i a, __m128i b);
PSHUFB: __m64 _mm_shuffle_pi8 (__m64 a, __m64 b)
(V)PSHUFB: __m128i _mm_shuffle_epi8 (__m128i a, __m128i b)
VPSHUFB:__m256i _mm256_shuffle_epi8(__m256i a, __m256i b)



SIMD Floating-Point Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



PSHUFD—Shuffle Packed Doublewords
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 70 /r ib PSHUFD xmm1, xmm2/m128, imm8 | A | V/V | SSE2 | Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.128.66.0F.WIG 70 /r ib VPSHUFD xmm1, xmm2/m128, imm8 | A | V/V | AVX | Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.256.66.0F.WIG 70 /r ib VPSHUFD ymm1, ymm2/m256, imm8 | A | V/V | AVX2 | Shuffle the doublewords in ymm2/m256 based on the encoding in imm8 and store the result in ymm1.
EVEX.128.66.0F.W0 70 /r ib VPSHUFD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Shuffle the doublewords in xmm2/m128/m32bcst based on the encoding in imm8 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F.W0 70 /r ib VPSHUFD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Shuffle the doublewords in ymm2/m256/m32bcst based on the encoding in imm8 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F.W0 70 /r ib VPSHUFD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | B | V/V | AVX512F OR AVX10.1 (1) | Shuffle the doublewords in zmm2/m512/m32bcst based on the encoding in imm8 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Copies doublewords from source operand (second operand) and inserts them in the destination operand (first
operand) at the locations selected with the order operand (third operand). Figure 4-16 shows the operation of the
256-bit VPSHUFD instruction and the encoding of the order operand. Each 2-bit field in the order operand selects
the contents of one doubleword location within a 128-bit lane and copies it to the target element in the destination
operand. For example, bits 0 and 1 of the order operand target the first doubleword element in the low and high
128-bit lanes of the destination operand for 256-bit VPSHUFD. The encoded value of bits 1:0 of the order operand
(see the field encoding in Figure 4-16) determines which doubleword element (from the respective 128-bit lane) of
the source operand will be copied to doubleword 0 of the destination operand.
For 128-bit operation, only the low 128-bit lane is operative. The source operand can be an XMM register or a
128-bit memory location. The destination operand is an XMM register. The order operand is an 8-bit immediate.
Note that this instruction permits a doubleword in the source operand to be copied to more than one doubleword
location in the destination operand.



[Figure: SRC holds doublewords X7..X0 and DEST holds Y7..Y0. Each 2-bit field of the ORDER operand selects one source doubleword per 128-bit lane: in the low lane 00B selects X0, 01B selects X1, 10B selects X2, 11B selects X3; in the high lane 00B selects X4, 01B selects X5, 10B selects X6, 11B selects X7.]

Figure 4-16. 256-bit VPSHUFD Instruction Operation

In 64-bit mode and not encoded in VEX/EVEX, using REX.R permits this instruction to access XMM8-XMM15.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
VEX.128 encoded version: The source operand can be an XMM register or a 128-bit memory location. The destina-
tion operand is an XMM register. Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
VEX.256 encoded version: The source operand can be a YMM register or a 256-bit memory location. The destination
operand is a YMM register. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed. Bits (255:128) of the
destination store the shuffled results of the upper 16 bytes of the source operand using the immediate byte as the
order operand.
EVEX encoded version: The source operand can be an ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
Each 128-bit lane of the destination stores the shuffled results of the respective lane of the source operand using
the immediate byte as the order operand.
Note: EVEX.vvvv and VEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
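
The following example is a minimal usage sketch and is not part of the original manual text; it uses the _mm_shuffle_epi32 intrinsic listed in the intrinsic equivalents below, with the standard _MM_SHUFFLE helper macro packing the four 2-bit selectors into the order byte:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi32(10, 20, 30, 40);
    __m128i rev = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));   /* selectors 3,2,1,0 for dwords 0..3: reverse */
    __m128i bc  = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));   /* broadcast doubleword 2 */
    int r[4], s[4];
    _mm_storeu_si128((__m128i *)r, rev);
    _mm_storeu_si128((__m128i *)s, bc);
    printf("%d %d %d %d | %d %d %d %d\n",
           r[0], r[1], r[2], r[3], s[0], s[1], s[2], s[3]);   /* 40 30 20 10 | 30 30 30 30 */
    return 0;
}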

Operation
PSHUFD (128-bit Legacy SSE Version)
DEST[31:0] := (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] := (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] := (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] := (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[MAXVL-1:128] (Unmodified)

VPSHUFD (VEX.128 Encoded Version)


DEST[31:0] := (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] := (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] := (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] := (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[MAXVL-1:128] := 0



VPSHUFD (VEX.256 Encoded Version)
DEST[31:0] := (SRC[127:0] >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] := (SRC[127:0] >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] := (SRC[127:0] >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] := (SRC[127:0] >> (ORDER[7:6] * 32))[31:0];
DEST[159:128] := (SRC[255:128] >> (ORDER[1:0] * 32))[31:0];
DEST[191:160] := (SRC[255:128] >> (ORDER[3:2] * 32))[31:0];
DEST[223:192] := (SRC[255:128] >> (ORDER[5:4] * 32))[31:0];
DEST[255:224] := (SRC[255:128] >> (ORDER[7:6] * 32))[31:0];
DEST[MAXVL-1:256] := 0

VPSHUFD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN TMP_SRC[i+31:i] := SRC[31:0]
ELSE TMP_SRC[i+31:i] := SRC[i+31:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[31:0] := (TMP_SRC[127:0] >> (ORDER[1:0] * 32))[31:0];
TMP_DEST[63:32] := (TMP_SRC[127:0] >> (ORDER[3:2] * 32))[31:0];
TMP_DEST[95:64] := (TMP_SRC[127:0] >> (ORDER[5:4] * 32))[31:0];
TMP_DEST[127:96] := (TMP_SRC[127:0] >> (ORDER[7:6] * 32))[31:0];
FI;
IF VL >= 256
TMP_DEST[159:128] := (TMP_SRC[255:128] >> (ORDER[1:0] * 32))[31:0];
TMP_DEST[191:160] := (TMP_SRC[255:128] >> (ORDER[3:2] * 32))[31:0];
TMP_DEST[223:192] := (TMP_SRC[255:128] >> (ORDER[5:4] * 32))[31:0];
TMP_DEST[255:224] := (TMP_SRC[255:128] >> (ORDER[7:6] * 32))[31:0];
FI;
IF VL >= 512
TMP_DEST[287:256] := (TMP_SRC[383:256] >> (ORDER[1:0] * 32))[31:0];
TMP_DEST[319:288] := (TMP_SRC[383:256] >> (ORDER[3:2] * 32))[31:0];
TMP_DEST[351:320] := (TMP_SRC[383:256] >> (ORDER[5:4] * 32))[31:0];
TMP_DEST[383:352] := (TMP_SRC[383:256] >> (ORDER[7:6] * 32))[31:0];
TMP_DEST[415:384] := (TMP_SRC[511:384] >> (ORDER[1:0] * 32))[31:0];
TMP_DEST[447:416] := (TMP_SRC[511:384] >> (ORDER[3:2] * 32))[31:0];
TMP_DEST[479:448] := (TMP_SRC[511:384] >> (ORDER[5:4] * 32))[31:0];
TMP_DEST[511:480] := (TMP_SRC[511:384] >> (ORDER[7:6] * 32))[31:0];
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR



DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPSHUFD __m512i _mm512_shuffle_epi32(__m512i a, int n );
VPSHUFD __m512i _mm512_mask_shuffle_epi32(__m512i s, __mmask16 k, __m512i a, int n );
VPSHUFD __m512i _mm512_maskz_shuffle_epi32( __mmask16 k, __m512i a, int n );
VPSHUFD __m256i _mm256_mask_shuffle_epi32(__m256i s, __mmask8 k, __m256i a, int n );
VPSHUFD __m256i _mm256_maskz_shuffle_epi32( __mmask8 k, __m256i a, int n );
VPSHUFD __m128i _mm_mask_shuffle_epi32(__m128i s, __mmask8 k, __m128i a, int n );
VPSHUFD __m128i _mm_maskz_shuffle_epi32( __mmask8 k, __m128i a, int n );
(V)PSHUFD __m128i _mm_shuffle_epi32(__m128i a, int n)
VPSHUFD __m256i _mm256_shuffle_epi32(__m256i a, const int n)

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv ≠ 1111B or EVEX.vvvv ≠ 1111B.



PSHUFHW—Shuffle Packed High Words
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 70 /r ib PSHUFHW xmm1, xmm2/m128, imm8 | A | V/V | SSE2 | Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.128.F3.0F.WIG 70 /r ib VPSHUFHW xmm1, xmm2/m128, imm8 | A | V/V | AVX | Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.256.F3.0F.WIG 70 /r ib VPSHUFHW ymm1, ymm2/m256, imm8 | A | V/V | AVX2 | Shuffle the high words in ymm2/m256 based on the encoding in imm8 and store the result in ymm1.
EVEX.128.F3.0F.WIG 70 /r ib VPSHUFHW xmm1 {k1}{z}, xmm2/m128, imm8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1 under write mask k1.
EVEX.256.F3.0F.WIG 70 /r ib VPSHUFHW ymm1 {k1}{z}, ymm2/m256, imm8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shuffle the high words in ymm2/m256 based on the encoding in imm8 and store the result in ymm1 under write mask k1.
EVEX.512.F3.0F.WIG 70 /r ib VPSHUFHW zmm1 {k1}{z}, zmm2/m512, imm8 | B | V/V | AVX512BW OR AVX10.1 (1) | Shuffle the high words in zmm2/m512 based on the encoding in imm8 and store the result in zmm1 under write mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B Full Mem ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Copies words from the high quadword of a 128-bit lane of the source operand and inserts them in the high quad-
word of the destination operand at word locations (of the respective lane) selected with the immediate operand.
This 256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is illus-
trated in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate
operand selects the contents of one word location in the high quadword of the destination operand. The binary
encodings of the immediate operand fields select words (0, 1, 2, or 3) from the high quadword of the source
operand to be copied to the destination operand. The low quadword of the source operand is copied to the low
quadword of the destination operand, for each 128-bit lane.
Note that this instruction permits a word in the high quadword of the source operand to be copied to more than one
word location in the high quadword of the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.



VEX.128 encoded version: The destination operand is an XMM register. The source operand can be an XMM register
or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are zeroed. VEX.vvvv is
reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The destination operand is an YMM register. The source operand can be an YMM register
or a 256-bit memory location.
EVEX encoded version: The destination operand is a ZMM/YMM/XMM registers. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is updated according to the write-
mask.
Note: In VEX encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
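
The following example is a minimal usage sketch and is not part of the original manual text; it reorders the high quadword with the _mm_shufflehi_epi16 intrinsic listed in the intrinsic equivalents below while the low quadword passes through:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    __m128i r = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));   /* reverse words 4..7 */
    short w[8];
    _mm_storeu_si128((__m128i *)w, r);
    for (int i = 0; i < 8; i++)
        printf("%d ", w[i]);   /* 0 1 2 3 7 6 5 4 */
    printf("\n");
    return 0;
}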

Operation
PSHUFHW (128-bit Legacy SSE Version)
DEST[63:0] := SRC[63:0]
DEST[79:64] := (SRC >> (imm[1:0] *16))[79:64]
DEST[95:80] := (SRC >> (imm[3:2] * 16))[79:64]
DEST[111:96] := (SRC >> (imm[5:4] * 16))[79:64]
DEST[127:112] := (SRC >> (imm[7:6] * 16))[79:64]
DEST[MAXVL-1:128] (Unmodified)

VPSHUFHW (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0]
DEST[79:64] := (SRC1 >> (imm[1:0] *16))[79:64]
DEST[95:80] := (SRC1 >> (imm[3:2] * 16))[79:64]
DEST[111:96] := (SRC1 >> (imm[5:4] * 16))[79:64]
DEST[127:112] := (SRC1 >> (imm[7:6] * 16))[79:64]
DEST[MAXVL-1:128] := 0

VPSHUFHW (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0]
DEST[79:64] := (SRC1 >> (imm[1:0] *16))[79:64]
DEST[95:80] := (SRC1 >> (imm[3:2] * 16))[79:64]
DEST[111:96] := (SRC1 >> (imm[5:4] * 16))[79:64]
DEST[127:112] := (SRC1 >> (imm[7:6] * 16))[79:64]
DEST[191:128] := SRC1[191:128]
DEST[207:192] := (SRC1 >> (imm[1:0] *16))[207:192]
DEST[223:208] := (SRC1 >> (imm[3:2] * 16))[207:192]
DEST[239:224] := (SRC1 >> (imm[5:4] * 16))[207:192]
DEST[255:240] := (SRC1 >> (imm[7:6] * 16))[207:192]
DEST[MAXVL-1:256] := 0

VPSHUFHW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL >= 128
TMP_DEST[63:0] := SRC1[63:0]
TMP_DEST[79:64] := (SRC1 >> (imm[1:0] *16))[79:64]
TMP_DEST[95:80] := (SRC1 >> (imm[3:2] * 16))[79:64]
TMP_DEST[111:96] := (SRC1 >> (imm[5:4] * 16))[79:64]
TMP_DEST[127:112] := (SRC1 >> (imm[7:6] * 16))[79:64]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[191:128]
TMP_DEST[207:192] := (SRC1 >> (imm[1:0] *16))[207:192]
TMP_DEST[223:208] := (SRC1 >> (imm[3:2] * 16))[207:192]



TMP_DEST[239:224] := (SRC1 >> (imm[5:4] * 16))[207:192]
TMP_DEST[255:240] := (SRC1 >> (imm[7:6] * 16))[207:192]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[319:256]
TMP_DEST[335:320] := (SRC1 >> (imm[1:0] *16))[335:320]
TMP_DEST[351:336] := (SRC1 >> (imm[3:2] * 16))[335:320]
TMP_DEST[367:352] := (SRC1 >> (imm[5:4] * 16))[335:320]
TMP_DEST[383:368] := (SRC1 >> (imm[7:6] * 16))[335:320]
TMP_DEST[447:384] := SRC1[447:384]
TMP_DEST[463:448] := (SRC1 >> (imm[1:0] *16))[463:448]
TMP_DEST[479:464] := (SRC1 >> (imm[3:2] * 16))[463:448]
TMP_DEST[495:480] := (SRC1 >> (imm[5:4] * 16))[463:448]
TMP_DEST[511:496] := (SRC1 >> (imm[7:6] * 16))[463:448]
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i];
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPSHUFHW __m512i _mm512_shufflehi_epi16(__m512i a, int n);
VPSHUFHW __m512i _mm512_mask_shufflehi_epi16(__m512i s, __mmask16 k, __m512i a, int n );
VPSHUFHW __m512i _mm512_maskz_shufflehi_epi16( __mmask16 k, __m512i a, int n );
VPSHUFHW __m256i _mm256_mask_shufflehi_epi16(__m256i s, __mmask8 k, __m256i a, int n );
VPSHUFHW __m256i _mm256_maskz_shufflehi_epi16( __mmask8 k, __m256i a, int n );
VPSHUFHW __m128i _mm_mask_shufflehi_epi16(__m128i s, __mmask8 k, __m128i a, int n );
VPSHUFHW __m128i _mm_maskz_shufflehi_epi16( __mmask8 k, __m128i a, int n );
(V)PSHUFHW __m128i _mm_shufflehi_epi16(__m128i a, int n)
VPSHUFHW __m256i _mm256_shufflehi_epi16(__m256i a, const int n)

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.



PSHUFLW—Shuffle Packed Low Words
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 70 /r ib PSHUFLW xmm1, xmm2/m128, imm8 | A | V/V | SSE2 | Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.128.F2.0F.WIG 70 /r ib VPSHUFLW xmm1, xmm2/m128, imm8 | A | V/V | AVX | Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.
VEX.256.F2.0F.WIG 70 /r ib VPSHUFLW ymm1, ymm2/m256, imm8 | A | V/V | AVX2 | Shuffle the low words in ymm2/m256 based on the encoding in imm8 and store the result in ymm1.
EVEX.128.F2.0F.WIG 70 /r ib VPSHUFLW xmm1 {k1}{z}, xmm2/m128, imm8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1 under write mask k1.
EVEX.256.F2.0F.WIG 70 /r ib VPSHUFLW ymm1 {k1}{z}, ymm2/m256, imm8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shuffle the low words in ymm2/m256 based on the encoding in imm8 and store the result in ymm1 under write mask k1.
EVEX.512.F2.0F.WIG 70 /r ib VPSHUFLW zmm1 {k1}{z}, zmm2/m512, imm8 | B | V/V | AVX512BW OR AVX10.1 (1) | Shuffle the low words in zmm2/m512 based on the encoding in imm8 and store the result in zmm1 under write mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B Full Mem ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Copies words from the low quadword of a 128-bit lane of the source operand and inserts them in the low quadword
of the destination operand at word locations (of the respective lane) selected with the immediate operand. The
256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is illustrated
in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate
operand selects the contents of one word location in the low quadword of the destination operand. The binary
encodings of the immediate operand fields select words (0, 1, 2 or 3) from the low quadword of the source operand
to be copied to the destination operand. The high quadword of the source operand is copied to the high quadword
of the destination operand, for each 128-bit lane.
Note that this instruction permits a word in the low quadword of the source operand to be copied to more than one
word location in the low quadword of the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
VEX.128 encoded version: The destination operand is an XMM register. The source operand can be an XMM register
or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are zeroed.



VEX.256 encoded version: The destination operand is an YMM register. The source operand can be an YMM register
or a 256-bit memory location.
EVEX encoded version: The destination operand is a ZMM/YMM/XMM registers. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is updated according to the write-
mask.
Note: In VEX encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
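
The following example is a minimal usage sketch and is not part of the original manual text; it reorders the low quadword with the _mm_shufflelo_epi16 intrinsic listed in the intrinsic equivalents below while the high quadword passes through:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    __m128i r = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));   /* reverse words 0..3 */
    short w[8];
    _mm_storeu_si128((__m128i *)w, r);
    for (int i = 0; i < 8; i++)
        printf("%d ", w[i]);   /* 3 2 1 0 4 5 6 7 */
    printf("\n");
    return 0;
}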

Operation
PSHUFLW (128-bit Legacy SSE Version)
DEST[15:0] := (SRC >> (imm[1:0] *16))[15:0]
DEST[31:16] := (SRC >> (imm[3:2] * 16))[15:0]
DEST[47:32] := (SRC >> (imm[5:4] * 16))[15:0]
DEST[63:48] := (SRC >> (imm[7:6] * 16))[15:0]
DEST[127:64] := SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

VPSHUFLW (VEX.128 Encoded Version)


DEST[15:0] := (SRC1 >> (imm[1:0] *16))[15:0]
DEST[31:16] := (SRC1 >> (imm[3:2] * 16))[15:0]
DEST[47:32] := (SRC1 >> (imm[5:4] * 16))[15:0]
DEST[63:48] := (SRC1 >> (imm[7:6] * 16))[15:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VPSHUFLW (VEX.256 Encoded Version)


DEST[15:0] := (SRC1 >> (imm[1:0] *16))[15:0]
DEST[31:16] := (SRC1 >> (imm[3:2] * 16))[15:0]
DEST[47:32] := (SRC1 >> (imm[5:4] * 16))[15:0]
DEST[63:48] := (SRC1 >> (imm[7:6] * 16))[15:0]
DEST[127:64] := SRC1[127:64]
DEST[143:128] := (SRC1 >> (imm[1:0] *16))[143:128]
DEST[159:144] := (SRC1 >> (imm[3:2] * 16))[143:128]
DEST[175:160] := (SRC1 >> (imm[5:4] * 16))[143:128]
DEST[191:176] := (SRC1 >> (imm[7:6] * 16))[143:128]
DEST[255:192] := SRC1[255:192]
DEST[MAXVL-1:256] := 0

VPSHUFLW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL >= 128
TMP_DEST[15:0] := (SRC1 >> (imm[1:0] *16))[15:0]
TMP_DEST[31:16] := (SRC1 >> (imm[3:2] * 16))[15:0]
TMP_DEST[47:32] := (SRC1 >> (imm[5:4] * 16))[15:0]
TMP_DEST[63:48] := (SRC1 >> (imm[7:6] * 16))[15:0]
TMP_DEST[127:64] := SRC1[127:64]
FI;
IF VL >= 256
TMP_DEST[143:128] := (SRC1 >> (imm[1:0] *16))[143:128]
TMP_DEST[159:144] := (SRC1 >> (imm[3:2] * 16))[143:128]
TMP_DEST[175:160] := (SRC1 >> (imm[5:4] * 16))[143:128]
TMP_DEST[191:176] := (SRC1 >> (imm[7:6] * 16))[143:128]
TMP_DEST[255:192] := SRC1[255:192]
FI;
IF VL >= 512



TMP_DEST[271:256] := (SRC1 >> (imm[1:0] *16))[271:256]
TMP_DEST[287:272] := (SRC1 >> (imm[3:2] * 16))[271:256]
TMP_DEST[303:288] := (SRC1 >> (imm[5:4] * 16))[271:256]
TMP_DEST[319:304] := (SRC1 >> (imm[7:6] * 16))[271:256]
TMP_DEST[383:320] := SRC1[383:320]
TMP_DEST[399:384] := (SRC1 >> (imm[1:0] *16))[399:384]
TMP_DEST[415:400] := (SRC1 >> (imm[3:2] * 16))[399:384]
TMP_DEST[431:416] := (SRC1 >> (imm[5:4] * 16))[399:384]
TMP_DEST[447:432] := (SRC1 >> (imm[7:6] * 16))[399:384]
TMP_DEST[511:448] := SRC1[511:448]
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i];
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPSHUFLW __m512i _mm512_shufflelo_epi16(__m512i a, int n);
VPSHUFLW __m512i _mm512_mask_shufflelo_epi16(__m512i s, __mmask16 k, __m512i a, int n );
VPSHUFLW __m512i _mm512_maskz_shufflelo_epi16( __mmask16 k, __m512i a, int n );
VPSHUFLW __m256i _mm256_mask_shufflelo_epi16(__m256i s, __mmask8 k, __m256i a, int n );
VPSHUFLW __m256i _mm256_maskz_shufflelo_epi16( __mmask8 k, __m256i a, int n );
VPSHUFLW __m128i _mm_mask_shufflelo_epi16(__m128i s, __mmask8 k, __m128i a, int n );
VPSHUFLW __m128i _mm_maskz_shufflelo_epi16( __mmask8 k, __m128i a, int n );
(V)PSHUFLW:__m128i _mm_shufflelo_epi16(__m128i a, int n)
VPSHUFLW:__m256i _mm256_shufflelo_epi16(__m256i a, const int n)

Flags Affected
None.

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.



PSLLDQ—Shift Double Quadword Left Logical
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 73 /7 ib PSLLDQ xmm1, imm8 | A | V/V | SSE2 | Shift xmm1 left by imm8 bytes while shifting in 0s.
VEX.128.66.0F.WIG 73 /7 ib VPSLLDQ xmm1, xmm2, imm8 | B | V/V | AVX | Shift xmm2 left by imm8 bytes while shifting in 0s and store result in xmm1.
VEX.256.66.0F.WIG 73 /7 ib VPSLLDQ ymm1, ymm2, imm8 | B | V/V | AVX2 | Shift ymm2 left by imm8 bytes while shifting in 0s and store result in ymm1.
EVEX.128.66.0F.WIG 73 /7 ib VPSLLDQ xmm1, xmm2/m128, imm8 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shift xmm2/m128 left by imm8 bytes while shifting in 0s and store result in xmm1.
EVEX.256.66.0F.WIG 73 /7 ib VPSLLDQ ymm1, ymm2/m256, imm8 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Shift ymm2/m256 left by imm8 bytes while shifting in 0s and store result in ymm1.
EVEX.512.66.0F.WIG 73 /7 ib VPSLLDQ zmm1, zmm2/m512, imm8 | C | V/V | AVX512BW OR AVX10.1 (1) | Shift zmm2/m512 left by imm8 bytes while shifting in 0s and store result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (r, w) imm8 N/A N/A
B N/A VEX.vvvv (w) ModRM:r/m (r) imm8 N/A
C Full Mem EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A

Description
Shifts the destination operand (first operand) to the left by the number of bytes specified in the count operand
(second operand). The empty low-order bytes are cleared (set to all 0s). If the value specified by the count operand
is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate.
128-bit Legacy SSE version: The source and destination operands are the same. Bits (MAXVL-1:128) of the corre-
sponding YMM destination register remain unchanged.
VEX.128 encoded version: The source and destination operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed.
VEX.256 encoded version: The source operand is a YMM register. The destination operand is an YMM register. Bits
(MAXVL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and high
128-bit lanes.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane.
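
The following example is a minimal usage sketch and is not part of the original manual text; it shifts a 128-bit value left by three bytes with the _mm_slli_si128 intrinsic listed in the intrinsic equivalents below (the byte count must be a compile-time constant):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                              9, 10, 11, 12, 13, 14, 15, 16);
    __m128i r = _mm_slli_si128(v, 3);   /* PSLLDQ: vacated low-order bytes become 0 */
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 */
    printf("\n");
    return 0;
}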



Operation
VPSLLDQ (EVEX.U1.512 Encoded Version)
TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST[127:0] := SRC[127:0] << (TEMP * 8)
DEST[255:128] := SRC[255:128] << (TEMP * 8)
DEST[383:256] := SRC[383:256] << (TEMP * 8)
DEST[511:384] := SRC[511:384] << (TEMP * 8)
DEST[MAXVL-1:512] := 0

VPSLLDQ (VEX.256 and EVEX.256 Encoded Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST[127:0] := SRC[127:0] << (TEMP * 8)
DEST[255:128] := SRC[255:128] << (TEMP * 8)
DEST[MAXVL-1:256] := 0

VPSLLDQ (VEX.128 and EVEX.128 Encoded Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST := SRC << (TEMP * 8)
DEST[MAXVL-1:128] := 0

PSLLDQ (128-bit Legacy SSE Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST := DEST << (TEMP * 8)
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


(V)PSLLDQ __m128i _mm_slli_si128 ( __m128i a, int imm)
VPSLLDQ __m256i _mm256_slli_si256 ( __m256i a, const int imm)
VPSLLDQ __m512i _mm512_bslli_epi128 ( __m512i a, const int imm)
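
A minimal C sketch of the byte-granular left shift (illustrative only; the function name below is not part of this manual, and the example assumes a compiler with SSE2 support):

#include <emmintrin.h> /* SSE2 */

/* Shift the 16-byte vector left by 4 bytes: byte i of the source moves to
   byte i+4, and the 4 low-order bytes of the result are set to 0. */
__m128i shift_left_4_bytes(__m128i a)
{
    return _mm_slli_si128(a, 4);
}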

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



PSLLW/PSLLD/PSLLQ—Shift Packed Data Left Logical
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F F1 /r1 A V/V MMX Shift words in mm left by mm/m64 while shifting in
PSLLW mm, mm/m64 0s.

66 0F F1 /r A V/V SSE2 Shift words in xmm1 left by xmm2/m128 while
PSLLW xmm1, xmm2/m128 shifting in 0s.

NP 0F 71 /6 ib B V/V MMX Shift words in mm left by imm8 while shifting in
PSLLW mm1, imm8 0s.

66 0F 71 /6 ib B V/V SSE2 Shift words in xmm1 left by imm8 while shifting
PSLLW xmm1, imm8 in 0s.

NP 0F F2 /r1 A V/V MMX Shift doublewords in mm left by mm/m64 while
PSLLD mm, mm/m64 shifting in 0s.

66 0F F2 /r A V/V SSE2 Shift doublewords in xmm1 left by xmm2/m128
PSLLD xmm1, xmm2/m128 while shifting in 0s.

NP 0F 72 /6 ib1 B V/V MMX Shift doublewords in mm left by imm8 while
PSLLD mm, imm8 shifting in 0s.

66 0F 72 /6 ib B V/V SSE2 Shift doublewords in xmm1 left by imm8 while
PSLLD xmm1, imm8 shifting in 0s.

NP 0F F3 /r1 A V/V MMX Shift quadword in mm left by mm/m64 while
PSLLQ mm, mm/m64 shifting in 0s.

66 0F F3 /r A V/V SSE2 Shift quadwords in xmm1 left by xmm2/m128
PSLLQ xmm1, xmm2/m128 while shifting in 0s.

NP 0F 73 /6 ib1 B V/V MMX Shift quadword in mm left by imm8 while
PSLLQ mm, imm8 shifting in 0s.

66 0F 73 /6 ib B V/V SSE2 Shift quadwords in xmm1 left by imm8 while
PSLLQ xmm1, imm8 shifting in 0s.

VEX.128.66.0F.WIG F1 /r C V/V AVX Shift words in xmm2 left by amount specified in
VPSLLW xmm1, xmm2, xmm3/m128 xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 71 /6 ib D V/V AVX Shift words in xmm2 left by imm8 while shifting
VPSLLW xmm1, xmm2, imm8 in 0s.

VEX.128.66.0F.WIG F2 /r C V/V AVX Shift doublewords in xmm2 left by amount
VPSLLD xmm1, xmm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 72 /6 ib D V/V AVX Shift doublewords in xmm2 left by imm8 while
VPSLLD xmm1, xmm2, imm8 shifting in 0s.

VEX.128.66.0F.WIG F3 /r C V/V AVX Shift quadwords in xmm2 left by amount
VPSLLQ xmm1, xmm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 73 /6 ib D V/V AVX Shift quadwords in xmm2 left by imm8 while
VPSLLQ xmm1, xmm2, imm8 shifting in 0s.

VEX.256.66.0F.WIG F1 /r C V/V AVX2 Shift words in ymm2 left by amount specified in
VPSLLW ymm1, ymm2, xmm3/m128 xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 71 /6 ib D V/V AVX2 Shift words in ymm2 left by imm8 while shifting
VPSLLW ymm1, ymm2, imm8 in 0s.

VEX.256.66.0F.WIG F2 /r C V/V AVX2 Shift doublewords in ymm2 left by amount
VPSLLD ymm1, ymm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 72 /6 ib D V/V AVX2 Shift doublewords in ymm2 left by imm8 while


VPSLLD ymm1, ymm2, imm8 shifting in 0s.

VEX.256.66.0F.WIG F3 /r C V/V AVX2 Shift quadwords in ymm2 left by amount


VPSLLQ ymm1, ymm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 73 /6 ib D V/V AVX2 Shift quadwords in ymm2 left by imm8 while


VPSLLQ ymm1, ymm2, imm8 shifting in 0s.

EVEX.128.66.0F.WIG F1 /r G V/V (AVX512VL AND Shift words in xmm2 left by amount specified in
VPSLLW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 while shifting in 0s using
xmm3/m128 AVX10.12 writemask k1.
EVEX.256.66.0F.WIG F1 /r G V/V (AVX512VL AND Shift words in ymm2 left by amount specified in
VPSLLW ymm1 {k1}{z}, ymm2, AVX512BW) OR xmm3/m128 while shifting in 0s using
xmm3/m128 AVX10.12 writemask k1.
EVEX.512.66.0F.WIG F1 /r G V/V AVX512BW Shift words in zmm2 left by amount specified in
VPSLLW zmm1 {k1}{z}, zmm2, OR AVX10.12 xmm3/m128 while shifting in 0s using
xmm3/m128 writemask k1.
EVEX.128.66.0F.WIG 71 /6 ib E V/V (AVX512VL AND Shift words in xmm2/m128 left by imm8 while
VPSLLW xmm1 {k1}{z}, xmm2/m128, AVX512BW) OR shifting in 0s using writemask k1.
imm8 AVX10.12
EVEX.256.66.0F.WIG 71 /6 ib E V/V (AVX512VL AND Shift words in ymm2/m256 left by imm8 while
VPSLLW ymm1 {k1}{z}, ymm2/m256, AVX512BW) OR shifting in 0s using writemask k1.
imm8 AVX10.12
EVEX.512.66.0F.WIG 71 /6 ib E V/V AVX512BW Shift words in zmm2/m512 left by imm8 while
VPSLLW zmm1 {k1}{z}, zmm2/m512, OR AVX10.12 shifting in 0 using writemask k1.
imm8
EVEX.128.66.0F.W0 F2 /r G V/V (AVX512VL AND Shift doublewords in xmm2 left by amount
VPSLLD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 under writemask k1.
EVEX.256.66.0F.W0 F2 /r G V/V (AVX512VL AND Shift doublewords in ymm2 left by amount
VPSLLD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 under writemask k1.
EVEX.512.66.0F.W0 F2 /r G V/V AVX512F Shift doublewords in zmm2 left by amount
VPSLLD zmm1 {k1}{z}, zmm2, OR AVX10.12 specified in xmm3/m128 while shifting in 0s
xmm3/m128 under writemask k1.
EVEX.128.66.0F.W0 72 /6 ib F V/V (AVX512VL AND Shift doublewords in xmm2/m128/m32bcst left
VPSLLD xmm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
xmm2/m128/m32bcst, imm8 AVX10.12
EVEX.256.66.0F.W0 72 /6 ib F V/V (AVX512VL AND Shift doublewords in ymm2/m256/m32bcst left
VPSLLD ymm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
ymm2/m256/m32bcst, imm8 AVX10.12
EVEX.512.66.0F.W0 72 /6 ib F V/V AVX512F Shift doublewords in zmm2/m512/m32bcst left
VPSLLD zmm1 {k1}{z}, OR AVX10.12 by imm8 while shifting in 0s using writemask k1.
zmm2/m512/m32bcst, imm8
EVEX.128.66.0F.W1 F3 /r G V/V (AVX512VL AND Shift quadwords in xmm2 left by amount
VPSLLQ xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 using writemask k1.

EVEX.256.66.0F.W1 F3 /r G V/V (AVX512VL AND Shift quadwords in ymm2 left by amount
VPSLLQ ymm1 {k1}{z}, ymm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 using writemask k1.
EVEX.512.66.0F.W1 F3 /r G V/V AVX512F Shift quadwords in zmm2 left by amount
VPSLLQ zmm1 {k1}{z}, zmm2, OR AVX10.12 specified in xmm3/m128 while shifting in 0s
xmm3/m128 using writemask k1.
EVEX.128.66.0F.W1 73 /6 ib F V/V (AVX512VL AND Shift quadwords in xmm2/m128/m64bcst left
VPSLLQ xmm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
xmm2/m128/m64bcst, imm8 AVX10.12
EVEX.256.66.0F.W1 73 /6 ib F V/V (AVX512VL AND Shift quadwords in ymm2/m256/m64bcst left
VPSLLQ ymm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
ymm2/m256/m64bcst, imm8 AVX10.12
EVEX.512.66.0F.W1 73 /6 ib F V/V AVX512F Shift quadwords in zmm2/m512/m64bcst left
VPSLLQ zmm1 {k1}{z}, OR AVX10.12 by imm8 while shifting in 0s using writemask k1.
zmm2/m512/m64bcst, imm8

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (r, w) imm8 N/A N/A
C N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
D N/A VEX.vvvv (w) ModRM:r/m (r) imm8 N/A
E Full Mem EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
F Full EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
G Mem128 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadword) in the destination operand (first
operand) to the left by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted left, the empty low-order bits are cleared (set to 0). If the value specified by the count
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand
is set to all 0s. Figure 4-17 gives an example of shifting words in a 64-bit operand.



[Figure: DEST before the shift holds X3 X2 X1 X0; after the shift each element is Xn << COUNT, with zero extension.]

Figure 4-17. PSLLW, PSLLD, and PSLLQ Instruction Operation Using 64-bit Operand

The (V)PSLLW instruction shifts each of the words in the destination operand to the left by the number of bits spec-
ified in the count operand; the (V)PSLLD instruction shifts each of the doublewords in the destination operand; and
the (V)PSLLQ instruction shifts the quadword (or quadwords) in the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded
but the upper 64 bits are ignored.
VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory loca-
tion or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are
ignored.
VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.
Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.
Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /6, or EVEX.128.66.0F 71-73 /6),
VEX.vvvv/EVEX.vvvv encodes the destination register.
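
A minimal C sketch of the immediate-count and variable-count forms (illustrative only; the function names below are not part of this manual, and the example assumes a compiler with SSE2 support):

#include <emmintrin.h> /* SSE2 */

/* Immediate count: shift each word left by 3 bits (multiply by 8).
   A count of 16 or greater would set every word to 0. */
__m128i shl3_epi16(__m128i a)
{
    return _mm_slli_epi16(a, 3);
}

/* Variable count: the shift amount is taken from the low 64 bits of the
   count operand and applied to every doubleword element. */
__m128i shl_var_epi32(__m128i a, __m128i count)
{
    return _mm_sll_epi32(a, count);
}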

Operation
PSLLW (With 64-bit Operand)
IF (COUNT > 15)
THEN
DEST[63:0] := 0000000000000000H;
ELSE
DEST[15:0] := ZeroExtend(DEST[15:0] << COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := ZeroExtend(DEST[63:48] << COUNT);
FI;
PSLLD (With 64-bit Operand)
IF (COUNT > 31)
THEN
DEST[63:0] := 0000000000000000H;
ELSE
DEST[31:0] := ZeroExtend(DEST[31:0] << COUNT);
DEST[63:32] := ZeroExtend(DEST[63:32] << COUNT);
FI;

PSLLQ (With 64-bit Operand)


IF (COUNT > 63)
THEN
DEST[63:0] := 0000000000000000H;
ELSE
DEST := ZeroExtend(DEST << COUNT);
FI;

LOGICAL_LEFT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] << COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := ZeroExtend(SRC[127:112] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[31:0] := 0
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] := ZeroExtend(SRC[127:96] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] << COUNT);
FI;



LOGICAL_LEFT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] << COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[127:0] := 00000000000000000000000000000000H
DEST[255:128] := 00000000000000000000000000000000H
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] << COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] := ZeroExtend(SRC[255:240] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
DEST[255:128] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
(* Repeat shift operation for 2nd through 7th doublewords *)
DEST[255:224] := ZeroExtend(SRC[255:224] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[127:0] := 00000000000000000000000000000000H
DEST[255:128] := 00000000000000000000000000000000H
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] << COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] << COUNT)
DEST[191:128] := ZeroExtend(SRC[191:128] << COUNT);
DEST[255:192] := ZeroExtend(SRC[255:192] << COUNT);
FI;

VPSLLW (EVEX Versions, xmm/m128)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;

IF VL = 512
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLW (EVEX Versions, imm8)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(SRC1[127:0], imm8)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
TMP_DEST[511:256] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLW (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSLLW (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;



VPSLLW (xmm, xmm, xmm/m128) - VEX.128 Encoding
DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSLLW (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSLLW (xmm, xmm, xmm/m128)


DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSLLW (xmm, imm8)


DEST[127:0] := LOGICAL_LEFT_SHIFT_WORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSLLD (EVEX versions, imm8)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+31:i] := LOGICAL_LEFT_SHIFT_DWORDS1(SRC1[31:0], imm8)
ELSE DEST[i+31:i] := LOGICAL_LEFT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLD (EVEX Versions, xmm/m128)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_LEFT_SHIFT_DWORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLD (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSLLD (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;

VPSLLD (xmm, xmm, xmm/m128) - VEX.128 Encoding


DEST[127:0] := LOGICAL_LEFT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSLLD (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_LEFT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSLLD (xmm, xmm, xmm/m128)


DEST[127:0] := LOGICAL_LEFT_SHIFT_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSLLD (xmm, imm8)


DEST[127:0] := LOGICAL_LEFT_SHIFT_DWORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSLLQ (EVEX Versions, imm8)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+63:i] := LOGICAL_LEFT_SHIFT_QWORDS1(SRC1[63:0], imm8)
ELSE DEST[i+63:i] := LOGICAL_LEFT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLQ (EVEX Versions, xmm/m128)


(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_LEFT_SHIFT_QWORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSLLQ (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSLLQ (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;

VPSLLQ (xmm, xmm, xmm/m128) - VEX.128 Encoding


DEST[127:0] := LOGICAL_LEFT_SHIFT_QWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSLLQ (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_LEFT_SHIFT_QWORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSLLQ (xmm, xmm, xmm/m128)


DEST[127:0] := LOGICAL_LEFT_SHIFT_QWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSLLQ (xmm, imm8)


DEST[127:0] := LOGICAL_LEFT_SHIFT_QWORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents


VPSLLD __m512i _mm512_slli_epi32(__m512i a, unsigned int imm);
VPSLLD __m512i _mm512_mask_slli_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);
VPSLLD __m512i _mm512_maskz_slli_epi32( __mmask16 k, __m512i a, unsigned int imm);
VPSLLD __m256i _mm256_mask_slli_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSLLD __m256i _mm256_maskz_slli_epi32( __mmask8 k, __m256i a, unsigned int imm);

VPSLLD __m128i _mm_mask_slli_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLD __m128i _mm_maskz_slli_epi32( __mmask8 k, __m128i a, unsigned int imm);
VPSLLD __m512i _mm512_sll_epi32(__m512i a, __m128i cnt);
VPSLLD __m512i _mm512_mask_sll_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSLLD __m512i _mm512_maskz_sll_epi32( __mmask16 k, __m512i a, __m128i cnt);
VPSLLD __m256i _mm256_mask_sll_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSLLD __m256i _mm256_maskz_sll_epi32( __mmask8 k, __m256i a, __m128i cnt);
VPSLLD __m128i _mm_mask_sll_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLD __m128i _mm_maskz_sll_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSLLQ __m512i _mm512_slli_epi64(__m512i a, unsigned int imm);
VPSLLQ __m512i _mm512_mask_slli_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);
VPSLLQ __m512i _mm512_maskz_slli_epi64( __mmask8 k, __m512i a, unsigned int imm);
VPSLLQ __m256i _mm256_mask_slli_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSLLQ __m256i _mm256_maskz_slli_epi64( __mmask8 k, __m256i a, unsigned int imm);
VPSLLQ __m128i _mm_mask_slli_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLQ __m128i _mm_maskz_slli_epi64( __mmask8 k, __m128i a, unsigned int imm);
VPSLLQ __m512i _mm512_sll_epi64(__m512i a, __m128i cnt);
VPSLLQ __m512i _mm512_mask_sll_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);
VPSLLQ __m512i _mm512_maskz_sll_epi64( __mmask8 k, __m512i a, __m128i cnt);
VPSLLQ __m256i _mm256_mask_sll_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSLLQ __m256i _mm256_maskz_sll_epi64( __mmask8 k, __m256i a, __m128i cnt);
VPSLLQ __m128i _mm_mask_sll_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLQ __m128i _mm_maskz_sll_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSLLW __m512i _mm512_slli_epi16(__m512i a, unsigned int imm);
VPSLLW __m512i _mm512_mask_slli_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);
VPSLLW __m512i _mm512_maskz_slli_epi16( __mmask32 k, __m512i a, unsigned int imm);
VPSLLW __m256i _mm256_mask_slli_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);
VPSLLW __m256i _mm256_maskz_slli_epi16( __mmask16 k, __m256i a, unsigned int imm);
VPSLLW __m128i _mm_mask_slli_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLW __m128i _mm_maskz_slli_epi16( __mmask8 k, __m128i a, unsigned int imm);
VPSLLW __m512i _mm512_sll_epi16(__m512i a, __m128i cnt);
VPSLLW __m512i _mm512_mask_sll_epi16(__m512i s, __mmask32 k, __m512i a, __m128i cnt);
VPSLLW __m512i _mm512_maskz_sll_epi16( __mmask32 k, __m512i a, __m128i cnt);
VPSLLW __m256i _mm256_mask_sll_epi16(__m256i s, __mmask16 k, __m256i a, __m128i cnt);
VPSLLW __m256i _mm256_maskz_sll_epi16( __mmask16 k, __m256i a, __m128i cnt);
VPSLLW __m128i _mm_mask_sll_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLW __m128i _mm_maskz_sll_epi16( __mmask8 k, __m128i a, __m128i cnt);
PSLLW __m64 _mm_slli_pi16 (__m64 m, int count)
PSLLW __m64 _mm_sll_pi16(__m64 m, __m64 count)
(V)PSLLW __m128i _mm_slli_epi16(__m128i m, int count)
(V)PSLLW __m128i _mm_sll_epi16(__m128i m, __m128i count)
VPSLLW __m256i _mm256_slli_epi16 (__m256i m, int count)
VPSLLW __m256i _mm256_sll_epi16 (__m256i m, __m128i count)
PSLLD __m64 _mm_slli_pi32(__m64 m, int count)
PSLLD __m64 _mm_sll_pi32(__m64 m, __m64 count)
(V)PSLLD __m128i _mm_slli_epi32(__m128i m, int count)
(V)PSLLD __m128i _mm_sll_epi32(__m128i m, __m128i count)
VPSLLD __m256i _mm256_slli_epi32 (__m256i m, int count)
VPSLLD __m256i _mm256_sll_epi32 (__m256i m, __m128i count)
PSLLQ __m64 _mm_slli_si64(__m64 m, int count)
PSLLQ __m64 _mm_sll_si64(__m64 m, __m64 count)
(V)PSLLQ __m128i _mm_slli_epi64(__m128i m, int count)
(V)PSLLQ __m128i _mm_sll_epi64(__m128i m, __m128i count)
VPSLLQ __m256i _mm256_slli_epi64 (__m256i m, int count)

VPSLLQ __m256i _mm256_sll_epi64 (__m256i m, __m128i count)
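
A minimal C sketch of the EVEX masked form (illustrative only; the function name below is not part of this manual, and the example assumes a compiler and processor with AVX-512F support):

#include <immintrin.h> /* AVX-512F */

/* Shift sixteen doublewords left by 5 bits; only lanes selected by mask k
   are written, and unselected lanes keep the corresponding value of src. */
__m512i shl5_epi32_masked(__m512i src, __mmask16 k, __m512i a)
{
    return _mm512_mask_slli_epi32(src, k, a, 5);
}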

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSLLW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSLLD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”



PSRAW/PSRAD/PSRAQ—Shift Packed Data Right Arithmetic
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E1 /r1 A V/V MMX Shift words in mm right by mm/m64 while
PSRAW mm, mm/m64 shifting in sign bits.

66 0F E1 /r A V/V SSE2 Shift words in xmm1 right by xmm2/m128
PSRAW xmm1, xmm2/m128 while shifting in sign bits.

NP 0F 71 /4 ib1 B V/V MMX Shift words in mm right by imm8 while shifting
PSRAW mm, imm8 in sign bits.

66 0F 71 /4 ib B V/V SSE2 Shift words in xmm1 right by imm8 while
PSRAW xmm1, imm8 shifting in sign bits.

NP 0F E2 /r1 A V/V MMX Shift doublewords in mm right by mm/m64
PSRAD mm, mm/m64 while shifting in sign bits.

66 0F E2 /r A V/V SSE2 Shift doublewords in xmm1 right by xmm2/m128
PSRAD xmm1, xmm2/m128 while shifting in sign bits.

NP 0F 72 /4 ib1 B V/V MMX Shift doublewords in mm right by imm8 while
PSRAD mm, imm8 shifting in sign bits.

66 0F 72 /4 ib B V/V SSE2 Shift doublewords in xmm1 right by imm8 while
PSRAD xmm1, imm8 shifting in sign bits.

VEX.128.66.0F.WIG E1 /r C V/V AVX Shift words in xmm2 right by amount specified
VPSRAW xmm1, xmm2, xmm3/m128 in xmm3/m128 while shifting in sign bits.

VEX.128.66.0F.WIG 71 /4 ib D V/V AVX Shift words in xmm2 right by imm8 while
VPSRAW xmm1, xmm2, imm8 shifting in sign bits.

VEX.128.66.0F.WIG E2 /r C V/V AVX Shift doublewords in xmm2 right by amount
VPSRAD xmm1, xmm2, xmm3/m128 specified in xmm3/m128 while shifting in sign
bits.

VEX.128.66.0F.WIG 72 /4 ib D V/V AVX Shift doublewords in xmm2 right by imm8 while
VPSRAD xmm1, xmm2, imm8 shifting in sign bits.

VEX.256.66.0F.WIG E1 /r C V/V AVX2 Shift words in ymm2 right by amount specified
VPSRAW ymm1, ymm2, xmm3/m128 in xmm3/m128 while shifting in sign bits.

VEX.256.66.0F.WIG 71 /4 ib D V/V AVX2 Shift words in ymm2 right by imm8 while
VPSRAW ymm1, ymm2, imm8 shifting in sign bits.

VEX.256.66.0F.WIG E2 /r C V/V AVX2 Shift doublewords in ymm2 right by amount
VPSRAD ymm1, ymm2, xmm3/m128 specified in xmm3/m128 while shifting in sign
bits.

VEX.256.66.0F.WIG 72 /4 ib D V/V AVX2 Shift doublewords in ymm2 right by imm8 while
VPSRAD ymm1, ymm2, imm8 shifting in sign bits.

EVEX.128.66.0F.WIG E1 /r G V/V (AVX512VL AND Shift words in xmm2 right by amount specified
VPSRAW xmm1 {k1}{z}, xmm2, AVX512BW) OR in xmm3/m128 while shifting in sign bits using
xmm3/m128 AVX10.12 writemask k1.
EVEX.256.66.0F.WIG E1 /r G V/V (AVX512VL AND Shift words in ymm2 right by amount specified
VPSRAW ymm1 {k1}{z}, ymm2, AVX512BW) OR in xmm3/m128 while shifting in sign bits using
xmm3/m128 AVX10.12 writemask k1.
EVEX.512.66.0F.WIG E1 /r G V/V AVX512BW Shift words in zmm2 right by amount specified
VPSRAW zmm1 {k1}{z}, zmm2, OR AVX10.12 in xmm3/m128 while shifting in sign bits using
xmm3/m128 writemask k1.

EVEX.128.66.0F.WIG 71 /4 ib E V/V (AVX512VL AND Shift words in xmm2/m128 right by imm8 while
VPSRAW xmm1 {k1}{z}, xmm2/m128, AVX512BW) OR shifting in sign bits using writemask k1.
imm8 AVX10.12
EVEX.256.66.0F.WIG 71 /4 ib E V/V (AVX512VL AND Shift words in ymm2/m256 right by imm8 while
VPSRAW ymm1 {k1}{z}, ymm2/m256, AVX512BW) OR shifting in sign bits using writemask k1.
imm8 AVX10.12
EVEX.512.66.0F.WIG 71 /4 ib E V/V AVX512BW Shift words in zmm2/m512 right by imm8 while
VPSRAW zmm1 {k1}{z}, zmm2/m512, OR AVX10.12 shifting in sign bits using writemask k1.
imm8
EVEX.128.66.0F.W0 E2 /r G V/V (AVX512VL AND Shift doublewords in xmm2 right by amount
VPSRAD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in sign
xmm3/m128 AVX10.12 bits using writemask k1.
EVEX.256.66.0F.W0 E2 /r G V/V (AVX512VL AND Shift doublewords in ymm2 right by amount
VPSRAD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in xmm3/m128 while shifting in sign
xmm3/m128 AVX10.12 bits using writemask k1.
EVEX.512.66.0F.W0 E2 /r G V/V AVX512F Shift doublewords in zmm2 right by amount
VPSRAD zmm1 {k1}{z}, zmm2, OR AVX10.12 specified in xmm3/m128 while shifting in sign
xmm3/m128 bits using writemask k1.
EVEX.128.66.0F.W0 72 /4 ib F V/V (AVX512VL AND Shift doublewords in xmm2/m128/m32bcst
VPSRAD xmm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in sign bits using
xmm2/m128/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.256.66.0F.W0 72 /4 ib F V/V (AVX512VL AND Shift doublewords in ymm2/m256/m32bcst
VPSRAD ymm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in sign bits using
ymm2/m256/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.512.66.0F.W0 72 /4 ib F V/V AVX512F Shift doublewords in zmm2/m512/m32bcst
VPSRAD zmm1 {k1}{z}, OR AVX10.12 right by imm8 while shifting in sign bits using
zmm2/m512/m32bcst, imm8 writemask k1.
EVEX.128.66.0F.W1 E2 /r G V/V (AVX512VL AND Shift quadwords in xmm2 right by amount
VPSRAQ xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in sign
xmm3/m128 AVX10.12 bits using writemask k1.
EVEX.256.66.0F.W1 E2 /r G V/V (AVX512VL AND Shift quadwords in ymm2 right by amount
VPSRAQ ymm1 {k1}{z}, ymm2, AVX512F) OR specified in xmm3/m128 while shifting in sign
xmm3/m128 AVX10.12 bits using writemask k1.
EVEX.512.66.0F.W1 E2 /r G V/V AVX512F Shift quadwords in zmm2 right by amount
VPSRAQ zmm1 {k1}{z}, zmm2, OR AVX10.12 specified in xmm3/m128 while shifting in sign
xmm3/m128 bits using writemask k1.
EVEX.128.66.0F.W1 72 /4 ib F V/V (AVX512VL AND Shift quadwords in xmm2/m128/m64bcst right
VPSRAQ xmm1 {k1}{z}, AVX512F) OR by imm8 while shifting in sign bits using
xmm2/m128/m64bcst, imm8 AVX10.12 writemask k1.
EVEX.256.66.0F.W1 72 /4 ib F V/V (AVX512VL AND Shift quadwords in ymm2/m256/m64bcst right
VPSRAQ ymm1 {k1}{z}, AVX512F) OR by imm8 while shifting in sign bits using
ymm2/m256/m64bcst, imm8 AVX10.12 writemask k1.
EVEX.512.66.0F.W1 72 /4 ib F V/V AVX512F Shift quadwords in zmm2/m512/m64bcst right
VPSRAQ zmm1 {k1}{z}, OR AVX10.12 by imm8 while shifting in sign bits using
zmm2/m512/m64bcst, imm8 writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (r, w) imm8 N/A N/A
C N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
D N/A VEX.vvvv (w) ModRM:r/m (r) imm8 N/A
E Full Mem EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
F Full EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
G Mem128 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the destination operand (first
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted right, the empty high-order bits are filled with the initial value of the sign bit of the data
element. If the value specified by the count operand is greater than 15 (for words), 31 (for doublewords), or 63 (for
quadwords), each destination data element is filled with the initial value of the sign bit of the element. (Figure 4-18
gives an example of shifting words in a 64-bit operand.)

[Figure: DEST before the shift holds X3 X2 X1 X0; after the shift each element is Xn >> COUNT, with sign extension.]

Figure 4-18. PSRAW and PSRAD Instruction Operation Using a 64-bit Operand

Note that only the first 64 bits of a 128-bit count operand are checked to compute the count. If the second source
operand is a memory address, 128 bits are loaded.
The (V)PSRAW instruction shifts each of the words in the destination operand to the right by the number of bits
specified in the count operand, and the (V)PSRAD instruction shifts each of the doublewords in the destination
operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded
but the upper 64 bits are ignored.
VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory location
or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are
ignored.
VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.
Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.
Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /4, EVEX.128.66.0F 71-73 /4),
VEX.vvvv/EVEX.vvvv encodes the destination register.
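
A minimal C sketch of the arithmetic right shift (illustrative only; the function name below is not part of this manual, and the example assumes a compiler with SSE2 support):

#include <emmintrin.h> /* SSE2 */

/* Arithmetic shift preserves the sign: a word holding 0xFF00 (-256)
   shifted right by 4 yields 0xFFF0 (-16). A count of 16 or greater fills
   each word entirely with its sign bit (all 0s or all 1s). */
__m128i sra4_epi16(__m128i a)
{
    return _mm_srai_epi16(a, 4);
}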

Operation
PSRAW (With 64-bit Operand)
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(DEST[15:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := SignExtend(DEST[63:48] >> COUNT);

PSRAD (with 64-bit operand)


IF (COUNT > 31)
THEN COUNT := 32;
FI;
DEST[31:0] := SignExtend(DEST[31:0] >> COUNT);
DEST[63:32] := SignExtend(DEST[63:32] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[31:0] := SignBit
ELSE
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
FI;

ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := SignBit
ELSE
DEST[63:0] := SignExtend(SRC[63:0] >> COUNT);
FI;

ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] := SignExtend(SRC[255:240] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN COUNT := 32;
FI;
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th doublewords *)
DEST[255:224] := SignExtend(SRC[255:224] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC, VL) ; VL: 128b, 256b or 512b


COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN COUNT := 64;
FI;
DEST[63:0] := SignExtend(SRC[63:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th quadwords *)
DEST[VL-1:VL-64] := SignExtend(SRC[VL-1:VL-64] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := SignExtend(SRC[127:112] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN COUNT := 32;
FI;
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] := SignExtend(SRC[127:96] >> COUNT);

VPSRAW (EVEX versions, xmm/m128)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRAW (EVEX Versions, imm8)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1[127:0], imm8)
FI;
IF VL = 256
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
TMP_DEST[511:256] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRAW (ymm, ymm, xmm/m128) - VEX


DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPSRAW (ymm, imm8) - VEX


DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0

VPSRAW (xmm, xmm, xmm/m128) - VEX


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSRAW (xmm, imm8) - VEX
DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSRAW (xmm, xmm, xmm/m128)


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSRAW (xmm, imm8)


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSRAD (EVEX Versions, imm8)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+31:i] := ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC1[31:0], imm8)
ELSE DEST[i+31:i] := ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRAD (EVEX Versions, xmm/m128)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
TMP_DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRAD (ymm, ymm, xmm/m128) - VEX


DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPSRAD (ymm, imm8) - VEX


DEST[255:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0

VPSRAD (xmm, xmm, xmm/m128) - VEX


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSRAD (xmm, imm8) - VEX


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSRAD (xmm, xmm, xmm/m128)


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSRAD (xmm, imm8)


DEST[127:0] := ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSRAQ (EVEX Versions, imm8)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+63:i] := ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC1[63:0], imm8)
ELSE DEST[i+63:i] := ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRAQ (EVEX Versions, xmm/m128)


(KL, VL) = (2, 128), (4, 256), (8, 512)
TMP_DEST[VL-1:0] := ARITHMETIC_RIGHT_SHIFT_QWORDS(SRC1[VL-1:0], SRC2, VL)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPSRAD __m512i _mm512_srai_epi32(__m512i a, unsigned int imm);
VPSRAD __m512i _mm512_mask_srai_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);
VPSRAD __m512i _mm512_maskz_srai_epi32( __mmask16 k, __m512i a, unsigned int imm);
VPSRAD __m256i _mm256_mask_srai_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRAD __m256i _mm256_maskz_srai_epi32( __mmask8 k, __m256i a, unsigned int imm);
VPSRAD __m128i _mm_mask_srai_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAD __m128i _mm_maskz_srai_epi32( __mmask8 k, __m128i a, unsigned int imm);
VPSRAD __m512i _mm512_sra_epi32(__m512i a, __m128i cnt);
VPSRAD __m512i _mm512_mask_sra_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSRAD __m512i _mm512_maskz_sra_epi32( __mmask16 k, __m512i a, __m128i cnt);
VPSRAD __m256i _mm256_mask_sra_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRAD __m256i _mm256_maskz_sra_epi32( __mmask8 k, __m256i a, __m128i cnt);
VPSRAD __m128i _mm_mask_sra_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAD __m128i _mm_maskz_sra_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSRAQ __m512i _mm512_srai_epi64(__m512i a, unsigned int imm);
VPSRAQ __m512i _mm512_mask_srai_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);
VPSRAQ __m512i _mm512_maskz_srai_epi64( __mmask8 k, __m512i a, unsigned int imm);
VPSRAQ __m256i _mm256_mask_srai_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRAQ __m256i _mm256_maskz_srai_epi64( __mmask8 k, __m256i a, unsigned int imm);
VPSRAQ __m128i _mm_mask_srai_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAQ __m128i _mm_maskz_srai_epi64( __mmask8 k, __m128i a, unsigned int imm);
VPSRAQ __m512i _mm512_sra_epi64(__m512i a, __m128i cnt);
VPSRAQ __m512i _mm512_mask_sra_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);
VPSRAQ __m512i _mm512_maskz_sra_epi64( __mmask8 k, __m512i a, __m128i cnt);
VPSRAQ __m256i _mm256_mask_sra_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRAQ __m256i _mm256_maskz_sra_epi64( __mmask8 k, __m256i a, __m128i cnt);
VPSRAQ __m128i _mm_mask_sra_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAQ __m128i _mm_maskz_sra_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSRAW __m512i _mm512_srai_epi16(__m512i a, unsigned int imm);
VPSRAW __m512i _mm512_mask_srai_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);
VPSRAW __m512i _mm512_maskz_srai_epi16( __mmask32 k, __m512i a, unsigned int imm);
VPSRAW __m256i _mm256_mask_srai_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);
VPSRAW __m256i _mm256_maskz_srai_epi16( __mmask16 k, __m256i a, unsigned int imm);
VPSRAW __m128i _mm_mask_srai_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAW __m128i _mm_maskz_srai_epi16( __mmask8 k, __m128i a, unsigned int imm);
VPSRAW __m512i _mm512_sra_epi16(__m512i a, __m128i cnt);
VPSRAW __m512i _mm512_mask_sra_epi16(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSRAW __m512i _mm512_maskz_sra_epi16( __mmask16 k, __m512i a, __m128i cnt);
VPSRAW __m256i _mm256_mask_sra_epi16(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRAW __m256i _mm256_maskz_sra_epi16( __mmask8 k, __m256i a, __m128i cnt);
VPSRAW __m128i _mm_mask_sra_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAW __m128i _mm_maskz_sra_epi16( __mmask8 k, __m128i a, __m128i cnt);
PSRAW __m64 _mm_srai_pi16 (__m64 m, int count)
PSRAW __m64 _mm_sra_pi16 (__m64 m, __m64 count)
(V)PSRAW __m128i _mm_srai_epi16(__m128i m, int count)
(V)PSRAW __m128i _mm_sra_epi16(__m128i m, __m128i count)
VPSRAW __m256i _mm256_srai_epi16 (__m256i m, int count)
VPSRAW __m256i _mm256_sra_epi16 (__m256i m, __m128i count)
PSRAD __m64 _mm_srai_pi32 (__m64 m, int count)
PSRAD __m64 _mm_sra_pi32 (__m64 m, __m64 count)
(V)PSRAD __m128i _mm_srai_epi32 (__m128i m, int count)
(V)PSRAD __m128i _mm_sra_epi32 (__m128i m, __m128i count)
VPSRAD __m256i _mm256_srai_epi32 (__m256i m, int count)
VPSRAD __m256i _mm256_sra_epi32 (__m256i m, __m128i count)
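
A minimal C sketch of the EVEX zeroing-masked quadword form (illustrative only; the function name below is not part of this manual, and the example assumes a compiler and processor with AVX-512F support):

#include <immintrin.h> /* AVX-512F */

/* Shift eight signed quadwords right arithmetically by 7 bits; lanes whose
   mask bit is clear are zeroed rather than merged. */
__m512i sra7_epi64_z(__mmask8 k, __m512i a)
{
    return _mm512_maskz_srai_epi64(k, a, 7);
}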

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSRAW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSRAD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”



PSRLDQ—Shift Double Quadword Right Logical
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 73 /3 ib A V/V SSE2 Shift xmm1 right by imm8 bytes while shifting in 0s.
PSRLDQ xmm1, imm8
VEX.128.66.0F.WIG 73 /3 ib B V/V AVX Shift xmm2 right by imm8 bytes while shifting in
VPSRLDQ xmm1, xmm2, imm8 0s.

VEX.256.66.0F.WIG 73 /3 ib B V/V AVX2 Shift ymm2 right by imm8 bytes while shifting in
VPSRLDQ ymm1, ymm2, imm8 0s.

EVEX.128.66.0F.WIG 73 /3 ib C V/V (AVX512VL AND Shift xmm2/m128 right by imm8 bytes while
VPSRLDQ xmm1, xmm2/m128, imm8 AVX512BW) OR shifting in 0s and store result in xmm1.
AVX10.11
EVEX.256.66.0F.WIG 73 /3 ib C V/V (AVX512VL AND Shift ymm2/m256 right by imm8 bytes while
VPSRLDQ ymm1, ymm2/m256, imm8 AVX512BW) OR shifting in 0s and store result in ymm1.
AVX10.11
EVEX.512.66.0F.WIG 73 /3 ib C V/V AVX512BW Shift zmm2/m512 right by imm8 bytes while
VPSRLDQ zmm1, zmm2/m512, imm8 OR AVX10.11 shifting in 0s and store result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (r, w) imm8 N/A N/A
B N/A VEX.vvvv (w) ModRM:r/m (r) imm8 N/A
C Full Mem EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A

Description
Shifts the destination operand (first operand) to the right by the number of bytes specified in the count operand
(second operand). The empty high-order bytes are cleared (set to all 0s). If the value specified by the count
operand is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The source and destination operands are the same. Bits (MAXVL-1:128) of the corre-
sponding YMM destination register remain unchanged.
VEX.128 encoded version: The source and destination operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed.
VEX.256 encoded version: The source operand is a YMM register. The destination operand is a YMM register. Bits
(MAXVL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and high
128-bit lanes.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane.
Note: VEX.vvvv/EVEX.vvvv encodes the destination register.



Operation
VPSRLDQ (EVEX.512 Encoded Version)
TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST[127:0] := SRC[127:0] >> (TEMP * 8)
DEST[255:128] := SRC[255:128] >> (TEMP * 8)
DEST[383:256] := SRC[383:256] >> (TEMP * 8)
DEST[511:384] := SRC[511:384] >> (TEMP * 8)
DEST[MAXVL-1:512] := 0;

VPSRLDQ (VEX.256 and EVEX.256 Encoded Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST[127:0] := SRC[127:0] >> (TEMP * 8)
DEST[255:128] := SRC[255:128] >> (TEMP * 8)
DEST[MAXVL-1:256] := 0;

VPSRLDQ (VEX.128 and EVEX.128 Encoded Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST := SRC >> (TEMP * 8)
DEST[MAXVL-1:128] := 0;

PSRLDQ (128-bit Legacy SSE Version)


TEMP := COUNT
IF (TEMP > 15) THEN TEMP := 16; FI
DEST := DEST >> (TEMP * 8)
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents


(V)PSRLDQ __m128i _mm_srli_si128 ( __m128i a, int imm)
VPSRLDQ __m256i _mm256_bsrli_epi128 ( __m256i, const int)
VPSRLDQ __m512i _mm512_bsrli_epi128 ( __m512i, int)
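
A minimal C sketch of the byte-granular right shift (illustrative only; the function name below is not part of this manual, and the example assumes a compiler with SSE2 support):

#include <emmintrin.h> /* SSE2 */

/* Extract the high quadword of a 128-bit vector: shifting right by 8 bytes
   moves byte i to byte i-8 and zeroes the 8 high-order result bytes. */
__m128i high_qword(__m128i a)
{
    return _mm_srli_si128(a, 8);
}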

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



PSRLW/PSRLD/PSRLQ—Shift Packed Data Right Logical
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F D1 /r1 A V/V MMX Shift words in mm right by amount specified in
PSRLW mm, mm/m64 while shifting in 0s.

66 0F D1 /r A V/V SSE2 Shift words in xmm1 right by amount specified
PSRLW xmm1, xmm2/m128 in xmm2/m128 while shifting in 0s.

NP 0F 71 /2 ib1 B V/V MMX Shift words in mm right by imm8 while shifting
PSRLW mm, imm8 in 0s.

66 0F 71 /2 ib B V/V SSE2 Shift words in xmm1 right by imm8 while
PSRLW xmm1, imm8 shifting in 0s.

NP 0F D2 /r1 A V/V MMX Shift doublewords in mm right by amount
PSRLD mm, mm/m64 specified in mm/m64 while shifting in 0s.

66 0F D2 /r A V/V SSE2 Shift doublewords in xmm1 right by amount
PSRLD xmm1, xmm2/m128 specified in xmm2/m128 while shifting in 0s.

NP 0F 72 /2 ib1 B V/V MMX Shift doublewords in mm right by imm8 while
PSRLD mm, imm8 shifting in 0s.

66 0F 72 /2 ib B V/V SSE2 Shift doublewords in xmm1 right by imm8
PSRLD xmm1, imm8 while shifting in 0s.

NP 0F D3 /r1 A V/V MMX Shift mm right by amount specified in
PSRLQ mm, mm/m64 mm/m64 while shifting in 0s.

66 0F D3 /r A V/V SSE2 Shift quadwords in xmm1 right by amount
PSRLQ xmm1, xmm2/m128 specified in xmm2/m128 while shifting in 0s.

NP 0F 73 /2 ib1 B V/V MMX Shift mm right by imm8 while shifting in 0s.
PSRLQ mm, imm8

66 0F 73 /2 ib B V/V SSE2 Shift quadwords in xmm1 right by imm8 while
PSRLQ xmm1, imm8 shifting in 0s.

VEX.128.66.0F.WIG D1 /r C V/V AVX Shift words in xmm2 right by amount specified
VPSRLW xmm1, xmm2, xmm3/m128 in xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 71 /2 ib D V/V AVX Shift words in xmm2 right by imm8 while
VPSRLW xmm1, xmm2, imm8 shifting in 0s.

VEX.128.66.0F.WIG D2 /r C V/V AVX Shift doublewords in xmm2 right by amount
VPSRLD xmm1, xmm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 72 /2 ib D V/V AVX Shift doublewords in xmm2 right by imm8
VPSRLD xmm1, xmm2, imm8 while shifting in 0s.

VEX.128.66.0F.WIG D3 /r C V/V AVX Shift quadwords in xmm2 right by amount
VPSRLQ xmm1, xmm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.128.66.0F.WIG 73 /2 ib D V/V AVX Shift quadwords in xmm2 right by imm8 while
VPSRLQ xmm1, xmm2, imm8 shifting in 0s.

VEX.256.66.0F.WIG D1 /r C V/V AVX2 Shift words in ymm2 right by amount specified
VPSRLW ymm1, ymm2, xmm3/m128 in xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 71 /2 ib D V/V AVX2 Shift words in ymm2 right by imm8 while
VPSRLW ymm1, ymm2, imm8 shifting in 0s.

VEX.256.66.0F.WIG D2 /r C V/V AVX2 Shift doublewords in ymm2 right by amount
VPSRLD ymm1, ymm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 72 /2 ib D V/V AVX2 Shift doublewords in ymm2 right by imm8


VPSRLD ymm1, ymm2, imm8 while shifting in 0s.

VEX.256.66.0F.WIG D3 /r C V/V AVX2 Shift quadwords in ymm2 right by amount


VPSRLQ ymm1, ymm2, xmm3/m128 specified in xmm3/m128 while shifting in 0s.

VEX.256.66.0F.WIG 73 /2 ib D V/V AVX2 Shift quadwords in ymm2 right by imm8 while


VPSRLQ ymm1, ymm2, imm8 shifting in 0s.

EVEX.128.66.0F.WIG D1 /r G V/V (AVX512VL AND Shift words in xmm2 right by amount specified
VPSRLW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR in xmm3/m128 while shifting in 0s using
AVX10.12 writemask k1.
EVEX.256.66.0F.WIG D1 /r G V/V (AVX512VL AND Shift words in ymm2 right by amount specified
VPSRLW ymm1 {k1}{z}, ymm2, xmm3/m128 AVX512BW) OR in xmm3/m128 while shifting in 0s using
AVX10.12 writemask k1.
EVEX.512.66.0F.WIG D1 /r G V/V AVX512BW OR Shift words in zmm2 right by amount specified
VPSRLW zmm1 {k1}{z}, zmm2, xmm3/m128 AVX10.12 in xmm3/m128 while shifting in 0s using
writemask k1.
EVEX.128.66.0F.WIG 71 /2 ib E V/V (AVX512VL AND Shift words in xmm2/m128 right by imm8
VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8 AVX512BW) OR while shifting in 0s using writemask k1.
AVX10.12
EVEX.256.66.0F.WIG 71 /2 ib E V/V (AVX512VL AND Shift words in ymm2/m256 right by imm8
VPSRLW ymm1 {k1}{z}, ymm2/m256, imm8 AVX512BW) OR while shifting in 0s using writemask k1.
AVX10.12
EVEX.512.66.0F.WIG 71 /2 ib E V/V AVX512BW OR Shift words in zmm2/m512 right by imm8
VPSRLW zmm1 {k1}{z}, zmm2/m512, imm8 AVX10.12 while shifting in 0s using writemask k1.
EVEX.128.66.0F.W0 D2 /r G V/V (AVX512VL AND Shift doublewords in xmm2 right by amount
VPSRLD xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
EVEX.256.66.0F.W0 D2 /r G V/V (AVX512VL AND Shift doublewords in ymm2 right by amount
VPSRLD ymm1 {k1}{z}, ymm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
EVEX.512.66.0F.W0 D2 /r G V/V AVX512F Shift doublewords in zmm2 right by amount
VPSRLD zmm1 {k1}{z}, zmm2, xmm3/m128 OR AVX10.12 specified in xmm3/m128 while shifting in 0s
using writemask k1.
EVEX.128.66.0F.W0 72 /2 ib F V/V (AVX512VL AND Shift doublewords in xmm2/m128/m32bcst
VPSRLD xmm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
xmm2/m128/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.256.66.0F.W0 72 /2 ib F V/V (AVX512VL AND Shift doublewords in ymm2/m256/m32bcst
VPSRLD ymm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
ymm2/m256/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.512.66.0F.W0 72 /2 ib F V/V AVX512F Shift doublewords in zmm2/m512/m32bcst
VPSRLD zmm1 {k1}{z}, OR AVX10.12 right by imm8 while shifting in 0s using
zmm2/m512/m32bcst, imm8 writemask k1.
EVEX.128.66.0F.W1 D3 /r G V/V (AVX512VL AND Shift quadwords in xmm2 right by amount
VPSRLQ xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.

EVEX.256.66.0F.W1 D3 /r G V/V (AVX512VL AND Shift quadwords in ymm2 right by amount
VPSRLQ ymm1 {k1}{z}, ymm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
EVEX.512.66.0F.W1 D3 /r G V/V AVX512F Shift quadwords in zmm2 right by amount
VPSRLQ zmm1 {k1}{z}, zmm2, xmm3/m128 OR AVX10.12 specified in xmm3/m128 while shifting in 0s
using writemask k1.
EVEX.128.66.0F.W1 73 /2 ib F V/V (AVX512VL AND Shift quadwords in xmm2/m128/m64bcst
VPSRLQ xmm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
xmm2/m128/m64bcst, imm8 AVX10.12 writemask k1.
EVEX.256.66.0F.W1 73 /2 ib F V/V (AVX512VL AND Shift quadwords in ymm2/m256/m64bcst
VPSRLQ ymm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
ymm2/m256/m64bcst, imm8 AVX10.12 writemask k1.
EVEX.512.66.0F.W1 73 /2 ib F V/V AVX512F Shift quadwords in zmm2/m512/m64bcst
VPSRLQ zmm1 {k1}{z}, OR AVX10.12 right by imm8 while shifting in 0s using
zmm2/m512/m64bcst, imm8 writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:r/m (r, w) imm8 N/A N/A
C N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
D N/A VEX.vvvv (w) ModRM:r/m (r) imm8 N/A
E Full Mem EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
F Full EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
G Mem128 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadword) in the destination operand (first
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted right, the empty high-order bits are cleared (set to 0). If the value specified by the count
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand
is set to all 0s. Figure 4-19 gives an example of shifting words in a 64-bit operand.
Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.
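
The following fragment (an editorial sketch, not SDM text; requires SSE2) demonstrates both points: only the low quadword of the count register is consulted, and a count greater than the element width clears the destination:

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i x = _mm_set1_epi32((int)0x80000000u);
    /* Low qword of the count is 4; the high qword is ignored. */
    __m128i cnt4  = _mm_set_epi64x(0x1234567812345678LL, 4);
    __m128i cnt40 = _mm_set_epi64x(0, 40);  /* 40 > 31: result is all 0s */
    __m128i r1 = _mm_srl_epi32(x, cnt4);    /* each dword -> 08000000H */
    __m128i r2 = _mm_srl_epi32(x, cnt40);   /* each dword -> 00000000H */
    printf("%08X %08X\n", (unsigned)_mm_cvtsi128_si32(r1),
           (unsigned)_mm_cvtsi128_si32(r2));
    return 0;
}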



[Figure: 64-bit operand case. Pre-shift, DEST holds words X3 X2 X1 X0; post-shift, DEST holds X3 >> COUNT, X2 >> COUNT, X1 >> COUNT, X0 >> COUNT, each shifted right with zero extension.]

Figure 4-19. PSRLW, PSRLD, and PSRLQ Instruction Operation Using 64-bit Operand

The (V)PSRLW instruction shifts each of the words in the destination operand to the right by the number of bits
specified in the count operand; the (V)PSRLD instruction shifts each of the doublewords in the destination operand;
and the (V)PSRLQ instruction shifts the quadword (or quadwords) in the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instruction 64-bit operand: The destination operand is an MMX technology register; the count operand
can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (MAXVL-1:128) of the corresponding YMM destination register
remain unchanged.
VEX.128 encoded version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (MAXVL-1:128) of the destination YMM register are zeroed.
VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.
Note: In VEX/EVEX-encoded versions of shifts with an immediate count (VEX.128.66.0F 71-73 /2 or
EVEX.128.66.0F 71-73 /2), VEX.vvvv/EVEX.vvvv encodes the destination register, and VEX.B/EVEX.B + ModRM.r/m
encodes the source register.

Operation
PSRLW (With 64-bit Operand)
IF (COUNT > 15)
THEN
DEST[63:0] := 0000000000000000H
ELSE
DEST[15:0] := ZeroExtend(DEST[15:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := ZeroExtend(DEST[63:48] >> COUNT);
FI;



PSRLD (With 64-bit Operand)
IF (COUNT > 31)
THEN
DEST[63:0] := 0000000000000000H
ELSE
DEST[31:0] := ZeroExtend(DEST[31:0] >> COUNT);
DEST[63:32] := ZeroExtend(DEST[63:32] >> COUNT);
FI;

PSRLQ (With 64-bit Operand)


IF (COUNT > 63)
THEN
DEST[63:0] := 0000000000000000H
ELSE
DEST := ZeroExtend(DEST >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[31:0] := 0
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[255:0] := 0
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] := ZeroExtend(SRC[255:240] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := ZeroExtend(SRC[127:112] >> COUNT);
FI;



LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[255:0] := 0
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th doublewords *)
DEST[255:224] := ZeroExtend(SRC[255:224] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd doublewords *)
DEST[127:96] := ZeroExtend(SRC[127:96] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[255:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] >> COUNT);
DEST[191:128] := ZeroExtend(SRC[191:128] >> COUNT);
DEST[255:192] := ZeroExtend(SRC[255:192] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] >> COUNT);
FI;
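
A scalar C model of the word helper above may clarify the count handling (an editorial sketch; the function name srl_word is chosen here for illustration). Note that the count does not wrap modulo the element width; a count of 16 or more clears the element, which the explicit guard reproduces where C's >> alone would not:

#include <stdint.h>

/* Models one lane of LOGICAL_RIGHT_SHIFT_WORDS: counts above 15
   force the result to zero instead of reducing modulo 16. */
static uint16_t srl_word(uint16_t src, uint64_t count) {
    return (count > 15) ? 0 : (uint16_t)(src >> count);
}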

VPSRLW (EVEX Versions, xmm/m128)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] = 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRLW (EVEX Versions, imm8)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(SRC1[127:0], imm8)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
TMP_DEST[511:256] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] = 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
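
The per-element writemask handling in the loop above can be modeled in scalar C as follows (an editorial sketch; mask_update_words and its parameters are illustrative names only):

#include <stdint.h>

/* Models the EVEX writemask step for word elements: a set bit in k1
   writes the shifted result; a clear bit either preserves the old
   destination element (merging-masking) or zeroes it (zeroing-masking). */
static void mask_update_words(uint16_t *dest, const uint16_t *tmp,
                              uint32_t k1, int kl, int zeroing) {
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1)
            dest[j] = tmp[j];
        else if (zeroing)
            dest[j] = 0;
        /* else merging-masking: dest[j] is left unchanged */
    }
}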

VPSRLW (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSRLW (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;



VPSRLW (xmm, xmm, xmm/m128) - VEX.128 Encoding
DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSRLW (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSRLW (xmm, xmm/m128)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSRLW (xmm, imm8)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSRLD (EVEX Versions, xmm/m128)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_RIGHT_SHIFT_DWORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRLD (EVEX Versions, imm8)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+31:i] := LOGICAL_RIGHT_SHIFT_DWORDS1(SRC1[31:0], imm8)
ELSE DEST[i+31:i] := LOGICAL_RIGHT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRLD (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSRLD (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;

VPSRLD (xmm, xmm, xmm/m128) - VEX.128 Encoding


DEST[127:0] := LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSRLD (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSRLD (xmm, xmm/m128)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSRLD (xmm, imm8)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

VPSRLQ (EVEX Versions, xmm/m128)


(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
TMP_DEST[127:0] := LOGICAL_RIGHT_SHIFT_QWORDS(SRC1[127:0], SRC2)
FI;
IF VL = 256
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
TMP_DEST[255:0] := LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
TMP_DEST[511:256] := LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[511:256], SRC2)
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRLQ (EVEX Versions, imm8)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+63:i] := LOGICAL_RIGHT_SHIFT_QWORDS1(SRC1[63:0], imm8)
ELSE DEST[i+63:i] := LOGICAL_RIGHT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPSRLQ (ymm, ymm, xmm/m128) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0;

VPSRLQ (ymm, imm8) - VEX.256 Encoding


DEST[255:0] := LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1, imm8)
DEST[MAXVL-1:256] := 0;

VPSRLQ (xmm, xmm, xmm/m128) - VEX.128 Encoding
DEST[127:0] := LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPSRLQ (xmm, imm8) - VEX.128 Encoding


DEST[127:0] := LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, imm8)
DEST[MAXVL-1:128] := 0

PSRLQ (xmm, xmm/m128)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_QWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

PSRLQ (xmm, imm8)


DEST[127:0] := LOGICAL_RIGHT_SHIFT_QWORDS(DEST, imm8)
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents


VPSRLD __m512i _mm512_srli_epi32(__m512i a, unsigned int imm);
VPSRLD __m512i _mm512_mask_srli_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);
VPSRLD __m512i _mm512_maskz_srli_epi32( __mmask16 k, __m512i a, unsigned int imm);
VPSRLD __m256i _mm256_mask_srli_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRLD __m256i _mm256_maskz_srli_epi32( __mmask8 k, __m256i a, unsigned int imm);
VPSRLD __m128i _mm_mask_srli_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRLD __m128i _mm_maskz_srli_epi32( __mmask8 k, __m128i a, unsigned int imm);
VPSRLD __m512i _mm512_srl_epi32(__m512i a, __m128i cnt);
VPSRLD __m512i _mm512_mask_srl_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSRLD __m512i _mm512_maskz_srl_epi32( __mmask16 k, __m512i a, __m128i cnt);
VPSRLD __m256i _mm256_mask_srl_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRLD __m256i _mm256_maskz_srl_epi32( __mmask8 k, __m256i a, __m128i cnt);
VPSRLD __m128i _mm_mask_srl_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLD __m128i _mm_maskz_srl_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSRLQ __m512i _mm512_srli_epi64(__m512i a, unsigned int imm);
VPSRLQ __m512i _mm512_mask_srli_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);
VPSRLQ __m512i _mm512_maskz_srli_epi64( __mmask8 k, __m512i a, unsigned int imm);
VPSRLQ __m256i _mm256_mask_srli_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRLQ __m256i _mm256_maskz_srli_epi64( __mmask8 k, __m256i a, unsigned int imm);
VPSRLQ __m128i _mm_mask_srli_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRLQ __m128i _mm_maskz_srli_epi64( __mmask8 k, __m128i a, unsigned int imm);
VPSRLQ __m512i _mm512_srl_epi64(__m512i a, __m128i cnt);
VPSRLQ __m512i _mm512_mask_srl_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);
VPSRLQ __m512i _mm512_maskz_srl_epi64( __mmask8 k, __m512i a, __m128i cnt);
VPSRLQ __m256i _mm256_mask_srl_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRLQ __m256i _mm256_maskz_srl_epi64( __mmask8 k, __m256i a, __m128i cnt);
VPSRLQ __m128i _mm_mask_srl_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLQ __m128i _mm_maskz_srl_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSRLW __m512i _mm512_srli_epi16(__m512i a, unsigned int imm);
VPSRLW __m512i _mm512_mask_srli_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);
VPSRLW __m512i _mm512_maskz_srli_epi16( __mmask32 k, __m512i a, unsigned int imm);
VPSRLW __m256i _mm256_mask_srli_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);
VPSRLW __m256i _mm256_maskz_srli_epi16( __mmask16 k, __m256i a, unsigned int imm);
VPSRLW __m128i _mm_mask_srli_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRLW __m128i _mm_maskz_srli_epi16( __mmask8 k, __m128i a, unsigned int imm);
VPSRLW __m512i _mm512_srl_epi16(__m512i a, __m128i cnt);
VPSRLW __m512i _mm512_mask_srl_epi16(__m512i s, __mmask32 k, __m512i a, __m128i cnt);
VPSRLW __m512i _mm512_maskz_srl_epi16( __mmask32 k, __m512i a, __m128i cnt);
VPSRLW __m256i _mm256_mask_srl_epi16(__m256i s, __mmask16 k, __m256i a, __m128i cnt);
VPSRLW __m256i _mm256_maskz_srl_epi16( __mmask16 k, __m256i a, __m128i cnt);
VPSRLW __m128i _mm_mask_srl_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLW __m128i _mm_maskz_srl_epi16( __mmask8 k, __m128i a, __m128i cnt);
PSRLW __m64 _mm_srli_pi16(__m64 m, int count)
PSRLW __m64 _mm_srl_pi16 (__m64 m, __m64 count)
(V)PSRLW __m128i _mm_srli_epi16 (__m128i m, int count)
(V)PSRLW __m128i _mm_srl_epi16 (__m128i m, __m128i count)
VPSRLW __m256i _mm256_srli_epi16 (__m256i m, int count)
VPSRLW __m256i _mm256_srl_epi16 (__m256i m, __m128i count)
PSRLD __m64 _mm_srli_pi32 (__m64 m, int count)
PSRLD __m64 _mm_srl_pi32 (__m64 m, __m64 count)
(V)PSRLD __m128i _mm_srli_epi32 (__m128i m, int count)
(V)PSRLD __m128i _mm_srl_epi32 (__m128i m, __m128i count)
VPSRLD __m256i _mm256_srli_epi32 (__m256i m, int count)
VPSRLD __m256i _mm256_srl_epi32 (__m256i m, __m128i count)
PSRLQ __m64 _mm_srli_si64 (__m64 m, int count)
PSRLQ __m64 _mm_srl_si64 (__m64 m, __m64 count)
(V)PSRLQ __m128i _mm_srli_epi64 (__m128i m, int count)
(V)PSRLQ __m128i _mm_srl_epi64 (__m128i m, __m128i count)
VPSRLQ __m256i _mm256_srli_epi64 (__m256i m, int count)
VPSRLQ __m256i _mm256_srl_epi64 (__m256i m, __m128i count)
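
As a usage sketch (editorial, not SDM text; assumes a compiler targeting AVX-512F and AVX-512VL, or AVX10), the masked immediate form composes as follows:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi32(0x100, 0x200, 0x300, 0x400);
    /* k = 0101b: shift elements 0 and 2; zero elements 1 and 3. */
    __m128i r = _mm_maskz_srli_epi32(0x5, a, 4);
    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%x %x %x %x\n", out[0], out[1], out[2], out[3]); /* 10 0 30 0 */
    return 0;
}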

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSRLW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSRLD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”



PSUBB/PSUBW/PSUBD—Subtract Packed Integers
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F F8 /r1 A V/V MMX Subtract packed byte integers in mm/m64
PSUBB mm, mm/m64 from packed byte integers in mm.

66 0F F8 /r A V/V SSE2 Subtract packed byte integers in xmm2/m128
PSUBB xmm1, xmm2/m128 from packed byte integers in xmm1.

NP 0F F9 /r1 A V/V MMX Subtract packed word integers in mm/m64
PSUBW mm, mm/m64 from packed word integers in mm.

66 0F F9 /r A V/V SSE2 Subtract packed word integers in
PSUBW xmm1, xmm2/m128 xmm2/m128 from packed word integers in
xmm1.

NP 0F FA /r1 A V/V MMX Subtract packed doubleword integers in
PSUBD mm, mm/m64 mm/m64 from packed doubleword integers in
mm.

66 0F FA /r A V/V SSE2 Subtract packed doubleword integers in
PSUBD xmm1, xmm2/m128 xmm2/m128 from packed doubleword
integers in xmm1.

VEX.128.66.0F.WIG F8 /r B V/V AVX Subtract packed byte integers in xmm3/m128
VPSUBB xmm1, xmm2, xmm3/m128 from xmm2.

VEX.128.66.0F.WIG F9 /r B V/V AVX Subtract packed word integers in
VPSUBW xmm1, xmm2, xmm3/m128 xmm3/m128 from xmm2.

VEX.128.66.0F.WIG FA /r B V/V AVX Subtract packed doubleword integers in
VPSUBD xmm1, xmm2, xmm3/m128 xmm3/m128 from xmm2.
VEX.256.66.0F.WIG F8 /r B V/V AVX2 Subtract packed byte integers in ymm3/m256
VPSUBB ymm1, ymm2, ymm3/m256 from ymm2.
VEX.256.66.0F.WIG F9 /r B V/V AVX2 Subtract packed word integers in
VPSUBW ymm1, ymm2, ymm3/m256 ymm3/m256 from ymm2.
VEX.256.66.0F.WIG FA /r B V/V AVX2 Subtract packed doubleword integers in
VPSUBD ymm1, ymm2, ymm3/m256 ymm3/m256 from ymm2.
EVEX.128.66.0F.WIG F8 /r C V/V (AVX512VL AND Subtract packed byte integers in xmm3/m128
VPSUBB xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR from xmm2 and store in xmm1 using
AVX10.12 writemask k1.
EVEX.256.66.0F.WIG F8 /r C V/V (AVX512VL AND Subtract packed byte integers in ymm3/m256
VPSUBB ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR from ymm2 and store in ymm1 using
AVX10.12 writemask k1.
EVEX.512.66.0F.WIG F8 /r C V/V AVX512BW Subtract packed byte integers in zmm3/m512
VPSUBB zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 from zmm2 and store in zmm1 using
writemask k1.
EVEX.128.66.0F.WIG F9 /r C V/V (AVX512VL AND Subtract packed word integers in
VPSUBW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR xmm3/m128 from xmm2 and store in xmm1
AVX10.12 using writemask k1.
EVEX.256.66.0F.WIG F9 /r C V/V (AVX512VL AND Subtract packed word integers in
VPSUBW ymm1 {k1}{z}, ymm2, ymm3/m256 AVX512BW) OR ymm3/m256 from ymm2 and store in ymm1
AVX10.12 using writemask k1.
EVEX.512.66.0F.WIG F9 /r C V/V AVX512BW Subtract packed word integers in
VPSUBW zmm1 {k1}{z}, zmm2, zmm3/m512 OR AVX10.12 zmm3/m512 from zmm2 and store in zmm1
using writemask k1.

EVEX.128.66.0F.W0 FA /r D V/V (AVX512VL AND Subtract packed doubleword integers in
VPSUBD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst from xmm2 and store
xmm3/m128/m32bcst AVX10.12 in xmm1 using writemask k1.
EVEX.256.66.0F.W0 FA /r D V/V (AVX512VL AND Subtract packed doubleword integers in
VPSUBD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst from ymm2 and store
ymm3/m256/m32bcst AVX10.12 in ymm1 using writemask k1.
EVEX.512.66.0F.W0 FA /r D V/V AVX512F Subtract packed doubleword integers in
VPSUBD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512/m32bcst from zmm2 and store
zmm3/m512/m32bcst in zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD subtract of the packed integers of the source operand (second operand) from the packed integers
of the destination operand (first operand), and stores the packed integer results in the destination operand. See
Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a
SIMD operation. Overflow is handled with wraparound, as described in the following paragraphs.
The (V)PSUBB instruction subtracts packed byte integers. When an individual result is too large or too small to be
represented in a byte, the result is wrapped around and the low 8 bits are written to the destination element.
The (V)PSUBW instruction subtracts packed word integers. When an individual result is too large or too small to be
represented in a word, the result is wrapped around and the low 16 bits are written to the destination element.
The (V)PSUBD instruction subtracts packed doubleword integers. When an individual result is too large or too small
to be represented in a doubleword, the result is wrapped around and the low 32 bits are written to the destination
element.
Note that the (V)PSUBB, (V)PSUBW, and (V)PSUBD instructions can operate on either unsigned or signed (two's
complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate overflow
and/or a carry. To prevent undetected overflow conditions, software must control the ranges of values upon which
it operates.
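
For example (an editorial sketch, not SDM text; requires SSE2), the byte wraparound is directly observable:

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8(-128);    /* 80H, the most negative byte */
    __m128i b = _mm_set1_epi8(1);
    __m128i r = _mm_sub_epi8(a, b);     /* 80H - 01H wraps to 7FH */
    printf("%d\n", (signed char)_mm_cvtsi128_si32(r));  /* prints 127 */
    return 0;
}
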
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.



128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded VPSUBD: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source operand and
destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with writemask k1.
EVEX encoded VPSUBB/W: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.

Operation
PSUBB (With 64-bit Operands)
DEST[7:0] := DEST[7:0] − SRC[7:0];
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := DEST[63:56] − SRC[63:56];

PSUBW (With 64-bit Operands)


DEST[15:0] := DEST[15:0] − SRC[15:0];
(* Repeat subtract operation for 2nd and 3rd words *)
DEST[63:48] := DEST[63:48] − SRC[63:48];

PSUBD (With 64-bit Operands)


DEST[31:0] := DEST[31:0] − SRC[31:0];
DEST[63:32] := DEST[63:32] − SRC[63:32];

PSUBD (With 128-bit Operands)


DEST[31:0] := DEST[31:0] − SRC[31:0];
(* Repeat subtract operation for 2nd and 3rd doublewords *)
DEST[127:96] := DEST[127:96] − SRC[127:96];

VPSUBB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC1[i+7:i] - SRC2[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] = 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0



VPSUBW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC1[i+15:i] - SRC2[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] = 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPSUBD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] - SRC2[31:0]
ELSE DEST[i+31:i] := SRC1[i+31:i] - SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPSUBB (VEX.256 Encoded Version)


DEST[7:0] := SRC1[7:0]-SRC2[7:0]
DEST[15:8] := SRC1[15:8]-SRC2[15:8]
DEST[23:16] := SRC1[23:16]-SRC2[23:16]
DEST[31:24] := SRC1[31:24]-SRC2[31:24]
DEST[39:32] := SRC1[39:32]-SRC2[39:32]
DEST[47:40] := SRC1[47:40]-SRC2[47:40]
DEST[55:48] := SRC1[55:48]-SRC2[55:48]
DEST[63:56] := SRC1[63:56]-SRC2[63:56]
DEST[71:64] := SRC1[71:64]-SRC2[71:64]
DEST[79:72] := SRC1[79:72]-SRC2[79:72]
DEST[87:80] := SRC1[87:80]-SRC2[87:80]
DEST[95:88] := SRC1[95:88]-SRC2[95:88]
DEST[103:96] := SRC1[103:96]-SRC2[103:96]
DEST[111:104] := SRC1[111:104]-SRC2[111:104]
DEST[119:112] := SRC1[119:112]-SRC2[119:112]
DEST[127:120] := SRC1[127:120]-SRC2[127:120]
DEST[135:128] := SRC1[135:128]-SRC2[135:128]
DEST[143:136] := SRC1[143:136]-SRC2[143:136]
DEST[151:144] := SRC1[151:144]-SRC2[151:144]
DEST[159:152] := SRC1[159:152]-SRC2[159:152]
DEST[167:160] := SRC1[167:160]-SRC2[167:160]
DEST[175:168] := SRC1[175:168]-SRC2[175:168]
DEST[183:176] := SRC1[183:176]-SRC2[183:176]
DEST[191:184] := SRC1[191:184]-SRC2[191:184]
DEST[199:192] := SRC1[199:192]-SRC2[199:192]
DEST[207:200] := SRC1[207:200]-SRC2[207:200]
DEST[215:208] := SRC1[215:208]-SRC2[215:208]
DEST[223:216] := SRC1[223:216]-SRC2[223:216]
DEST[231:224] := SRC1[231:224]-SRC2[231:224]
DEST[239:232] := SRC1[239:232]-SRC2[239:232]
DEST[247:240] := SRC1[247:240]-SRC2[247:240]
DEST[255:248] := SRC1[255:248]-SRC2[255:248]
DEST[MAXVL-1:256] := 0

VPSUBB (VEX.128 Encoded Version)


DEST[7:0] := SRC1[7:0]-SRC2[7:0]
DEST[15:8] := SRC1[15:8]-SRC2[15:8]
DEST[23:16] := SRC1[23:16]-SRC2[23:16]
DEST[31:24] := SRC1[31:24]-SRC2[31:24]
DEST[39:32] := SRC1[39:32]-SRC2[39:32]
DEST[47:40] := SRC1[47:40]-SRC2[47:40]
DEST[55:48] := SRC1[55:48]-SRC2[55:48]
DEST[63:56] := SRC1[63:56]-SRC2[63:56]
DEST[71:64] := SRC1[71:64]-SRC2[71:64]
DEST[79:72] := SRC1[79:72]-SRC2[79:72]
DEST[87:80] := SRC1[87:80]-SRC2[87:80]
DEST[95:88] := SRC1[95:88]-SRC2[95:88]
DEST[103:96] := SRC1[103:96]-SRC2[103:96]
DEST[111:104] := SRC1[111:104]-SRC2[111:104]
DEST[119:112] := SRC1[119:112]-SRC2[119:112]
DEST[127:120] := SRC1[127:120]-SRC2[127:120]
DEST[MAXVL-1:128] := 0

PSUBB (128-bit Legacy SSE Version)


DEST[7:0] := DEST[7:0]-SRC[7:0]
DEST[15:8] := DEST[15:8]-SRC[15:8]
DEST[23:16] := DEST[23:16]-SRC[23:16]
DEST[31:24] := DEST[31:24]-SRC[31:24]
DEST[39:32] := DEST[39:32]-SRC[39:32]
DEST[47:40] := DEST[47:40]-SRC[47:40]
DEST[55:48] := DEST[55:48]-SRC[55:48]
DEST[63:56] := DEST[63:56]-SRC[63:56]
DEST[71:64] := DEST[71:64]-SRC[71:64]
DEST[79:72] := DEST[79:72]-SRC[79:72]
DEST[87:80] := DEST[87:80]-SRC[87:80]
DEST[95:88] := DEST[95:88]-SRC[95:88]
DEST[103:96] := DEST[103:96]-SRC[103:96]
DEST[111:104] := DEST[111:104]-SRC[111:104]
DEST[119:112] := DEST[119:112]-SRC[119:112]
DEST[127:120] := DEST[127:120]-SRC[127:120]
DEST[MAXVL-1:128] (Unmodified)



VPSUBW (VEX.256 Encoded Version)
DEST[15:0] := SRC1[15:0]-SRC2[15:0]
DEST[31:16] := SRC1[31:16]-SRC2[31:16]
DEST[47:32] := SRC1[47:32]-SRC2[47:32]
DEST[63:48] := SRC1[63:48]-SRC2[63:48]
DEST[79:64] := SRC1[79:64]-SRC2[79:64]
DEST[95:80] := SRC1[95:80]-SRC2[95:80]
DEST[111:96] := SRC1[111:96]-SRC2[111:96]
DEST[127:112] := SRC1[127:112]-SRC2[127:112]
DEST[143:128] := SRC1[143:128]-SRC2[143:128]
DEST[159:144] := SRC1[159:144]-SRC2[159:144]
DEST[175:160] := SRC1[175:160]-SRC2[175:160]
DEST[191:176] := SRC1[191:176]-SRC2[191:176]
DEST[207:192] := SRC1[207:192]-SRC2[207:192]
DEST[223:208] := SRC1[223:208]-SRC2[223:208]
DEST[239:224] := SRC1[239:224]-SRC2[239:224]
DEST[255:240] := SRC1[255:240]-SRC2[255:240]
DEST[MAXVL-1:256] := 0

VPSUBW (VEX.128 Encoded Version)


DEST[15:0] := SRC1[15:0]-SRC2[15:0]
DEST[31:16] := SRC1[31:16]-SRC2[31:16]
DEST[47:32] := SRC1[47:32]-SRC2[47:32]
DEST[63:48] := SRC1[63:48]-SRC2[63:48]
DEST[79:64] := SRC1[79:64]-SRC2[79:64]
DEST[95:80] := SRC1[95:80]-SRC2[95:80]
DEST[111:96] := SRC1[111:96]-SRC2[111:96]
DEST[127:112] := SRC1[127:112]-SRC2[127:112]
DEST[MAXVL-1:128] := 0

PSUBW (128-bit Legacy SSE Version)


DEST[15:0] := DEST[15:0]-SRC[15:0]
DEST[31:16] := DEST[31:16]-SRC[31:16]
DEST[47:32] := DEST[47:32]-SRC[47:32]
DEST[63:48] := DEST[63:48]-SRC[63:48]
DEST[79:64] := DEST[79:64]-SRC[79:64]
DEST[95:80] := DEST[95:80]-SRC[95:80]
DEST[111:96] := DEST[111:96]-SRC[111:96]
DEST[127:112] := DEST[127:112]-SRC[127:112]
DEST[MAXVL-1:128] (Unmodified)

VPSUBD (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0]-SRC2[31:0]
DEST[63:32] := SRC1[63:32]-SRC2[63:32]
DEST[95:64] := SRC1[95:64]-SRC2[95:64]
DEST[127:96] := SRC1[127:96]-SRC2[127:96]
DEST[159:128] := SRC1[159:128]-SRC2[159:128]
DEST[191:160] := SRC1[191:160]-SRC2[191:160]
DEST[223:192] := SRC1[223:192]-SRC2[223:192]
DEST[255:224] := SRC1[255:224]-SRC2[255:224]
DEST[MAXVL-1:256] := 0



VPSUBD (VEX.128 Encoded Version)
DEST[31:0] := SRC1[31:0]-SRC2[31:0]
DEST[63:32] := SRC1[63:32]-SRC2[63:32]
DEST[95:64] := SRC1[95:64]-SRC2[95:64]
DEST[127:96] := SRC1[127:96]-SRC2[127:96]
DEST[MAXVL-1:128] := 0

PSUBD (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0]-SRC[31:0]
DEST[63:32] := DEST[63:32]-SRC[63:32]
DEST[95:64] := DEST[95:64]-SRC[95:64]
DEST[127:96] := DEST[127:96]-SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents


VPSUBB __m512i _mm512_sub_epi8(__m512i a, __m512i b);
VPSUBB __m512i _mm512_mask_sub_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPSUBB __m512i _mm512_maskz_sub_epi8( __mmask64 k, __m512i a, __m512i b);
VPSUBB __m256i _mm256_mask_sub_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPSUBB __m256i _mm256_maskz_sub_epi8( __mmask32 k, __m256i a, __m256i b);
VPSUBB __m128i _mm_mask_sub_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPSUBB __m128i _mm_maskz_sub_epi8( __mmask16 k, __m128i a, __m128i b);
VPSUBW __m512i _mm512_sub_epi16(__m512i a, __m512i b);
VPSUBW __m512i _mm512_mask_sub_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPSUBW __m512i _mm512_maskz_sub_epi16( __mmask32 k, __m512i a, __m512i b);
VPSUBW __m256i _mm256_mask_sub_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPSUBW __m256i _mm256_maskz_sub_epi16( __mmask16 k, __m256i a, __m256i b);
VPSUBW __m128i _mm_mask_sub_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBW __m128i _mm_maskz_sub_epi16( __mmask8 k, __m128i a, __m128i b);
VPSUBD __m512i _mm512_sub_epi32(__m512i a, __m512i b);
VPSUBD __m512i _mm512_mask_sub_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPSUBD __m512i _mm512_maskz_sub_epi32( __mmask16 k, __m512i a, __m512i b);
VPSUBD __m256i _mm256_mask_sub_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPSUBD __m256i _mm256_maskz_sub_epi32( __mmask8 k, __m256i a, __m256i b);
VPSUBD __m128i _mm_mask_sub_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBD __m128i _mm_maskz_sub_epi32( __mmask8 k, __m128i a, __m128i b);
PSUBB __m64 _mm_sub_pi8(__m64 m1, __m64 m2)
(V)PSUBB __m128i _mm_sub_epi8 ( __m128i a, __m128i b)
VPSUBB __m256i _mm256_sub_epi8 ( __m256i a, __m256i b)
PSUBW __m64 _mm_sub_pi16(__m64 m1, __m64 m2)
(V)PSUBW __m128i _mm_sub_epi16 ( __m128i a, __m128i b)
VPSUBW __m256i _mm256_sub_epi16 ( __m256i a, __m256i b)
PSUBD __m64 _mm_sub_pi32(__m64 m1, __m64 m2)
(V)PSUBD __m128i _mm_sub_epi32 ( __m128i a, __m128i b)
VPSUBD __m256i _mm256_sub_epi32 ( __m256i a, __m256i b)
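
A brief usage sketch of one masked form (editorial, not SDM text; assumes AVX-512F with AVX-512VL, or AVX10):

#include <immintrin.h>

/* Subtracts b from a in doubleword lanes; lanes whose bit in k is
   clear receive the corresponding element of src instead
   (merging-masking), matching the operation section above. */
static __m128i sub_dwords_merge(__m128i src, __mmask8 k,
                                __m128i a, __m128i b) {
    return _mm_mask_sub_epi32(src, k, a, b);
}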

Flags Affected
None.

Numeric Exceptions
None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSUBD, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPSUBB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



PSUBQ—Subtract Packed Quadword Integers
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F FB /r1 A V/V SSE2 Subtract quadword integer in mm2/m64 from
PSUBQ mm1, mm2/m64 mm1.

66 0F FB /r A V/V SSE2 Subtract packed quadword integers in
PSUBQ xmm1, xmm2/m128 xmm2/m128 from xmm1.

VEX.128.66.0F.WIG FB /r B V/V AVX Subtract packed quadword integers in
VPSUBQ xmm1, xmm2, xmm3/m128 xmm3/m128 from xmm2.

VEX.256.66.0F.WIG FB /r B V/V AVX2 Subtract packed quadword integers in
VPSUBQ ymm1, ymm2, ymm3/m256 ymm3/m256 from ymm2.

EVEX.128.66.0F.W1 FB /r C V/V (AVX512VL AND Subtract packed quadword integers in
VPSUBQ xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m64bcst from xmm2 and store
xmm3/m128/m64bcst AVX10.12 in xmm1 using writemask k1.
EVEX.256.66.0F.W1 FB /r C V/V (AVX512VL AND Subtract packed quadword integers in
VPSUBQ ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m64bcst from ymm2 and store
ymm3/m256/m64bcst AVX10.12 in ymm1 using writemask k1.
EVEX.512.66.0F.W1 FB/r C V/V AVX512F Subtract packed quadword integers in
VPSUBQ zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512/m64bcst from zmm2 and store
zmm3/m512/m64bcst in zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Subtracts the second operand (source operand) from the first operand (destination operand) and stores the result
in the destination operand. When packed quadword operands are used, a SIMD subtract is performed. When a
quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64
bits are written to the destination element (that is, the carry is ignored).
Note that the (V)PSUBQ instruction can operate on either unsigned or signed (two’s complement notation) inte-
gers; however, it does not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected
overflow conditions, software must control the ranges of the values upon which it operates.
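
As an illustration (editorial, not SDM text; requires SSE2), quadword wraparound behaves as described:

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi64x(INT64_MIN);  /* 8000000000000000H */
    __m128i b = _mm_set1_epi64x(1);
    __m128i r = _mm_sub_epi64(a, b);         /* wraps to 7FFFFFFFFFFFFFFFH */
    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%lld\n", (long long)out[0]);     /* prints INT64_MAX */
    return 0;
}
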
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be a quadword integer stored in an MMX technology
register or a 64-bit memory location.



128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded VPSUBQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source operand and
destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with writemask k1.

Operation
PSUBQ (With 64-Bit Operands)
DEST[63:0] := DEST[63:0] − SRC[63:0];

PSUBQ (With 128-Bit Operands)


DEST[63:0] := DEST[63:0] − SRC[63:0];
DEST[127:64] := DEST[127:64] − SRC[127:64];

VPSUBQ (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0]-SRC2[63:0]
DEST[127:64] := SRC1[127:64]-SRC2[127:64]
DEST[MAXVL-1:128] := 0

VPSUBQ (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0]-SRC2[63:0]
DEST[127:64] := SRC1[127:64]-SRC2[127:64]
DEST[191:128] := SRC1[191:128]-SRC2[191:128]
DEST[255:192] := SRC1[255:192]-SRC2[255:192]
DEST[MAXVL-1:256] := 0

VPSUBQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SRC1[i+63:i] - SRC2[63:0]
ELSE DEST[i+63:i] := SRC1[i+63:i] - SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
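
The embedded-broadcast path in the loop above can be modeled in scalar C like this (an editorial sketch; vpsubq_model is an illustrative name):

#include <stdint.h>

/* Models VPSUBQ's source fetch: with EVEX.b set and a memory source,
   every lane subtracts mem[0]; otherwise lane j subtracts mem[j]. */
static void vpsubq_model(int64_t *dest, const int64_t *src1,
                         const int64_t *mem, int kl, int broadcast) {
    for (int j = 0; j < kl; j++)
        dest[j] = src1[j] - (broadcast ? mem[0] : mem[j]);
}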



Intel C/C++ Compiler Intrinsic Equivalents
VPSUBQ __m512i _mm512_sub_epi64(__m512i a, __m512i b);
VPSUBQ __m512i _mm512_mask_sub_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPSUBQ __m512i _mm512_maskz_sub_epi64( __mmask8 k, __m512i a, __m512i b);
VPSUBQ __m256i _mm256_mask_sub_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPSUBQ __m256i _mm256_maskz_sub_epi64( __mmask8 k, __m256i a, __m256i b);
VPSUBQ __m128i _mm_mask_sub_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBQ __m128i _mm_maskz_sub_epi64( __mmask8 k, __m128i a, __m128i b);
PSUBQ __m64 _mm_sub_si64(__m64 m1, __m64 m2)
(V)PSUBQ __m128i _mm_sub_epi64(__m128i m1, __m128i m2)
VPSUBQ __m256i _mm256_sub_epi64(__m256i m1, __m256i m2)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSUBQ, see Table 2-51, “Type E4 Class Exception Conditions.”



PSUBSB/PSUBSW—Subtract Packed Signed Integers With Signed Saturation
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E8 /r1 A V/V MMX Subtract signed packed bytes in mm/m64 from
PSUBSB mm, mm/m64 signed packed bytes in mm and saturate results.

66 0F E8 /r A V/V SSE2 Subtract packed signed byte integers in
PSUBSB xmm1, xmm2/m128 xmm2/m128 from packed signed byte integers in
xmm1 and saturate results.
NP 0F E9 /r1 A V/V MMX Subtract signed packed words in mm/m64 from
PSUBSW mm, mm/m64 signed packed words in mm and saturate results.
66 0F E9 /r A V/V SSE2 Subtract packed signed word integers in
PSUBSW xmm1, xmm2/m128 xmm2/m128 from packed signed word integers in
xmm1 and saturate results.
VEX.128.66.0F.WIG E8 /r B V/V AVX Subtract packed signed byte integers in
VPSUBSB xmm1, xmm2, xmm3/m128 xmm3/m128 from packed signed byte integers in
xmm2 and saturate results.
VEX.128.66.0F.WIG E9 /r B V/V AVX Subtract packed signed word integers in
VPSUBSW xmm1, xmm2, xmm3/m128 xmm3/m128 from packed signed word integers in
xmm2 and saturate results.
VEX.256.66.0F.WIG E8 /r B V/V AVX2 Subtract packed signed byte integers in
VPSUBSB ymm1, ymm2, ymm3/m256 ymm3/m256 from packed signed byte integers in
ymm2 and saturate results.
VEX.256.66.0F.WIG E9 /r B V/V AVX2 Subtract packed signed word integers in
VPSUBSW ymm1, ymm2, ymm3/m256 ymm3/m256 from packed signed word integers in
ymm2 and saturate results.
EVEX.128.66.0F.WIG E8 /r C V/V (AVX512VL AND Subtract packed signed byte integers in
VPSUBSB xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 from packed signed byte integers in
xmm3/m128 AVX10.12 xmm2 and saturate results and store in xmm1
using writemask k1.
EVEX.256.66.0F.WIG E8 /r C V/V (AVX512VL AND Subtract packed signed byte integers in
VPSUBSB ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 from packed signed byte integers in
ymm3/m256 AVX10.12 ymm2 and saturate results and store in ymm1
using writemask k1.
EVEX.512.66.0F.WIG E8 /r C V/V AVX512BW Subtract packed signed byte integers in
VPSUBSB zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 from packed signed byte integers in
zmm3/m512 zmm2 and saturate results and store in zmm1 using
writemask k1.
EVEX.128.66.0F.WIG E9 /r C V/V (AVX512VL AND Subtract packed signed word integers in
VPSUBSW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 from packed signed word integers in
xmm3/m128 AVX10.12 xmm2 and saturate results and store in xmm1
using writemask k1.
EVEX.256.66.0F.WIG E9 /r C V/V (AVX512VL AND Subtract packed signed word integers in
VPSUBSW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 from packed signed word integers in
ymm3/m256 AVX10.12 ymm2 and saturate results and store in ymm1
using writemask k1.
EVEX.512.66.0F.WIG E9 /r C V/V AVX512BW Subtract packed signed word integers in
VPSUBSW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 from packed signed word integers in
zmm3/m512 zmm2 and saturate results and store in zmm1 using
writemask k1.



NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD subtract of the packed signed integers of the source operand (second operand) from the packed
signed integers of the destination operand (first operand), and stores the packed integer results in the destination
operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an
illustration of a SIMD operation. Overflow is handled with signed saturation, as described in the following para-
graphs.
The (V)PSUBSB instruction subtracts packed signed byte integers. When an individual byte result is beyond the
range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H,
respectively, is written to the destination operand.
The (V)PSUBSW instruction subtracts packed signed word integers. When an individual word result is beyond the
range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or
8000H, respectively, is written to the destination operand.
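
For example (an editorial sketch, not SDM text; requires SSE2), signed byte saturation contrasts with the wraparound of PSUBB:

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi8(-128);   /* 80H */
    __m128i b = _mm_set1_epi8(1);
    __m128i r = _mm_subs_epi8(a, b);   /* -129 saturates to -128 (80H) */
    printf("%d\n", (signed char)_mm_cvtsi128_si32(r));  /* prints -128 */
    return 0;
}
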
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.

Operation
PSUBSB (With 64-bit Operands)
DEST[7:0] := SaturateToSignedByte (DEST[7:0] − SRC[7:0]);
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToSignedByte (DEST[63:56] − SRC[63:56]);



PSUBSW (With 64-bit Operands)
DEST[15:0] := SaturateToSignedWord (DEST[15:0] − SRC[15:0] );
(* Repeat subtract operation for 2nd and 3rd words *)
DEST[63:48] := SaturateToSignedWord (DEST[63:48] − SRC[63:48] );

VPSUBSB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8;
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToSignedByte (SRC1[i+7:i] - SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0;
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPSUBSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToSignedWord (SRC1[i+15:i] - SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0;
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;

VPSUBSB (VEX.256 Encoded Version)


DEST[7:0] := SaturateToSignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 31st bytes *)
DEST[255:248] := SaturateToSignedByte (SRC1[255:248] - SRC2[255:248]);
DEST[MAXVL-1:256] := 0;

VPSUBSB (VEX.128 Encoded Version)


DEST[7:0] := SaturateToSignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToSignedByte (SRC1[127:120] - SRC2[127:120]);
DEST[MAXVL-1:128] := 0;

PSUBSB (128-bit Legacy SSE Version)


DEST[7:0] := SaturateToSignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToSignedByte (DEST[127:120] - SRC[127:120]);
DEST[MAXVL-1:128] (Unmodified);



VPSUBSW (VEX.256 Encoded Version)
DEST[15:0] := SaturateToSignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 15th words *)
DEST[255:240] := SaturateToSignedWord (SRC1[255:240] - SRC2[255:240]);
DEST[MAXVL-1:256] := 0;

VPSUBSW (VEX.128 Encoded Version)


DEST[15:0] := SaturateToSignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] := SaturateToSignedWord (SRC1[127:112] - SRC2[127:112]);
DEST[MAXVL-1:128] := 0;

PSUBSW (128-bit Legacy SSE Version)


DEST[15:0] := SaturateToSignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] := SaturateToSignedWord (DEST[127:112] - SRC[127:112]);
DEST[MAXVL-1:128] (Unmodified);

Intel C/C++ Compiler Intrinsic Equivalents


VPSUBSB __m512i _mm512_subs_epi8(__m512i a, __m512i b);
VPSUBSB __m512i _mm512_mask_subs_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPSUBSB __m512i _mm512_maskz_subs_epi8( __mmask64 k, __m512i a, __m512i b);
VPSUBSB __m256i _mm256_mask_subs_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPSUBSB __m256i _mm256_maskz_subs_epi8( __mmask32 k, __m256i a, __m256i b);
VPSUBSB __m128i _mm_mask_subs_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPSUBSB __m128i _mm_maskz_subs_epi8( __mmask16 k, __m128i a, __m128i b);
VPSUBSW __m512i _mm512_subs_epi16(__m512i a, __m512i b);
VPSUBSW __m512i _mm512_mask_subs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPSUBSW __m512i _mm512_maskz_subs_epi16( __mmask32 k, __m512i a, __m512i b);
VPSUBSW __m256i _mm256_mask_subs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPSUBSW __m256i _mm256_maskz_subs_epi16( __mmask16 k, __m256i a, __m256i b);
VPSUBSW __m128i _mm_mask_subs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBSW __m128i _mm_maskz_subs_epi16( __mmask8 k, __m128i a, __m128i b);
PSUBSB __m64 _mm_subs_pi8(__m64 m1, __m64 m2)
(V)PSUBSB __m128i _mm_subs_epi8(__m128i m1, __m128i m2)
VPSUBSB __m256i _mm256_subs_epi8(__m256i m1, __m256i m2)
PSUBSW __m64 _mm_subs_pi16(__m64 m1, __m64 m2)
(V)PSUBSW __m128i _mm_subs_epi16(__m128i m1, __m128i m2)
VPSUBSW __m256i _mm256_subs_epi16(__m256i m1, __m256i m2)
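For illustration only (this example is not part of the instruction definition), the following minimal C sketch, which assumes an SSE2-capable compiler and the <emmintrin.h> header, demonstrates the signed-saturation behavior through the _mm_subs_epi8 intrinsic listed above:

#include <emmintrin.h>   /* SSE2 intrinsics, including _mm_subs_epi8 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* -128 - 1 would wrap to +127 with ordinary two's-complement
       arithmetic; PSUBSB instead saturates each byte to 80H (-128). */
    __m128i a = _mm_set1_epi8(-128);
    __m128i b = _mm_set1_epi8(1);
    __m128i r = _mm_subs_epi8(a, b);   /* compiles to (V)PSUBSB */

    int8_t out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d\n", out[0]);            /* prints -128, not 127 */
    return 0;
}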

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

PSUBUSB/PSUBUSW—Subtract Packed Unsigned Integers With Unsigned Saturation
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F D8 /r1 A V/V MMX Subtract unsigned packed bytes in mm/m64
PSUBUSB mm, mm/m64 from unsigned packed bytes in mm and saturate
result.
66 0F D8 /r A V/V SSE2 Subtract packed unsigned byte integers in
PSUBUSB xmm1, xmm2/m128 xmm2/m128 from packed unsigned byte
integers in xmm1 and saturate result.
NP 0F D9 /r1 A V/V MMX Subtract unsigned packed words in mm/m64
PSUBUSW mm, mm/m64 from unsigned packed words in mm and saturate
result.
66 0F D9 /r A V/V SSE2 Subtract packed unsigned word integers in
PSUBUSW xmm1, xmm2/m128 xmm2/m128 from packed unsigned word
integers in xmm1 and saturate result.
VEX.128.66.0F.WIG D8 /r B V/V AVX Subtract packed unsigned byte integers in
VPSUBUSB xmm1, xmm2, xmm3/m128 xmm3/m128 from packed unsigned byte
integers in xmm2 and saturate result.
VEX.128.66.0F.WIG D9 /r B V/V AVX Subtract packed unsigned word integers in
VPSUBUSW xmm1, xmm2, xmm3/m128 xmm3/m128 from packed unsigned word
integers in xmm2 and saturate result.
VEX.256.66.0F.WIG D8 /r B V/V AVX2 Subtract packed unsigned byte integers in
VPSUBUSB ymm1, ymm2, ymm3/m256 ymm3/m256 from packed unsigned byte
integers in ymm2 and saturate result.
VEX.256.66.0F.WIG D9 /r B V/V AVX2 Subtract packed unsigned word integers in
VPSUBUSW ymm1, ymm2, ymm3/m256 ymm3/m256 from packed unsigned word
integers in ymm2 and saturate result.
EVEX.128.66.0F.WIG D8 /r C V/V (AVX512VL AND Subtract packed unsigned byte integers in
VPSUBUSB xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 from packed unsigned byte
xmm3/m128 AVX10.12 integers in xmm2, saturate results and store in
xmm1 using writemask k1.
EVEX.256.66.0F.WIG D8 /r C V/V (AVX512VL AND Subtract packed unsigned byte integers in
VPSUBUSB ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 from packed unsigned byte
ymm3/m256 AVX10.12 integers in ymm2, saturate results and store in
ymm1 using writemask k1.
EVEX.512.66.0F.WIG D8 /r C V/V AVX512BW Subtract packed unsigned byte integers in
VPSUBUSB zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 from packed unsigned byte
zmm3/m512 integers in zmm2, saturate results and store in
zmm1 using writemask k1.
EVEX.128.66.0F.WIG D9 /r C V/V (AVX512VL AND Subtract packed unsigned word integers in
VPSUBUSW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 from packed unsigned word
xmm3/m128 AVX10.12 integers in xmm2 and saturate results and store
in xmm1 using writemask k1.
EVEX.256.66.0F.WIG D9 /r C V/V (AVX512VL AND Subtract packed unsigned word integers in
VPSUBUSW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 from packed unsigned word
ymm3/m256 AVX10.12 integers in ymm2, saturate results and store in
ymm1 using writemask k1.
EVEX.512.66.0F.WIG D9 /r C V/V AVX512BW Subtract packed unsigned word integers in
VPSUBUSW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 from packed unsigned word
zmm3/m512 integers in zmm2, saturate results and store in
zmm1 using writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD subtract of the packed unsigned integers of the source operand (second operand) from the
packed unsigned integers of the destination operand (first operand), and stores the packed unsigned integer
results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as
described in the following paragraphs.
These instructions can operate on either 64-bit or 128-bit operands.
The (V)PSUBUSB instruction subtracts packed unsigned byte integers. When an individual byte result is less than
zero, the saturated value of 00H is written to the destination operand.
The (V)PSUBUSW instruction subtracts packed unsigned word integers. When an individual word result is less than
zero, the saturated value of 0000H is written to the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.
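As a hedged illustration of this writemask behavior (assuming compiler and processor support for AVX512BW and AVX512VL, e.g., GCC or Clang with -mavx512bw -mavx512vl; this sketch is not part of the SDM text), the following C fragment uses the _mm_mask_subs_epu8 intrinsic so that only the byte lanes selected by the mask receive the saturated difference:

#include <immintrin.h>   /* AVX-512 intrinsics */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    __m128i a = _mm_set1_epi8(100);
    __m128i b = _mm_set1_epi8(1);
    __m128i s = _mm_set1_epi8(-1);      /* merge source, FFH per byte */

    /* Merging-masking: lanes 0-7 are updated with the saturated
       difference (99); lanes 8-15 keep the value from s. */
    __m128i r = _mm_mask_subs_epu8(s, 0x00FF, a, b);

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%u %u\n", out[0], out[15]); /* prints 99 255 */
    return 0;
}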

Operation
PSUBUSB (With 64-bit Operands)
DEST[7:0] := SaturateToUnsignedByte (DEST[7:0] − SRC[7:0]);
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToUnsignedByte (DEST[63:56] − SRC[63:56]);

PSUBUSW (With 64-bit Operands)
DEST[15:0] := SaturateToUnsignedWord (DEST[15:0] − SRC[15:0] );
(* Repeat subtract operation for 2nd and 3rd words *)
DEST[63:48] := SaturateToUnsignedWord (DEST[63:48] − SRC[63:48] );

VPSUBUSB (EVEX Encoded Versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8;
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToUnsignedByte (SRC1[i+7:i] - SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0;
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;

VPSUBUSW (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16;
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToUnsignedWord (SRC1[i+15:i] - SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0;
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;

VPSUBUSB (VEX.256 Encoded Version)


DEST[7:0] := SaturateToUnsignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 31st bytes *)
DEST[255:248] := SaturateToUnsignedByte (SRC1[255:248] - SRC2[255:248]);
DEST[MAXVL-1:256] := 0;

VPSUBUSB (VEX.128 Encoded Version)


DEST[7:0] := SaturateToUnsignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToUnsignedByte (SRC1[127:120] - SRC2[127:120]);
DEST[MAXVL-1:128] := 0

PSUBUSB (128-bit Legacy SSE Version)


DEST[7:0] := SaturateToUnsignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] := SaturateToUnsignedByte (DEST[127:120] - SRC[127:120]);
DEST[MAXVL-1:128] (Unmodified)

VPSUBUSW (VEX.256 Encoded Version)
DEST[15:0] := SaturateToUnsignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 15th words *)
DEST[255:240] := SaturateToUnsignedWord (SRC1[255:240] - SRC2[255:240]);
DEST[MAXVL-1:256] := 0;

VPSUBUSW (VEX.128 Encoded Version)


DEST[15:0] := SaturateToUnsignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] := SaturateToUnsignedWord (SRC1[127:112] - SRC2[127:112]);
DEST[MAXVL-1:128] := 0

PSUBUSW (128-bit Legacy SSE Version)


DEST[15:0] := SaturateToUnsignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] := SaturateToUnsignedWord (DEST[127:112] - SRC[127:112]);
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents


VPSUBUSB __m512i _mm512_subs_epu8(__m512i a, __m512i b);
VPSUBUSB __m512i _mm512_mask_subs_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPSUBUSB __m512i _mm512_maskz_subs_epu8( __mmask64 k, __m512i a, __m512i b);
VPSUBUSB __m256i _mm256_mask_subs_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPSUBUSB __m256i _mm256_maskz_subs_epu8( __mmask32 k, __m256i a, __m256i b);
VPSUBUSB __m128i _mm_mask_subs_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPSUBUSB __m128i _mm_maskz_subs_epu8( __mmask16 k, __m128i a, __m128i b);
VPSUBUSW __m512i _mm512_subs_epu16(__m512i a, __m512i b);
VPSUBUSW __m512i _mm512_mask_subs_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPSUBUSW __m512i _mm512_maskz_subs_epu16( __mmask32 k, __m512i a, __m512i b);
VPSUBUSW __m256i _mm256_mask_subs_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPSUBUSW __m256i _mm256_maskz_subs_epu16( __mmask16 k, __m256i a, __m256i b);
VPSUBUSW __m128i _mm_mask_subs_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBUSW __m128i _mm_maskz_subs_epu16( __mmask8 k, __m128i a, __m128i b);
PSUBUSB __m64 _mm_subs_pu8(__m64 m1, __m64 m2)
(V)PSUBUSB __m128i _mm_subs_epu8(__m128i m1, __m128i m2)
VPSUBUSB __m256i _mm256_subs_epu8(__m256i m1, __m256i m2)
PSUBUSW __m64 _mm_subs_pu16(__m64 m1, __m64 m2)
(V)PSUBUSW __m128i _mm_subs_epu16(__m128i m1, __m128i m2)
VPSUBUSW __m256i _mm256_subs_epu16(__m256i m1, __m256i m2)
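The following minimal C sketch (illustrative only; it assumes an SSE2-capable compiler and the <emmintrin.h> header) shows the unsigned-saturation clamp to 00H using the _mm_subs_epu8 intrinsic from the list above:

#include <emmintrin.h>   /* SSE2 intrinsics, including _mm_subs_epu8 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 10 - 25 would wrap to 241 with ordinary modulo-256 arithmetic;
       PSUBUSB instead clamps each byte result to 00H. */
    __m128i a = _mm_set1_epi8(10);
    __m128i b = _mm_set1_epi8(25);
    __m128i r = _mm_subs_epu8(a, b);   /* compiles to (V)PSUBUSB */

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%u\n", out[0]);            /* prints 0, not 241 */
    return 0;
}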

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ/PUNPCKHQDQ— Unpack High Data
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 68 /r1 A V/V MMX Unpack and interleave high-order bytes from
PUNPCKHBW mm, mm/m64 mm and mm/m64 into mm.
66 0F 68 /r A V/V SSE2 Unpack and interleave high-order bytes from
PUNPCKHBW xmm1, xmm2/m128 xmm1 and xmm2/m128 into xmm1.
NP 0F 69 /r1 A V/V MMX Unpack and interleave high-order words from
PUNPCKHWD mm, mm/m64 mm and mm/m64 into mm.
66 0F 69 /r A V/V SSE2 Unpack and interleave high-order words from
PUNPCKHWD xmm1, xmm2/m128 xmm1 and xmm2/m128 into xmm1.
NP 0F 6A /r1 A V/V MMX Unpack and interleave high-order
PUNPCKHDQ mm, mm/m64 doublewords from mm and mm/m64 into mm.
66 0F 6A /r A V/V SSE2 Unpack and interleave high-order
PUNPCKHDQ xmm1, xmm2/m128 doublewords from xmm1 and xmm2/m128
into xmm1.
66 0F 6D /r A V/V SSE2 Unpack and interleave high-order quadwords
PUNPCKHQDQ xmm1, xmm2/m128 from xmm1 and xmm2/m128 into xmm1.
VEX.128.66.0F.WIG 68/r B V/V AVX Interleave high-order bytes from xmm2 and
VPUNPCKHBW xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 69/r B V/V AVX Interleave high-order words from xmm2 and
VPUNPCKHWD xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 6A/r B V/V AVX Interleave high-order doublewords from
VPUNPCKHDQ xmm1, xmm2, xmm3/m128 xmm2 and xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 6D/r B V/V AVX Interleave high-order quadword from xmm2
VPUNPCKHQDQ xmm1, xmm2, xmm3/m128 and xmm3/m128 into xmm1 register.
VEX.256.66.0F.WIG 68 /r B V/V AVX2 Interleave high-order bytes from ymm2 and
VPUNPCKHBW ymm1, ymm2, ymm3/m256 ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 69 /r B V/V AVX2 Interleave high-order words from ymm2 and
VPUNPCKHWD ymm1, ymm2, ymm3/m256 ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 6A /r B V/V AVX2 Interleave high-order doublewords from
VPUNPCKHDQ ymm1, ymm2, ymm3/m256 ymm2 and ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 6D /r B V/V AVX2 Interleave high-order quadword from ymm2
VPUNPCKHQDQ ymm1, ymm2, ymm3/m256 and ymm3/m256 into ymm1 register.
EVEX.128.66.0F.WIG 68 /r C V/V (AVX512VL AND Interleave high-order bytes from xmm2 and
VPUNPCKHBW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register using k1
xmm3/m128 AVX10.12 write mask.
EVEX.128.66.0F.WIG 69 /r C V/V (AVX512VL AND Interleave high-order words from xmm2 and
VPUNPCKHWD xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register using k1
xmm3/m128 AVX10.12 write mask.
EVEX.128.66.0F.W0 6A /r D V/V (AVX512VL AND Interleave high-order doublewords from
VPUNPCKHDQ xmm1 {k1}{z}, xmm2, AVX512F) OR xmm2 and xmm3/m128/m32bcst into xmm1
xmm3/m128/m32bcst AVX10.12 register using k1 write mask.
EVEX.128.66.0F.W1 6D /r D V/V (AVX512VL AND Interleave high-order quadword from xmm2
VPUNPCKHQDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst into xmm1
xmm3/m128/m64bcst AVX10.12 register using k1 write mask.

EVEX.256.66.0F.WIG 68 /r C V/V (AVX512VL AND Interleave high-order bytes from ymm2 and
VPUNPCKHBW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 into ymm1 register using k1
ymm3/m256 AVX10.12 write mask.
EVEX.256.66.0F.WIG 69 /r C V/V (AVX512VL AND Interleave high-order words from ymm2 and
VPUNPCKHWD ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 into ymm1 register using k1
ymm3/m256 AVX10.12 write mask.
EVEX.256.66.0F.W0 6A /r D V/V (AVX512VL AND Interleave high-order doublewords from
VPUNPCKHDQ ymm1 {k1}{z}, ymm2, AVX512F) OR ymm2 and ymm3/m256/m32bcst into ymm1
ymm3/m256/m32bcst AVX10.12 register using k1 write mask.
EVEX.256.66.0F.W1 6D /r D V/V (AVX512VL AND Interleave high-order quadword from ymm2
VPUNPCKHQDQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst into ymm1
ymm3/m256/m64bcst AVX10.12 register using k1 write mask.
EVEX.512.66.0F.WIG 68/r C V/V AVX512BW Interleave high-order bytes from zmm2 and
VPUNPCKHBW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register.
zmm3/m512
EVEX.512.66.0F.WIG 69/r C V/V AVX512BW Interleave high-order words from zmm2 and
VPUNPCKHWD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register.
zmm3/m512
EVEX.512.66.0F.W0 6A /r D V/V AVX512F Interleave high-order doublewords from
VPUNPCKHDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m32bcst into zmm1
zmm3/m512/m32bcst register using k1 write mask.
EVEX.512.66.0F.W1 6D /r D V/V AVX512F Interleave high-order quadword from zmm2
VPUNPCKHQDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512/m64bcst into zmm1 register
zmm3/m512/m64bcst using k1 write mask.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Unpacks and interleaves the high-order data elements (bytes, words, doublewords, or quadwords) of the destina-
tion operand (first operand) and source operand (second operand) into the destination operand. Figure 4-20 shows
the unpack operation for bytes in 64-bit operands. The low-order data elements are ignored.

SRC = Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0, DEST = X7 X6 X5 X4 X3 X2 X1 X0;
result DEST = Y7 X7 Y6 X6 Y5 X5 Y4 X4

Figure 4-20. PUNPCKHBW Instruction Operation Using 64-bit Operands

SRC2 = Y7 ... Y0, SRC1 = X7 ... X0 (doublewords);
result DEST = Y7 X7 Y6 X6 Y3 X3 Y2 X2

Figure 4-21. 256-bit VPUNPCKHDQ Instruction Operation

When the source data comes from a 64-bit memory operand, the full 64-bit operand is accessed from memory, but
the instruction uses only the high-order 32 bits. When the source data comes from a 128-bit memory operand, an
implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal
segment checking will still be enforced.
The (V)PUNPCKHBW instruction interleaves the high-order bytes of the source and destination operands, the
(V)PUNPCKHWD instruction interleaves the high-order words of the source and destination operands, the (V)PUNP-
CKHDQ instruction interleaves the high-order doubleword (or doublewords) of the source and destination oper-
ands, and the (V)PUNPCKHQDQ instruction interleaves the high-order quadwords of the source and destination
operands.
These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the high-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKHBW instruction the
high-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKHWD
instruction, the high-order words are zero extended (unpacked into unsigned doubleword integers).
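As an illustrative sketch (not part of the SDM text; it assumes an SSE2-capable compiler and the <emmintrin.h> header), the following C fragment applies this zero-extension technique through the _mm_unpackhi_epi8 intrinsic:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t bytes[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                          8, 9, 10, 11, 12, 13, 250, 255 };
    __m128i v    = _mm_loadu_si128((const __m128i *)bytes);
    __m128i zero = _mm_setzero_si128();

    /* PUNPCKHBW with an all-zero second source: the high eight bytes
       of v are zero extended into unsigned word integers. */
    __m128i hi_words = _mm_unpackhi_epi8(v, zero);

    uint16_t out[8];
    _mm_storeu_si128((__m128i *)out, hi_words);
    printf("%u %u\n", out[6], out[7]);  /* prints 250 255 */
    return 0;
}
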
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers.

EVEX encoded VPUNPCKHDQ/QDQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source
operand and destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with
writemask k1.
EVEX encoded VPUNPCKHWD/BW: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination
is conditionally updated with writemask k1.

Operation
PUNPCKHBW Instruction With 64-bit Operands:
DEST[7:0] := DEST[39:32];
DEST[15:8] := SRC[39:32];
DEST[23:16] := DEST[47:40];
DEST[31:24] := SRC[47:40];
DEST[39:32] := DEST[55:48];
DEST[47:40] := SRC[55:48];
DEST[55:48] := DEST[63:56];
DEST[63:56] := SRC[63:56];

PUNPCKHWD Instruction With 64-bit Operands:


DEST[15:0] := DEST[47:32];
DEST[31:16] := SRC[47:32];
DEST[47:32] := DEST[63:48];
DEST[63:48] := SRC[63:48];

PUNPCKHDQ Instruction With 64-bit Operands:


DEST[31:0] := DEST[63:32];
DEST[63:32] := SRC[63:32];

INTERLEAVE_HIGH_BYTES_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_HIGH_BYTES_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_HIGH_BYTES_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_BYTES_256b (SRC1, SRC2)


DEST[7:0] := SRC1[71:64]
DEST[15:8] := SRC2[71:64]
DEST[23:16] := SRC1[79:72]
DEST[31:24] := SRC2[79:72]
DEST[39:32] := SRC1[87:80]
DEST[47:40] := SRC2[87:80]
DEST[55:48] := SRC1[95:88]
DEST[63:56] := SRC2[95:88]
DEST[71:64] := SRC1[103:96]
DEST[79:72] := SRC2[103:96]
DEST[87:80] := SRC1[111:104]
DEST[95:88] := SRC2[111:104]
DEST[103:96] := SRC1[119:112]
DEST[111:104] := SRC2[119:112]
DEST[119:112] := SRC1[127:120]
DEST[127:120] := SRC2[127:120]
DEST[135:128] := SRC1[199:192]
DEST[143:136] := SRC2[199:192]
DEST[151:144] := SRC1[207:200]
DEST[159:152] := SRC2[207:200]
DEST[167:160] := SRC1[215:208]
DEST[175:168] := SRC2[215:208]
DEST[183:176] := SRC1[223:216]
DEST[191:184] := SRC2[223:216]
DEST[199:192] := SRC1[231:224]
DEST[207:200] := SRC2[231:224]
DEST[215:208] := SRC1[239:232]
DEST[223:216] := SRC2[239:232]
DEST[231:224] := SRC1[247:240]
DEST[239:232] := SRC2[247:240]
DEST[247:240] := SRC1[255:248]
DEST[255:248] := SRC2[255:248]

INTERLEAVE_HIGH_BYTES (SRC1, SRC2)


DEST[7:0] := SRC1[71:64]
DEST[15:8] := SRC2[71:64]
DEST[23:16] := SRC1[79:72]
DEST[31:24] := SRC2[79:72]
DEST[39:32] := SRC1[87:80]
DEST[47:40] := SRC2[87:80]
DEST[55:48] := SRC1[95:88]
DEST[63:56] := SRC2[95:88]
DEST[71:64] := SRC1[103:96]
DEST[79:72] := SRC2[103:96]
DEST[87:80] := SRC1[111:104]
DEST[95:88] := SRC2[111:104]
DEST[103:96] := SRC1[119:112]
DEST[111:104] := SRC2[119:112]
DEST[119:112] := SRC1[127:120]
DEST[127:120] := SRC2[127:120]

INTERLEAVE_HIGH_WORDS_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_HIGH_WORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_HIGH_WORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_WORDS_256b(SRC1, SRC2)
DEST[15:0] := SRC1[79:64]
DEST[31:16] := SRC2[79:64]
DEST[47:32] := SRC1[95:80]
DEST[63:48] := SRC2[95:80]
DEST[79:64] := SRC1[111:96]
DEST[95:80] := SRC2[111:96]
DEST[111:96] := SRC1[127:112]
DEST[127:112] := SRC2[127:112]
DEST[143:128] := SRC1[207:192]
DEST[159:144] := SRC2[207:192]
DEST[175:160] := SRC1[223:208]
DEST[191:176] := SRC2[223:208]
DEST[207:192] := SRC1[239:224]
DEST[223:208] := SRC2[239:224]
DEST[239:224] := SRC1[255:240]
DEST[255:240] := SRC2[255:240]

INTERLEAVE_HIGH_WORDS (SRC1, SRC2)
DEST[15:0] := SRC1[79:64]
DEST[31:16] := SRC2[79:64]
DEST[47:32] := SRC1[95:80]
DEST[63:48] := SRC2[95:80]
DEST[79:64] := SRC1[111:96]
DEST[95:80] := SRC2[111:96]
DEST[111:96] := SRC1[127:112]
DEST[127:112] := SRC2[127:112]

INTERLEAVE_HIGH_DWORDS_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_HIGH_DWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_HIGH_DWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_DWORDS_256b(SRC1, SRC2)
DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
DEST[159:128] := SRC1[223:192]
DEST[191:160] := SRC2[223:192]
DEST[223:192] := SRC1[255:224]
DEST[255:224] := SRC2[255:224]

INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]

INTERLEAVE_HIGH_QWORDS_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_HIGH_QWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_HIGH_QWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_QWORDS_256b(SRC1, SRC2)
DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
DEST[191:128] := SRC1[255:192]
DEST[255:192] := SRC2[255:192]

INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]

PUNPCKHBW (128-bit Legacy SSE Version)


DEST[127:0] := INTERLEAVE_HIGH_BYTES(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKHBW (VEX.128 Encoded Version)


DEST[127:0] := INTERLEAVE_HIGH_BYTES(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKHBW (VEX.256 Encoded Version)


DEST[255:0] := INTERLEAVE_HIGH_BYTES_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKHBW (EVEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_BYTES(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_BYTES_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_BYTES_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKHWD (128-bit Legacy SSE Version)


DEST[127:0] := INTERLEAVE_HIGH_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKHWD (VEX.128 Encoded Version)


DEST[127:0] := INTERLEAVE_HIGH_WORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKHWD (VEX.256 Encoded Version)


DEST[255:0] := INTERLEAVE_HIGH_WORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKHWD (EVEX Encoded Versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_WORDS(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_WORDS_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_WORDS_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKHDQ (128-bit Legacy SSE Version)


DEST[127:0] := INTERLEAVE_HIGH_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKHDQ (VEX.128 Encoded Version)


DEST[127:0] := INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKHDQ (VEX.256 Encoded Version)


DEST[255:0] := INTERLEAVE_HIGH_DWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKHDQ (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_DWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_DWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_DWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKHQDQ (128-bit Legacy SSE Version)


DEST[127:0] := INTERLEAVE_HIGH_QWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKHQDQ (VEX.128 Encoded Version)


DEST[127:0] := INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKHQDQ (VEX.256 Encoded Version)


DEST[255:0] := INTERLEAVE_HIGH_QWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKHQDQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_QWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_QWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_HIGH_QWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPUNPCKHBW __m512i _mm512_unpackhi_epi8(__m512i a, __m512i b);
VPUNPCKHBW __m512i _mm512_mask_unpackhi_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPUNPCKHBW __m512i _mm512_maskz_unpackhi_epi8( __mmask64 k, __m512i a, __m512i b);
VPUNPCKHBW __m256i _mm256_mask_unpackhi_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPUNPCKHBW __m256i _mm256_maskz_unpackhi_epi8( __mmask32 k, __m256i a, __m256i b);

VPUNPCKHBW __m128i _mm_mask_unpackhi_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPUNPCKHBW __m128i _mm_maskz_unpackhi_epi8( __mmask16 k, __m128i a, __m128i b);
VPUNPCKHWD __m512i _mm512_unpackhi_epi16(__m512i a, __m512i b);
VPUNPCKHWD __m512i _mm512_mask_unpackhi_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPUNPCKHWD __m512i _mm512_maskz_unpackhi_epi16( __mmask32 k, __m512i a, __m512i b);
VPUNPCKHWD __m256i _mm256_mask_unpackhi_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPUNPCKHWD __m256i _mm256_maskz_unpackhi_epi16( __mmask16 k, __m256i a, __m256i b);
VPUNPCKHWD __m128i _mm_mask_unpackhi_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHWD __m128i _mm_maskz_unpackhi_epi16( __mmask8 k, __m128i a, __m128i b);
VPUNPCKHDQ __m512i _mm512_unpackhi_epi32(__m512i a, __m512i b);
VPUNPCKHDQ __m512i _mm512_mask_unpackhi_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPUNPCKHDQ __m512i _mm512_maskz_unpackhi_epi32( __mmask16 k, __m512i a, __m512i b);
VPUNPCKHDQ __m256i _mm256_mask_unpackhi_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKHDQ __m256i _mm256_maskz_unpackhi_epi32( __mmask8 k, __m256i a, __m256i b);
VPUNPCKHDQ __m128i _mm_mask_unpackhi_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHDQ __m128i _mm_maskz_unpackhi_epi32( __mmask8 k, __m128i a, __m128i b);
VPUNPCKHQDQ __m512i _mm512_unpackhi_epi64(__m512i a, __m512i b);
VPUNPCKHQDQ __m512i _mm512_mask_unpackhi_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPUNPCKHQDQ __m512i _mm512_maskz_unpackhi_epi64( __mmask8 k, __m512i a, __m512i b);
VPUNPCKHQDQ __m256i _mm256_mask_unpackhi_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKHQDQ __m256i _mm256_maskz_unpackhi_epi64( __mmask8 k, __m256i a, __m256i b);
VPUNPCKHQDQ __m128i _mm_mask_unpackhi_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHQDQ __m128i _mm_maskz_unpackhi_epi64( __mmask8 k, __m128i a, __m128i b);
PUNPCKHBW __m64 _mm_unpackhi_pi8(__m64 m1, __m64 m2)
(V)PUNPCKHBW __m128i _mm_unpackhi_epi8(__m128i m1, __m128i m2)
VPUNPCKHBW __m256i _mm256_unpackhi_epi8(__m256i m1, __m256i m2)
PUNPCKHWD __m64 _mm_unpackhi_pi16(__m64 m1,__m64 m2)
(V)PUNPCKHWD __m128i _mm_unpackhi_epi16(__m128i m1,__m128i m2)
VPUNPCKHWD __m256i _mm256_unpackhi_epi16(__m256i m1,__m256i m2)
PUNPCKHDQ __m64 _mm_unpackhi_pi32(__m64 m1, __m64 m2)
(V)PUNPCKHDQ __m128i _mm_unpackhi_epi32(__m128i m1, __m128i m2)
VPUNPCKHDQ __m256i _mm256_unpackhi_epi32(__m256i m1, __m256i m2)
(V)PUNPCKHQDQ __m128i _mm_unpackhi_epi64 ( __m128i a, __m128i b)
VPUNPCKHQDQ __m256i _mm256_unpackhi_epi64 ( __m256i a, __m256i b)
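For illustration (assuming compiler and processor support for AVX512F, e.g., -mavx512f; the example is a sketch, not part of the instruction definition), the following C fragment shows the merging-masking form of VPUNPCKHDQ through the _mm512_mask_unpackhi_epi32 intrinsic listed above:

#include <immintrin.h>   /* AVX-512F intrinsics */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    __m512i a = _mm512_set1_epi32(1);
    __m512i b = _mm512_set1_epi32(2);
    __m512i s = _mm512_set1_epi32(-1);  /* merge source */

    /* Mask 0x5555 selects the even dword lanes: they receive the
       interleave result; the odd lanes keep the value from s. */
    __m512i r = _mm512_mask_unpackhi_epi32(s, 0x5555, a, b);

    int32_t out[16];
    _mm512_storeu_si512(out, r);
    printf("%d %d\n", out[0], out[1]);  /* prints 1 -1 */
    return 0;
}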

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPUNPCKHDQ/QDQ, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPUNPCKHBW/WD, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Condi-
tions.”

PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ—Unpack Low Data
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 60 /r1 A V/V MMX Interleave low-order bytes from mm and
PUNPCKLBW mm, mm/m32 mm/m32 into mm.
66 0F 60 /r A V/V SSE2 Interleave low-order bytes from xmm1 and
PUNPCKLBW xmm1, xmm2/m128 xmm2/m128 into xmm1.
NP 0F 61 /r1 A V/V MMX Interleave low-order words from mm and
PUNPCKLWD mm, mm/m32 mm/m32 into mm.
66 0F 61 /r A V/V SSE2 Interleave low-order words from xmm1 and
PUNPCKLWD xmm1, xmm2/m128 xmm2/m128 into xmm1.
NP 0F 62 /r1 A V/V MMX Interleave low-order doublewords from mm
PUNPCKLDQ mm, mm/m32 and mm/m32 into mm.
66 0F 62 /r A V/V SSE2 Interleave low-order doublewords from xmm1
PUNPCKLDQ xmm1, xmm2/m128 and xmm2/m128 into xmm1.
66 0F 6C /r A V/V SSE2 Interleave low-order quadword from xmm1
PUNPCKLQDQ xmm1, xmm2/m128 and xmm2/m128 into xmm1 register.
VEX.128.66.0F.WIG 60/r B V/V AVX Interleave low-order bytes from xmm2 and
VPUNPCKLBW xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 61/r B V/V AVX Interleave low-order words from xmm2 and
VPUNPCKLWD xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 62/r B V/V AVX Interleave low-order doublewords from xmm2
VPUNPCKLDQ xmm1, xmm2, xmm3/m128 and xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 6C/r B V/V AVX Interleave low-order quadword from xmm2
VPUNPCKLQDQ xmm1, xmm2, xmm3/m128 and xmm3/m128 into xmm1 register.
VEX.256.66.0F.WIG 60 /r B V/V AVX2 Interleave low-order bytes from ymm2 and
VPUNPCKLBW ymm1, ymm2, ymm3/m256 ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 61 /r B V/V AVX2 Interleave low-order words from ymm2 and
VPUNPCKLWD ymm1, ymm2, ymm3/m256 ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 62 /r B V/V AVX2 Interleave low-order doublewords from ymm2
VPUNPCKLDQ ymm1, ymm2, ymm3/m256 and ymm3/m256 into ymm1 register.
VEX.256.66.0F.WIG 6C /r B V/V AVX2 Interleave low-order quadword from ymm2
VPUNPCKLQDQ ymm1, ymm2, ymm3/m256 and ymm3/m256 into ymm1 register.

EVEX.128.66.0F.WIG 60 /r C V/V (AVX512VL AND Interleave low-order bytes from xmm2 and
VPUNPCKLBW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register subject to
xmm3/m128 AVX10.12 write mask k1.
EVEX.128.66.0F.WIG 61 /r C V/V (AVX512VL AND Interleave low-order words from xmm2 and
VPUNPCKLWD xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register subject to
xmm3/m128 AVX10.12 write mask k1.
EVEX.128.66.0F.W0 62 /r D V/V (AVX512VL AND Interleave low-order doublewords from xmm2
VPUNPCKLDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m32bcst into xmm1
xmm3/m128/m32bcst AVX10.12 register subject to write mask k1.
EVEX.128.66.0F.W1 6C /r D V/V (AVX512VL AND Interleave low-order quadword from xmm2
VPUNPCKLQDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst into xmm1
xmm3/m128/m64bcst AVX10.12 register subject to write mask k1.

EVEX.256.66.0F.WIG 60 /r C V/V (AVX512VL AND Interleave low-order bytes from ymm2 and
VPUNPCKLBW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 into ymm1 register subject to
ymm3/m256 AVX10.12 write mask k1.
EVEX.256.66.0F.WIG 61 /r C V/V (AVX512VL AND Interleave low-order words from ymm2 and
VPUNPCKLWD ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 into ymm1 register subject to
ymm3/m256 AVX10.12 write mask k1.
EVEX.256.66.0F.W0 62 /r D V/V (AVX512VL AND Interleave low-order doublewords from ymm2
VPUNPCKLDQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m32bcst into ymm1
ymm3/m256/m32bcst AVX10.12 register subject to write mask k1.
EVEX.256.66.0F.W1 6C /r D V/V (AVX512VL AND Interleave low-order quadword from ymm2
VPUNPCKLQDQ ymm1 {k1}{z}, ymm2, AVX512F) OR and ymm3/m256/m64bcst into ymm1
ymm3/m256/m64bcst AVX10.12 register subject to write mask k1.
EVEX.512.66.0F.WIG 60/r C V/V AVX512BW Interleave low-order bytes from zmm2 and
VPUNPCKLBW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register subject to
zmm3/m512 write mask k1.
EVEX.512.66.0F.WIG 61/r C V/V AVX512BW Interleave low-order words from zmm2 and
VPUNPCKLWD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register subject to
zmm3/m512 write mask k1.
EVEX.512.66.0F.W0 62 /r D V/V AVX512F Interleave low-order doublewords from zmm2
VPUNPCKLDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512/m32bcst into zmm1
zmm3/m512/m32bcst register subject to write mask k1.
EVEX.512.66.0F.W1 6C /r D V/V AVX512F Interleave low-order quadword from zmm2
VPUNPCKLQDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512/m64bcst into zmm1
zmm3/m512/m64bcst register subject to write mask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Unpacks and interleaves the low-order data elements (bytes, words, doublewords, and quadwords) of the destina-
tion operand (first operand) and source operand (second operand) into the destination operand (Figure 4-22
shows the unpack operation for bytes in 64-bit operands). The high-order data elements are ignored.

SRC = Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0, DEST = X7 X6 X5 X4 X3 X2 X1 X0;
result DEST = Y3 X3 Y2 X2 Y1 X1 Y0 X0

Figure 4-22. PUNPCKLBW Instruction Operation Using 64-bit Operands

SRC2 = Y7 ... Y0, SRC1 = X7 ... X0 (doublewords);
result DEST = Y5 X5 Y4 X4 Y1 X1 Y0 X0

Figure 4-23. 256-bit VPUNPCKLDQ Instruction Operation

When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate
64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.
The (V)PUNPCKLBW instruction interleaves the low-order bytes of the source and destination operands, the
(V)PUNPCKLWD instruction interleaves the low-order words of the source and destination operands, the (V)PUNP-
CKLDQ instruction interleaves the low-order doubleword (or doublewords) of the source and destination operands,
and the (V)PUNPCKLQDQ instruction interleaves the low-order quadwords of the source and destination operands.
These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the low-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKLBW instruction the
low-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKLWD
instruction, the low-order words are zero extended (unpacked into unsigned doubleword integers).
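A minimal C sketch of this widening idiom (illustrative only; it assumes an SSE2-capable compiler and the <emmintrin.h> header) pairs PUNPCKLBW and PUNPCKHBW against a zeroed register to widen all sixteen bytes of a vector to unsigned words:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t bytes[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                          8, 9, 10, 11, 12, 13, 14, 15 };
    __m128i v    = _mm_loadu_si128((const __m128i *)bytes);
    __m128i zero = _mm_setzero_si128();

    /* Low half first, then high half: every byte becomes a word. */
    __m128i lo = _mm_unpacklo_epi8(v, zero);
    __m128i hi = _mm_unpackhi_epi8(v, zero);

    uint16_t out[16];
    _mm_storeu_si128((__m128i *)&out[0], lo);
    _mm_storeu_si128((__m128i *)&out[8], hi);
    printf("%u %u\n", out[0], out[15]);  /* prints 0 15 */
    return 0;
}
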
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 32-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded VPUNPCKLDQ/QDQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source
operand and destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with
writemask k1.
EVEX encoded VPUNPCKLWD/BW: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination
is conditionally updated with writemask k1.

Operation
PUNPCKLBW Instruction With 64-bit Operands:
DEST[63:56] := SRC[31:24];
DEST[55:48] := DEST[31:24];
DEST[47:40] := SRC[23:16];
DEST[39:32] := DEST[23:16];
DEST[31:24] := SRC[15:8];
DEST[23:16] := DEST[15:8];
DEST[15:8] := SRC[7:0];
DEST[7:0] := DEST[7:0];

PUNPCKLWD Instruction With 64-bit Operands:


DEST[63:48] := SRC[31:16];
DEST[47:32] := DEST[31:16];
DEST[31:16] := SRC[15:0];
DEST[15:0] := DEST[15:0];

PUNPCKLDQ Instruction With 64-bit Operands:


DEST[63:32] := SRC[31:0];
DEST[31:0] := DEST[31:0];

INTERLEAVE_BYTES_512b (SRC1, SRC2)
TMP_DEST[255:0] := INTERLEAVE_BYTES_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_BYTES_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_BYTES_256b (SRC1, SRC2)


DEST[7:0] := SRC1[7:0]
DEST[15:8] := SRC2[7:0]
DEST[23:16] := SRC1[15:8]
DEST[31:24] := SRC2[15:8]
DEST[39:32] := SRC1[23:16]
DEST[47:40] := SRC2[23:16]
DEST[55:48] := SRC1[31:24]
DEST[63:56] := SRC2[31:24]
DEST[71:64] := SRC1[39:32]
DEST[79:72] := SRC2[39:32]
DEST[87:80] := SRC1[47:40]
DEST[95:88] := SRC2[47:40]
DEST[103:96] := SRC1[55:48]
DEST[111:104] := SRC2[55:48]
DEST[119:112] := SRC1[63:56]
DEST[127:120] := SRC2[63:56]
DEST[135:128] := SRC1[135:128]
DEST[143:136] := SRC2[135:128]
DEST[151:144] := SRC1[143:136]
DEST[159:152] := SRC2[143:136]
DEST[167:160] := SRC1[151:144]
DEST[175:168] := SRC2[151:144]
DEST[183:176] := SRC1[159:152]
DEST[191:184] := SRC2[159:152]
DEST[199:192] := SRC1[167:160]
DEST[207:200] := SRC2[167:160]
DEST[215:208] := SRC1[175:168]
DEST[223:216] := SRC2[175:168]
DEST[231:224] := SRC1[183:176]
DEST[239:232] := SRC2[183:176]
DEST[247:240] := SRC1[191:184]
DEST[255:248] := SRC2[191:184]

INTERLEAVE_BYTES (SRC1, SRC2)


DEST[7:0] := SRC1[7:0]
DEST[15:8] := SRC2[7:0]
DEST[23:16] := SRC1[15:8]
DEST[31:24] := SRC2[15:8]
DEST[39:32] := SRC1[23:16]
DEST[47:40] := SRC2[23:16]
DEST[55:48] := SRC1[31:24]
DEST[63:56] := SRC2[31:24]
DEST[71:64] := SRC1[39:32]
DEST[79:72] := SRC2[39:32]
DEST[87:80] := SRC1[47:40]
DEST[95:88] := SRC2[47:40]
DEST[103:96] := SRC1[55:48]
DEST[111:104] := SRC2[55:48]
DEST[119:112] := SRC1[63:56]
DEST[127:120] := SRC2[63:56]

INTERLEAVE_WORDS_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_WORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_WORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_WORDS_256b(SRC1, SRC2)
DEST[15:0] := SRC1[15:0]
DEST[31:16] := SRC2[15:0]
DEST[47:32] := SRC1[31:16]
DEST[63:48] := SRC2[31:16]
DEST[79:64] := SRC1[47:32]
DEST[95:80] := SRC2[47:32]
DEST[111:96] := SRC1[63:48]
DEST[127:112] := SRC2[63:48]
DEST[143:128] := SRC1[143:128]
DEST[159:144] := SRC2[143:128]
DEST[175:160] := SRC1[159:144]
DEST[191:176] := SRC2[159:144]
DEST[207:192] := SRC1[175:160]
DEST[223:208] := SRC2[175:160]
DEST[239:224] := SRC1[191:176]
DEST[255:240] := SRC2[191:176]

INTERLEAVE_WORDS (SRC1, SRC2)


DEST[15:0] := SRC1[15:0]
DEST[31:16] := SRC2[15:0]
DEST[47:32] := SRC1[31:16]
DEST[63:48] := SRC2[31:16]
DEST[79:64] := SRC1[47:32]
DEST[95:80] := SRC2[47:32]
DEST[111:96] := SRC1[63:48]
DEST[127:112] := SRC2[63:48]

INTERLEAVE_DWORDS_512b (SRC1, SRC2)


TMP_DEST[255:0] := INTERLEAVE_DWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_DWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_DWORDS_256b(SRC1, SRC2)
DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
DEST[159:128] := SRC1[159:128]
DEST[191:160] := SRC2[159:128]
DEST[223:192] := SRC1[191:160]
DEST[255:224] := SRC2[191:160]

INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]

INTERLEAVE_QWORDS_512b (SRC1, SRC2)
TMP_DEST[255:0] := INTERLEAVE_QWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_QWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_QWORDS_256b(SRC1, SRC2)
DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[191:128] := SRC1[191:128]
DEST[255:192] := SRC2[191:128]

INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]

PUNPCKLBW (128-bit Legacy SSE Version)
DEST[127:0] := INTERLEAVE_BYTES(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKLBW (VEX.128 Encoded Instruction)


DEST[127:0] := INTERLEAVE_BYTES(SRC1, SRC2)
DEST[MAXVL-1:127] := 0

VPUNPCKLBW (VEX.256 Encoded Instruction)


DEST[255:0] := INTERLEAVE_BYTES_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKLBW (EVEX Encoded Instructions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_BYTES(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_BYTES_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_BYTES_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKLWD (128-bit Legacy SSE Version)
DEST[127:0] := INTERLEAVE_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKLWD (VEX.128 Encoded Instruction)


DEST[127:0] := INTERLEAVE_WORDS(SRC1, SRC2)
DEST[MAXVL-1:127] := 0

VPUNPCKLWD (VEX.256 Encoded Instruction)


DEST[255:0] := INTERLEAVE_WORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKLWD (EVEX Encoded Instructions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_WORDS(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_WORDS_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_WORDS_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKLDQ (128-bit Legacy SSE Version)
DEST[127:0] := INTERLEAVE_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKLDQ (VEX.128 Encoded Instruction)


DEST[127:0] := INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKLDQ (VEX.256 Encoded Instruction)


DEST[255:0] := INTERLEAVE_DWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKLDQ (EVEX Encoded Instructions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_DWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_DWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_DWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

PUNPCKLQDQ (128-bit Legacy SSE Version)
DEST[127:0] := INTERLEAVE_QWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)

VPUNPCKLQDQ (VEX.128 Encoded Instruction)


DEST[127:0] := INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[MAXVL-1:128] := 0

VPUNPCKLQDQ (VEX.256 Encoded Instruction)


DEST[255:0] := INTERLEAVE_QWORDS_256b(SRC1, SRC2)
DEST[MAXVL-1:256] := 0

VPUNPCKLQDQ (EVEX Encoded Instructions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL = 128
TMP_DEST[VL-1:0] := INTERLEAVE_QWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
TMP_DEST[VL-1:0] := INTERLEAVE_QWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
TMP_DEST[VL-1:0] := INTERLEAVE_QWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPUNPCKLBW __m512i _mm512_unpacklo_epi8(__m512i a, __m512i b);
VPUNPCKLBW __m512i _mm512_mask_unpacklo_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPUNPCKLBW __m512i _mm512_maskz_unpacklo_epi8( __mmask64 k, __m512i a, __m512i b);

VPUNPCKLBW __m256i _mm256_mask_unpacklo_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPUNPCKLBW __m256i _mm256_maskz_unpacklo_epi8( __mmask32 k, __m256i a, __m256i b);
VPUNPCKLBW __m128i _mm_mask_unpacklo_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPUNPCKLBW __m128i _mm_maskz_unpacklo_epi8( __mmask16 k, __m128i a, __m128i b);
VPUNPCKLWD __m512i _mm512_unpacklo_epi16(__m512i a, __m512i b);
VPUNPCKLWD __m512i _mm512_mask_unpacklo_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPUNPCKLWD __m512i _mm512_maskz_unpacklo_epi16( __mmask32 k, __m512i a, __m512i b);
VPUNPCKLWD __m256i _mm256_mask_unpacklo_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPUNPCKLWD __m256i _mm256_maskz_unpacklo_epi16( __mmask16 k, __m256i a, __m256i b);
VPUNPCKLWD __m128i _mm_mask_unpacklo_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLWD __m128i _mm_maskz_unpacklo_epi16( __mmask8 k, __m128i a, __m128i b);
VPUNPCKLDQ __m512i _mm512_unpacklo_epi32(__m512i a, __m512i b);
VPUNPCKLDQ __m512i _mm512_mask_unpacklo_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPUNPCKLDQ __m512i _mm512_maskz_unpacklo_epi32( __mmask16 k, __m512i a, __m512i b);
VPUNPCKLDQ __m256i _mm256_mask_unpacklo_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKLDQ __m256i _mm256_maskz_unpacklo_epi32( __mmask8 k, __m256i a, __m256i b);
VPUNPCKLDQ __m128i _mm_mask_unpacklo_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLDQ __m128i _mm_maskz_unpacklo_epi32( __mmask8 k, __m128i a, __m128i b);
VPUNPCKLQDQ __m512i _mm512_unpacklo_epi64(__m512i a, __m512i b);
VPUNPCKLQDQ __m512i _mm512_mask_unpacklo_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPUNPCKLQDQ __m512i _mm512_maskz_unpacklo_epi64( __mmask8 k, __m512i a, __m512i b);
VPUNPCKLQDQ __m256i _mm256_mask_unpacklo_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKLQDQ __m256i _mm256_maskz_unpacklo_epi64( __mmask8 k, __m256i a, __m256i b);
VPUNPCKLQDQ __m128i _mm_mask_unpacklo_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLQDQ __m128i _mm_maskz_unpacklo_epi64( __mmask8 k, __m128i a, __m128i b);
PUNPCKLBW __m64 _mm_unpacklo_pi8 (__m64 m1, __m64 m2)
(V)PUNPCKLBW __m128i _mm_unpacklo_epi8 (__m128i m1, __m128i m2)
VPUNPCKLBW __m256i _mm256_unpacklo_epi8 (__m256i m1, __m256i m2)
PUNPCKLWD __m64 _mm_unpacklo_pi16 (__m64 m1, __m64 m2)
(V)PUNPCKLWD __m128i _mm_unpacklo_epi16 (__m128i m1, __m128i m2)
VPUNPCKLWD __m256i _mm256_unpacklo_epi16 (__m256i m1, __m256i m2)
PUNPCKLDQ __m64 _mm_unpacklo_pi32 (__m64 m1, __m64 m2)
(V)PUNPCKLDQ __m128i _mm_unpacklo_epi32 (__m128i m1, __m128i m2)
VPUNPCKLDQ __m256i _mm256_unpacklo_epi32 (__m256i m1, __m256i m2)
(V)PUNPCKLQDQ __m128i _mm_unpacklo_epi64 (__m128i m1, __m128i m2)
VPUNPCKLQDQ __m256i _mm256_unpacklo_epi64 (__m256i m1, __m256i m2)
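The interleaving behavior also lends itself to merging two separate data streams (for example, two image planes) into alternating element pairs. The following C sketch (illustrative only; it assumes an SSE2-capable compiler and the <emmintrin.h> header) uses _mm_unpacklo_epi8 and _mm_unpackhi_epi8 from the list above:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    char pa[16] = { 'a','a','a','a','a','a','a','a',
                    'a','a','a','a','a','a','a','a' };
    char pb[16] = { 'b','b','b','b','b','b','b','b',
                    'b','b','b','b','b','b','b','b' };
    __m128i a  = _mm_loadu_si128((const __m128i *)pa);
    __m128i b  = _mm_loadu_si128((const __m128i *)pb);
    __m128i lo = _mm_unpacklo_epi8(a, b);  /* a0 b0 ... a7 b7   */
    __m128i hi = _mm_unpackhi_epi8(a, b);  /* a8 b8 ... a15 b15 */

    char out[33] = { 0 };
    _mm_storeu_si128((__m128i *)&out[0],  lo);
    _mm_storeu_si128((__m128i *)&out[16], hi);
    printf("%s\n", out);                   /* prints "abab...ab" */
    return 0;
}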

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPUNPCKLDQ/QDQ, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPUNPCKLBW/WD, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Condi-
tions.”

PXOR—Logical Exclusive OR
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F EF /r1 A V/V MMX Bitwise XOR of mm/m64 and mm.
PXOR mm, mm/m64
66 0F EF /r A V/V SSE2 Bitwise XOR of xmm2/m128 and xmm1.
PXOR xmm1, xmm2/m128
VEX.128.66.0F.WIG EF /r B V/V AVX Bitwise XOR of xmm3/m128 and xmm2.
VPXOR xmm1, xmm2, xmm3/m128
VEX.256.66.0F.WIG EF /r B V/V AVX2 Bitwise XOR of ymm3/m256 and ymm2.
VPXOR ymm1, ymm2, ymm3/m256
EVEX.128.66.0F.W0 EF /r C V/V (AVX512VL AND Bitwise XOR of packed doubleword integers in
VPXORD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm2 and xmm3/m128/m32bcst using writemask k1.
xmm3/m128/m32bcst AVX10.12
EVEX.256.66.0F.W0 EF /r C V/V (AVX512VL AND Bitwise XOR of packed doubleword integers in
VPXORD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm2 and ymm3/m256/m32bcst using writemask k1.
ymm3/m256/m32bcst AVX10.12
EVEX.512.66.0F.W0 EF /r C V/V AVX512F Bitwise XOR of packed doubleword integers in
VPXORD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m32bcst using
zmm3/m512/m32bcst writemask k1.
EVEX.128.66.0F.W1 EF /r C V/V (AVX512VL AND Bitwise XOR of packed quadword integers in
VPXORQ xmm1 {k1}{z}, xmm2, AVX512F) OR xmm2 and xmm3/m128 using writemask k1.
xmm3/m128/m64bcst AVX10.12
EVEX.256.66.0F.W1 EF /r C V/V (AVX512VL AND Bitwise XOR of packed quadword integers in
VPXORQ ymm1 {k1}{z}, ymm2, AVX512F) OR ymm2 and ymm3/m256 using writemask k1.
ymm3/m256/m64bcst AVX10.12
EVEX.512.66.0F.W1 EF /r C V/V AVX512F Bitwise XOR of packed quadword integers in
VPXORQ zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m64bcst using
zmm3/m512/m64bcst writemask k1.

NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
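To make the run-time check in Note 2 concrete, the following is a minimal C sketch (an editorial illustration, not part of the architectural text). It assumes a GCC/Clang-style <cpuid.h>; the EBX bit position used for 512-bit vector support is an assumption that should be verified against the current CPUID leaf 24H documentation.

#include <cpuid.h>

/* Sketch: nonzero if CPUID leaf 24H enumerates 512-bit vector support.
   Assumes EBX[18] is the 512-bit vector-width bit; verify before use. */
static int avx10_supports_512(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx))
        return 0; /* leaf 24H not enumerated */
    return (ebx >> 18) & 1;
}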

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical exclusive-OR (XOR) operation on the source operand (second operand) and the destina-
tion operand (first operand) and stores the result in the destination operand. Each bit of the result is 1 if the corre-
sponding bits of the two operands are different; each bit is 0 if the corresponding bits of the operands are the
same.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions 64-bit operand: The source operand can be an MMX technology register or a 64-bit
memory location. The destination operand is an MMX technology register.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding register destination are zeroed.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
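As an illustration (an editorial sketch, not normative manual text), the EVEX masked forms are typically reached through the intrinsics listed later in this section; a minimal C example of merging-masking:

#include <immintrin.h>

/* Merging-masked VPXORD: lanes with k=0 keep the values already in old. */
__m512i xor_masked(__m512i old, __mmask16 k, __m512i a, __m512i b)
{
    return _mm512_mask_xor_epi32(old, k, a, b);
}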

Operation
PXOR (64-bit Operand)
DEST := DEST XOR SRC

PXOR (128-bit Legacy SSE Version)


DEST := DEST XOR SRC
DEST[MAXVL-1:128] (Unmodified)

VPXOR (VEX.128 Encoded Version)


DEST := SRC1 XOR SRC2
DEST[MAXVL-1:128] := 0

VPXOR (VEX.256 Encoded Version)


DEST := SRC1 XOR SRC2
DEST[MAXVL-1:256] := 0

VPXORD (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[31:0]
ELSE DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VPXORQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[63:0]
ELSE DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPXORD __m512i _mm512_xor_epi32(__m512i a, __m512i b)
VPXORD __m512i _mm512_mask_xor_epi32(__m512i s, __mmask16 m, __m512i a, __m512i b)
VPXORD __m512i _mm512_maskz_xor_epi32( __mmask16 m, __m512i a, __m512i b)
VPXORD __m256i _mm256_xor_epi32(__m256i a, __m256i b)
VPXORD __m256i _mm256_mask_xor_epi32(__m256i s, __mmask8 m, __m256i a, __m256i b)
VPXORD __m256i _mm256_maskz_xor_epi32( __mmask8 m, __m256i a, __m256i b)
VPXORD __m128i _mm_xor_epi32(__m128i a, __m128i b)
VPXORD __m128i _mm_mask_xor_epi32(__m128i s, __mmask8 m, __m128i a, __m128i b)
VPXORD __m128i _mm_maskz_xor_epi32( __mmask8 m, __m128i a, __m128i b)
VPXORQ __m512i _mm512_xor_epi64( __m512i a, __m512i b);
VPXORQ __m512i _mm512_mask_xor_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b);
VPXORQ __m512i _mm512_maskz_xor_epi64(__mmask8 m, __m512i a, __m512i b);
VPXORQ __m256i _mm256_xor_epi64( __m256i a, __m256i b);
VPXORQ __m256i _mm256_mask_xor_epi64(__m256i s, __mmask8 m, __m256i a, __m256i b);
VPXORQ __m256i _mm256_maskz_xor_epi64(__mmask8 m, __m256i a, __m256i b);
VPXORQ __m128i _mm_xor_epi64( __m128i a, __m128i b);
VPXORQ __m128i _mm_mask_xor_epi64(__m128i s, __mmask8 m, __m128i a, __m128i b);
VPXORQ __m128i _mm_maskz_xor_epi64(__mmask8 m, __m128i a, __m128i b);
PXOR __m64 _mm_xor_si64 (__m64 m1, __m64 m2)
(V)PXOR __m128i _mm_xor_si128 ( __m128i a, __m128i b)
VPXOR __m256i _mm256_xor_si256 ( __m256i a, __m256i b)

Flags Affected
None.

Numeric Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

RDMSRLIST—Read List of Model Specific Registers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 01 C6  RDMSRLIST | ZO | V/N.E. | MSRLIST | Read the requested list of MSRs, and store the read values to memory.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A

Description
This instruction reads a software-provided list of up to 64 MSRs and stores their values in memory.
RDMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address; see Note 1).
• RDI: Linear address of a table into which MSR data is stored (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, RDMSRLIST will read the MSR specified at entry [n] in the RSI-
based table and write it out to memory at the entry [n] in the RDI-based table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, RDMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the RIP register saved will point to the RDMSRLIST instruc-
tion while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR-specific checks in the same manner as RDMSR.
Although RDMSRLIST accesses the entries in the two tables in order, the actual reads of the MSRs may be
performed out of order: for table entries m < n, the processor may read the MSR for entry n before reading the
MSR for entry m. (This may be true also for a sequence of executions of RDMSR.) Ordering is guaranteed if the
address of the IA32_BARRIER MSR (2FH) appears in the table of MSR addresses. Specifically, if IA32_BARRIER
appears at entry m, then the MSR read for any entry n with n > m will not occur until (1) all instructions prior to
RDMSRLIST have completed locally; and (2) MSRs have been read for all table entries before entry m.
The processor is allowed (but not required) to “load ahead” in the list. For example, it may cause a page fault for
an access to a table entry after the nth, despite the processor having read only n MSRs (see Note 2).
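A minimal CPL-0 sketch (an editorial illustration, assuming a 64-bit GCC-style toolchain; the .byte sequence is the F2 0F 01 C6 encoding given above, for assemblers that do not yet accept the mnemonic):

#include <stdint.h>

/* Read IA32_MPERF (E7H) and IA32_APERF (E8H) with one RDMSRLIST. */
static inline void read_two_msrs(uint64_t out[2])
{
    static const uint64_t addrs[2] = { 0xE7, 0xE8 }; /* 8 bytes per entry */
    uint64_t mask = 0x3;                             /* entries 0 and 1 valid */
    asm volatile(".byte 0xF2, 0x0F, 0x01, 0xC6"      /* RDMSRLIST */
                 : "+c"(mask)
                 : "S"(addrs), "D"(out)
                 : "memory");
    /* On full completion the instruction has cleared mask (RCX) to 0. */
}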

Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
IF RDMSR of the MSR with address MSR_address would #GP THEN #GP(0); FI;
Store the value of the MSR with address MSR_address into 8 bytes at the linear address RDI + (MSR_index * 8);
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;

NOTES:
1. Since MSR addresses are only 32 bits wide, bits 63:32 of each MSR address table entry are reserved.
2. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only having completed the MSR reads up to entry 5.

Flags Affected
None.

Protected Mode Exceptions


#UD The RDMSRLIST instruction is not recognized in protected mode.

Real-Address Mode Exceptions


#UD The RDMSRLIST instruction is not recognized in real-address mode.

Virtual-8086 Mode Exceptions


#UD The RDMSRLIST instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions


#UD The RDMSRLIST instruction is not recognized in compatibility mode.

64-Bit Mode Exceptions
#GP(0) If the current privilege level is not 0.
If RSI [2:0] ≠ 0, RDI [2:0] ≠ 0, or bits 63:32 of an MSR-address table entry are not all zero.
If an execution of RDMSR from a specified MSR would generate a general protection exception
#GP(0).
#UD If the LOCK prefix is used.
If CPUID.(EAX=07H, ECX=01H):EAX.MSRLIST[bit 27] = 0.

RDPMC—Read Performance-Monitoring Counters
Opcode/Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 33  RDPMC | ZO | Valid | Valid | Read performance-monitoring counter specified by ECX into EDX:EAX.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A

Description
Reads the contents of the performance monitoring counter (PMC) specified in the ECX register into registers EDX:EAX.
(On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The EDX register
is loaded with the high-order 32 bits of the PMC and the EAX register is loaded with the low-order 32 bits. (On
processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.) If
fewer than 64 bits are implemented in the PMC being read, unimplemented bits returned to EDX:EAX will have
value zero.
The width of PMCs on processors supporting architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0) is
reported by CPUID.0AH:EAX[23:16]. On processors that do not support architectural performance monitoring
(CPUID.0AH:EAX[7:0]=0), the width of general-purpose performance PMCs is 40 bits, while the widths of
special-purpose PMCs are implementation specific.
Use of ECX to specify a PMC depends on whether the processor supports architectural performance monitoring:
• If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0]
specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode,
RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero.
• If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16]
specifies the type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following
PMC types are currently defined:
— General-purpose counters use type 0. To read IA32_PMCx, one of the following must hold for the index x:
• It is less than the value enumerated by CPUID.0AH.EAX[15:8]; or
• It is at most 31 and the value enumerated by CPUID.(EAX=23H,ECX=1):EAX[bit x] is 1.
— Fixed-function counters use type 4000H. To read IA32_FIXED_CTRx, one of the following must hold for the
index x:
• It is less than the value enumerated by CPUID.0AH:EDX[4:0];
• It is at most 31 and the value enumerated by CPUID.0AH:ECX[bit x] is 1; or
• It is at most 31 and the value enumerated by CPUID.(EAX=23H,ECX=1):EBX[bit x] is 1.
— Performance metrics use type 2000H. This type can be used only if
IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is
implementation specific.
Specifying an unsupported PMC encoding will cause a general protection exception #GP(0). For PMC details see
Chapter 21, “Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B.
When in protected or virtual 8086 mode, the Performance-monitoring Counters Enabled (PCE) flag in register
CR4 restricts the use of the RDPMC instruction. When the PCE flag is set, the RDPMC instruction can be executed
at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0. (When in real-
address mode, the RDPMC instruction is always enabled.) The PMCs can also be read with the RDMSR instruction,
when executing at privilege level 0.
Processors that support performance metrics may also support clearing them on read if the
IA32_PERF_CAPABILITIES.RDPMC_METRICS_CLEAR[bit 19] is set. Since the IA32_PERF_CAPABILITIES MSR
enumerates non-architectural PMU features, software should check DisplayFamily and DisplayModel to confirm
that the processor supports the functionality described in the next paragraph.
When the IA32_FIXED_CTR_CTRL.METRICS_CLEAR_EN[bit 14] is set, an RDPMC instruction for PERF_METRICS
(that is, when ECX = 20000000H) clears PERF_METRICS-related resources as well as fixed-function performance
monitoring counter 3 after the read is performed. When METRICS_CLEAR_EN is clear, the RDPMC instruction only
reads PERF_METRICS.
The RDPMC instruction is not a serializing instruction; that is, it does not imply that all the events caused by the
preceding instructions have been completed or that events caused by subsequent instructions have not begun. If
an exact event count is desired, software must insert a serializing instruction (such as the CPUID instruction)
before and/or after the RDPMC instruction.
Back-to-back fast reads are not guaranteed to be monotonic. To guarantee monotonicity on back-to-back reads,
a serializing instruction must be placed between the two RDPMC instructions.
The RDPMC instruction can execute in 16-bit addressing mode or virtual-8086 mode; however, the full contents of
the ECX register are used to select the PMC, and the event count is stored in the full EAX and EDX registers. The
RDPMC instruction was introduced into the IA-32 Architecture in the Pentium Pro processor and the Pentium
processor with MMX technology. The earlier Pentium processors have PMCs, but they must be read with the RDMSR
instruction.
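To make the ECX encoding concrete, here is a minimal C sketch (an editorial illustration, assuming a GCC-style toolchain; the caller must be at privilege level 0 or have CR4.PCE set):

#include <stdint.h>

static inline uint64_t rdpmc(uint32_t ecx)
{
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ecx));
    return ((uint64_t)hi << 32) | lo;
}

/* Example: fixed-function counters use type 4000H in ECX[31:16], so
   ECX = 40000000H selects IA32_FIXED_CTR0. */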

Operation
MSCB = Most Significant Counter Bit (* Model-specific *)
IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter))
THEN
EAX := counter[31:0];
EDX := ZeroExtend(counter[MSCB:32]);
ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
#GP(0);
FI;

Flags Affected
None.

Protected Mode Exceptions


#GP(0) If the current privilege level is not 0 and the PCE flag in the CR4 register is clear.
If an invalid performance counter index is specified.
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions


#GP If an invalid performance counter index is specified.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions


#GP(0) If the PCE flag in the CR4 register is clear.
If an invalid performance counter index is specified.
#UD If the LOCK prefix is used.

Compatibility Mode Exceptions


Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0) If the current privilege level is not 0 and the PCE flag in the CR4 register is clear.
If an invalid performance counter index is specified.
#UD If the LOCK prefix is used.

SHUFPD—Packed Interleave Shuffle of Pairs of Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F C6 /r ib  SHUFPD xmm1, xmm2/m128, imm8 | A | V/V | SSE2 | Shuffle two pairs of double precision floating-point values from xmm1 and xmm2/m128 using imm8 to select from each pair; interleaved result is stored in xmm1.
VEX.128.66.0F.WIG C6 /r ib  VSHUFPD xmm1, xmm2, xmm3/m128, imm8 | B | V/V | AVX | Shuffle two pairs of double precision floating-point values from xmm2 and xmm3/m128 using imm8 to select from each pair; interleaved result is stored in xmm1.
VEX.256.66.0F.WIG C6 /r ib  VSHUFPD ymm1, ymm2, ymm3/m256, imm8 | B | V/V | AVX | Shuffle four pairs of double precision floating-point values from ymm2 and ymm3/m256 using imm8 to select from each pair; interleaved result is stored in ymm1.
EVEX.128.66.0F.W1 C6 /r ib  VSHUFPD xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Shuffle two pairs of double precision floating-point values from xmm2 and xmm3/m128/m64bcst using imm8 to select from each pair; store interleaved results in xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 C6 /r ib  VSHUFPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Shuffle four pairs of double precision floating-point values from ymm2 and ymm3/m256/m64bcst using imm8 to select from each pair; store interleaved results in ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 C6 /r ib  VSHUFPD zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | C | V/V | AVX512F OR AVX10.1 (1) | Shuffle eight pairs of double precision floating-point values from zmm2 and zmm3/m512/m64bcst using imm8 to select from each pair; store interleaved results in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Selects a double precision floating-point value of an input pair using a bit control and move to a designated element
of the destination operand. The low-to-high order of double precision element of the destination operand is inter-
leaved between the first source operand and the second source operand at the granularity of input pair of 128 bits.
Each bit in the imm8 byte, starting from bit 0, is the select control of the corresponding element of the destination
to received the shuffled result of an input pair.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
The select controls are the lower 8/4/2 bits of the imm8 byte.

SHUFPD—Packed Interleave Shuffle of Pairs of Double Precision Floating-Point Values Vol. 2B 4-637
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The select controls are bits 3:0
of the imm8 byte; imm8[7:4] are ignored.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128)
of the corresponding ZMM register destination are zeroed. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination operand and the first source operand are the same XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are unmodified. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.

SRC1 X3 X2 X1 X0

SRC2 Y3 Y2 Y1 Y0

DEST Y2 or Y3 X2 or X3 Y0 or Y1 X0 or X1

Figure 4-25. 256-bit VSHUFPD Operation of Four Pairs of Double Precision Floating-Point Values
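As a concrete illustration of the bit controls (an editorial sketch using the intrinsics listed later in this section):

#include <immintrin.h>

/* imm8 = 0FH: within each 128-bit lane take the high element of both
   sources, i.e., dest = { a1, b1, a3, b3 } in Figure 4-25 terms. */
__m256d high_of_each_pair(__m256d a, __m256d b)
{
    return _mm256_shuffle_pd(a, b, 0xF);
}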

Operation
VSHUFPD (EVEX Encoded Versions When SRC2 is a Vector Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF IMM0[0] = 0
THEN TMP_DEST[63:0] := SRC1[63:0]
ELSE TMP_DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN TMP_DEST[127:64] := SRC2[63:0]
ELSE TMP_DEST[127:64] := SRC2[127:64] FI;
IF VL >= 256
IF IMM0[2] = 0
THEN TMP_DEST[191:128] := SRC1[191:128]
ELSE TMP_DEST[191:128] := SRC1[255:192] FI;
IF IMM0[3] = 0
THEN TMP_DEST[255:192] := SRC2[191:128]
ELSE TMP_DEST[255:192] := SRC2[255:192] FI;
FI;
IF VL >= 512
IF IMM0[4] = 0
THEN TMP_DEST[319:256] := SRC1[319:256]
ELSE TMP_DEST[319:256] := SRC1[383:320] FI;
IF IMM0[5] = 0
THEN TMP_DEST[383:320] := SRC2[319:256]
ELSE TMP_DEST[383:320] := SRC2[383:320] FI;
IF IMM0[6] = 0
THEN TMP_DEST[447:384] := SRC1[447:384]
ELSE TMP_DEST[447:384] := SRC1[511:448] FI;
IF IMM0[7] = 0
THEN TMP_DEST[511:448] := SRC2[447:384]
ELSE TMP_DEST[511:448] := SRC2[511:448] FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFPD (EVEX Encoded Versions When SRC2 is Memory)


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF IMM0[0] = 0
THEN TMP_DEST[63:0] := SRC1[63:0]
ELSE TMP_DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN TMP_DEST[127:64] := TMP_SRC2[63:0]
ELSE TMP_DEST[127:64] := TMP_SRC2[127:64] FI;
IF VL >= 256
IF IMM0[2] = 0
THEN TMP_DEST[191:128] := SRC1[191:128]
ELSE TMP_DEST[191:128] := SRC1[255:192] FI;
IF IMM0[3] = 0
THEN TMP_DEST[255:192] := TMP_SRC2[191:128]
ELSE TMP_DEST[255:192] := TMP_SRC2[255:192] FI;
FI;
IF VL >= 512
IF IMM0[4] = 0
THEN TMP_DEST[319:256] := SRC1[319:256]
ELSE TMP_DEST[319:256] := SRC1[383:320] FI;
IF IMM0[5] = 0
THEN TMP_DEST[383:320] := TMP_SRC2[319:256]
ELSE TMP_DEST[383:320] := TMP_SRC2[383:320] FI;
IF IMM0[6] = 0
THEN TMP_DEST[447:384] := SRC1[447:384]
ELSE TMP_DEST[447:384] := SRC1[511:448] FI;
IF IMM0[7] = 0
THEN TMP_DEST[511:448] := TMP_SRC2[447:384]
ELSE TMP_DEST[511:448] := TMP_SRC2[511:448] FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFPD (VEX.256 Encoded Version)


IF IMM0[0] = 0
THEN DEST[63:0] := SRC1[63:0]
ELSE DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN DEST[127:64] := SRC2[63:0]
ELSE DEST[127:64] := SRC2[127:64] FI;
IF IMM0[2] = 0
THEN DEST[191:128] := SRC1[191:128]
ELSE DEST[191:128] := SRC1[255:192] FI;
IF IMM0[3] = 0
THEN DEST[255:192] := SRC2[191:128]
ELSE DEST[255:192] := SRC2[255:192] FI;
DEST[MAXVL-1:256] (Unmodified)

VSHUFPD (VEX.128 Encoded Version)


IF IMM0[0] = 0
THEN DEST[63:0] := SRC1[63:0]
ELSE DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN DEST[127:64] := SRC2[63:0]
ELSE DEST[127:64] := SRC2[127:64] FI;
DEST[MAXVL-1:128] := 0

VSHUFPD (128-bit Legacy SSE Version)


IF IMM0[0] = 0
THEN DEST[63:0] := SRC1[63:0]
ELSE DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN DEST[127:64] := SRC2[63:0]
ELSE DEST[127:64] := SRC2[127:64] FI;
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VSHUFPD __m512d _mm512_shuffle_pd(__m512d a, __m512d b, int imm);
VSHUFPD __m512d _mm512_mask_shuffle_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFPD __m512d _mm512_maskz_shuffle_pd( __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFPD __m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int select);
VSHUFPD __m256d _mm256_mask_shuffle_pd(__m256d s, __mmask8 k, __m256d a, __m256d b, int imm);
VSHUFPD __m256d _mm256_maskz_shuffle_pd( __mmask8 k, __m256d a, __m256d b, int imm);
SHUFPD __m128d _mm_shuffle_pd (__m128d a, __m128d b, const int select);
VSHUFPD __m128d _mm_mask_shuffle_pd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VSHUFPD __m128d _mm_maskz_shuffle_pd( __mmask8 k, __m128d a, __m128d b, int imm);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”

SHUFPS—Packed Interleave Shuffle of Quadruplets of Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F C6 /r ib  SHUFPS xmm1, xmm2/m128, imm8 | A | V/V | SSE | Select from quadruplet of single precision floating-point values in xmm1 and xmm2/m128 using imm8; interleaved result pairs are stored in xmm1.
VEX.128.0F.WIG C6 /r ib  VSHUFPS xmm1, xmm2, xmm3/m128, imm8 | B | V/V | AVX | Select from quadruplet of single precision floating-point values in xmm2 and xmm3/m128 using imm8; interleaved result pairs are stored in xmm1.
VEX.256.0F.WIG C6 /r ib  VSHUFPS ymm1, ymm2, ymm3/m256, imm8 | B | V/V | AVX | Select from quadruplet of single precision floating-point values in ymm2 and ymm3/m256 using imm8; interleaved result pairs are stored in ymm1.
EVEX.128.0F.W0 C6 /r ib  VSHUFPS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Select from quadruplet of single precision floating-point values in xmm2 and xmm3/m128/m32bcst using imm8; interleaved result pairs are stored in xmm1, subject to writemask k1.
EVEX.256.0F.W0 C6 /r ib  VSHUFPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Select from quadruplet of single precision floating-point values in ymm2 and ymm3/m256/m32bcst using imm8; interleaved result pairs are stored in ymm1, subject to writemask k1.
EVEX.512.0F.W0 C6 /r ib  VSHUFPS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | C | V/V | AVX512F OR AVX10.1 (1) | Select from quadruplet of single precision floating-point values in zmm2 and zmm3/m512/m32bcst using imm8; interleaved result pairs are stored in zmm1, subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) imm8 N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Selects a single precision floating-point value of an input quadruplet using a two-bit control and move to a desig-
nated element of the destination operand. Each 64-bit element-pair of a 128-bit lane of the destination operand is
interleaved between the corresponding lane of the first source operand and the second source operand at the gran-
ularity 128 bits. Each two bits in the imm8 byte, starting from bit 0, is the select control of the corresponding
element of a 128-bit lane of the destination to received the shuffled result of an input quadruplet. The two lower
elements of a 128-bit lane in the destination receives shuffle results from the quadruple of the first source operand.
The next two elements of the destination receives shuffle results from the quadruple of the second source operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
imm8[7:0] provides 4 select controls for each applicable 128-bit lane of the destination.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Imm8[7:0] provides 4 select
controls for the high and low 128 bits of the destination.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128)
of the corresponding ZMM register destination are zeroed. Imm8[7:0] provides 4 select controls for each element of
the destination.
128-bit Legacy SSE version: The source can be an XMM register or a 128-bit memory location. The destination is
not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM
register destination are unmodified. Imm8[7:0] provides 4 select controls for each element of the destination.

SRC1 X7 X6 X5 X4 X3 X2 X1 X0

SRC2 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0

DEST Y7 .. Y4 Y7 .. Y4 X7 .. X4 X7 .. X4 Y3 ..Y0 Y3 ..Y0 X3 .. X0 X3 .. X0

Figure 4-26. 256-bit VSHUFPS Operation of Selection from Input Quadruplet and Pair-wise Interleaved Result
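As a concrete illustration of the two-bit selectors (an editorial sketch; the _MM_SHUFFLE macro packs four 2-bit fields into imm8):

#include <immintrin.h>

/* With both sources equal, selectors 0,1,2,3 reverse the element order. */
__m128 reverse4(__m128 x)
{
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
}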

Operation
Select4(SRC, control) {
CASE (control[1:0]) OF
0: TMP := SRC[31:0];
1: TMP := SRC[63:32];
2: TMP := SRC[95:64];
3: TMP := SRC[127:96];
ESAC;
RETURN TMP
}

VSHUFPS (EVEX Encoded Versions When SRC2 is a Vector Register)


(KL, VL) = (4, 128), (8, 256), (16, 512)

TMP_DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);


TMP_DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] := Select4(SRC2[127:0], imm8[5:4]);
TMP_DEST[127:96] := Select4(SRC2[127:0], imm8[7:6]);
IF VL >= 256
TMP_DEST[159:128] := Select4(SRC1[255:128], imm8[1:0]);
TMP_DEST[191:160] := Select4(SRC1[255:128], imm8[3:2]);
TMP_DEST[223:192] := Select4(SRC2[255:128], imm8[5:4]);
TMP_DEST[255:224] := Select4(SRC2[255:128], imm8[7:6]);
FI;
IF VL >= 512
TMP_DEST[287:256] := Select4(SRC1[383:256], imm8[1:0]);
TMP_DEST[319:288] := Select4(SRC1[383:256], imm8[3:2]);
TMP_DEST[351:320] := Select4(SRC2[383:256], imm8[5:4]);
TMP_DEST[383:352] := Select4(SRC2[383:256], imm8[7:6]);
TMP_DEST[415:384] := Select4(SRC1[511:384], imm8[1:0]);
TMP_DEST[447:416] := Select4(SRC1[511:384], imm8[3:2]);
TMP_DEST[479:448] := Select4(SRC2[511:384], imm8[5:4]);
TMP_DEST[511:480] := Select4(SRC2[511:384], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFPS (EVEX Encoded Versions When SRC2 is Memory)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
TMP_DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
TMP_DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] := Select4(TMP_SRC2[127:0], imm8[5:4]);
TMP_DEST[127:96] := Select4(TMP_SRC2[127:0], imm8[7:6]);
IF VL >= 256
TMP_DEST[159:128] := Select4(SRC1[255:128], imm8[1:0]);
TMP_DEST[191:160] := Select4(SRC1[255:128], imm8[3:2]);
TMP_DEST[223:192] := Select4(TMP_SRC2[255:128], imm8[5:4]);
TMP_DEST[255:224] := Select4(TMP_SRC2[255:128], imm8[7:6]);
FI;
IF VL >= 512
TMP_DEST[287:256] := Select4(SRC1[383:256], imm8[1:0]);
TMP_DEST[319:288] := Select4(SRC1[383:256], imm8[3:2]);
TMP_DEST[351:320] := Select4(TMP_SRC2[383:256], imm8[5:4]);
TMP_DEST[383:352] := Select4(TMP_SRC2[383:256], imm8[7:6]);
TMP_DEST[415:384] := Select4(SRC1[511:384], imm8[1:0]);
TMP_DEST[447:416] := Select4(SRC1[511:384], imm8[3:2]);
TMP_DEST[479:448] := Select4(TMP_SRC2[511:384], imm8[5:4]);
TMP_DEST[511:480] := Select4(TMP_SRC2[511:384], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFPS (VEX.256 Encoded Version)


DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] := Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] := Select4(SRC2[127:0], imm8[7:6]);
DEST[159:128] := Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] := Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] := Select4(SRC2[255:128], imm8[5:4]);
DEST[255:224] := Select4(SRC2[255:128], imm8[7:6]);
DEST[MAXVL-1:256] := 0

VSHUFPS (VEX.128 Encoded Version)


DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] := Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] := Select4(SRC2[127:0], imm8[7:6]);
DEST[MAXVL-1:128] := 0

SHUFPS (128-bit Legacy SSE Version)


DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] := Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] := Select4(SRC2[127:0], imm8[7:6]);
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSHUFPS __m512 _mm512_shuffle_ps(__m512 a, __m512 b, int imm);
VSHUFPS __m512 _mm512_mask_shuffle_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);
VSHUFPS __m512 _mm512_maskz_shuffle_ps(__mmask16 k, __m512 a, __m512 b, int imm);
VSHUFPS __m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int select);
VSHUFPS __m256 _mm256_mask_shuffle_ps(__m256 s, __mmask8 k, __m256 a, __m256 b, int imm);
VSHUFPS __m256 _mm256_maskz_shuffle_ps(__mmask8 k, __m256 a, __m256 b, int imm);
SHUFPS __m128 _mm_shuffle_ps (__m128 a, __m128 b, const int select);
VSHUFPS __m128 _mm_mask_shuffle_ps(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VSHUFPS __m128 _mm_maskz_shuffle_ps(__mmask8 k, __m128 a, __m128 b, int imm);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”

SQRTPD—Square Root of Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 51 /r  SQRTPD xmm1, xmm2/m128 | A | V/V | SSE2 | Computes square roots of the packed double precision floating-point values in xmm2/m128 and stores the result in xmm1.
VEX.128.66.0F.WIG 51 /r  VSQRTPD xmm1, xmm2/m128 | A | V/V | AVX | Computes square roots of the packed double precision floating-point values in xmm2/m128 and stores the result in xmm1.
VEX.256.66.0F.WIG 51 /r  VSQRTPD ymm1, ymm2/m256 | A | V/V | AVX | Computes square roots of the packed double precision floating-point values in ymm2/m256 and stores the result in ymm1.
EVEX.128.66.0F.W1 51 /r  VSQRTPD xmm1 {k1}{z}, xmm2/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Computes square roots of the packed double precision floating-point values in xmm2/m128/m64bcst and stores the result in xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 51 /r  VSQRTPD ymm1 {k1}{z}, ymm2/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Computes square roots of the packed double precision floating-point values in ymm2/m256/m64bcst and stores the result in ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 51 /r  VSQRTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | B | V/V | AVX512F OR AVX10.1 (1) | Computes square roots of the packed double precision floating-point values in zmm2/m512/m64bcst and stores the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Performs a SIMD computation of the square roots of the two, four or eight packed double precision floating-point
values in the source operand (the second operand) stores the packed double precision floating-point results in the
destination operand (the first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: the source operand second source operand or a 128-bit memory location. The destina-
tion operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destina-
tion is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM
register destination are unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
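An editorial sketch of the {er} (embedded rounding) form through intrinsics; the rounding control overrides MXCSR.RC and is available only at 512-bit vector length:

#include <immintrin.h>

/* Round-to-nearest with exceptions suppressed, independent of MXCSR. */
__m512d sqrt_rn(__m512d a)
{
    return _mm512_sqrt_round_pd(a, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}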

Operation
VSQRTPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := SQRT(SRC[63:0])
ELSE DEST[i+63:i] := SQRT(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSQRTPD (VEX.256 Encoded Version)


DEST[63:0] := SQRT(SRC[63:0])
DEST[127:64] := SQRT(SRC[127:64])
DEST[191:128] := SQRT(SRC[191:128])
DEST[255:192] := SQRT(SRC[255:192])
DEST[MAXVL-1:256] := 0
VSQRTPD (VEX.128 Encoded Version)
DEST[63:0] := SQRT(SRC[63:0])
DEST[127:64] := SQRT(SRC[127:64])
DEST[MAXVL-1:128] := 0

SQRTPD (128-bit Legacy SSE Version)


DEST[63:0] := SQRT(SRC[63:0])
DEST[127:64] := SQRT(SRC[127:64])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSQRTPD __m512d _mm512_sqrt_round_pd(__m512d a, int r);
VSQRTPD __m512d _mm512_mask_sqrt_round_pd(__m512d s, __mmask8 k, __m512d a, int r);
VSQRTPD __m512d _mm512_maskz_sqrt_round_pd( __mmask8 k, __m512d a, int r);
VSQRTPD __m256d _mm256_sqrt_pd (__m256d a);
VSQRTPD __m256d _mm256_mask_sqrt_pd(__m256d s, __mmask8 k, __m256d a);
VSQRTPD __m256d _mm256_maskz_sqrt_pd( __mmask8 k, __m256d a);
SQRTPD __m128d _mm_sqrt_pd (__m128d a);
VSQRTPD __m128d _mm_mask_sqrt_pd(__m128d s, __mmask8 k, __m128d a);
VSQRTPD __m128d _mm_maskz_sqrt_pd( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions
Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions,” additionally:
#UD If EVEX.vvvv != 1111B.

SQRTPS—Square Root of Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 51 /r  SQRTPS xmm1, xmm2/m128 | A | V/V | SSE | Computes square roots of the packed single precision floating-point values in xmm2/m128 and stores the result in xmm1.
VEX.128.0F.WIG 51 /r  VSQRTPS xmm1, xmm2/m128 | A | V/V | AVX | Computes square roots of the packed single precision floating-point values in xmm2/m128 and stores the result in xmm1.
VEX.256.0F.WIG 51 /r  VSQRTPS ymm1, ymm2/m256 | A | V/V | AVX | Computes square roots of the packed single precision floating-point values in ymm2/m256 and stores the result in ymm1.
EVEX.128.0F.W0 51 /r  VSQRTPS xmm1 {k1}{z}, xmm2/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Computes square roots of the packed single precision floating-point values in xmm2/m128/m32bcst and stores the result in xmm1 subject to writemask k1.
EVEX.256.0F.W0 51 /r  VSQRTPS ymm1 {k1}{z}, ymm2/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Computes square roots of the packed single precision floating-point values in ymm2/m256/m32bcst and stores the result in ymm1 subject to writemask k1.
EVEX.512.0F.W0 51 /r  VSQRTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1 (1) | Computes square roots of the packed single precision floating-point values in zmm2/m512/m32bcst and stores the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Performs a SIMD computation of the square roots of the four, eight or sixteen packed single precision floating-point
values in the source operand (second operand) stores the packed single precision floating-point results in the desti-
nation operand.
EVEX.512 encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location
or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: the source operand second source operand or a 128-bit memory location. The destina-
tion operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destina-
tion is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM
register destination are unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
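An editorial sketch of zeroing-masking: because masked-off elements do not signal SIMD floating-point exceptions, a writemask can be used to avoid the Invalid exception on negative inputs:

#include <immintrin.h>

/* Square roots of the non-negative lanes; negative lanes become +0.0. */
__m512 sqrt_nonnegative(__m512 a)
{
    __mmask16 k = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GE_OQ);
    return _mm512_maskz_sqrt_ps(k, a);
}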

Operation
VSQRTPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := SQRT(SRC[31:0])
ELSE DEST[i+31:i] := SQRT(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSQRTPS (VEX.256 Encoded Version)


DEST[31:0] := SQRT(SRC[31:0])
DEST[63:32] := SQRT(SRC[63:32])
DEST[95:64] := SQRT(SRC[95:64])
DEST[127:96] := SQRT(SRC[127:96])
DEST[159:128] := SQRT(SRC[159:128])
DEST[191:160] := SQRT(SRC[191:160])
DEST[223:192] := SQRT(SRC[223:192])
DEST[255:224] := SQRT(SRC[255:224])
DEST[MAXVL-1:256] := 0

VSQRTPS (VEX.128 Encoded Version)


DEST[31:0] := SQRT(SRC[31:0])
DEST[63:32] := SQRT(SRC[63:32])
DEST[95:64] := SQRT(SRC[95:64])
DEST[127:96] := SQRT(SRC[127:96])
DEST[MAXVL-1:128] := 0

SQRTPS (128-bit Legacy SSE Version)


DEST[31:0] := SQRT(SRC[31:0])
DEST[63:32] := SQRT(SRC[63:32])
DEST[95:64] := SQRT(SRC[95:64])
DEST[127:96] := SQRT(SRC[127:96])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VSQRTPS __m512 _mm512_sqrt_round_ps(__m512 a, int r);
VSQRTPS __m512 _mm512_mask_sqrt_round_ps(__m512 s, __mmask16 k, __m512 a, int r);
VSQRTPS __m512 _mm512_maskz_sqrt_round_ps( __mmask16 k, __m512 a, int r);
VSQRTPS __m256 _mm256_sqrt_ps (__m256 a);
VSQRTPS __m256 _mm256_mask_sqrt_ps(__m256 s, __mmask8 k, __m256 a);
VSQRTPS __m256 _mm256_maskz_sqrt_ps( __mmask8 k, __m256 a);
SQRTPS __m128 _mm_sqrt_ps (__m128 a);
VSQRTPS __m128 _mm_mask_sqrt_ps(__m128 s, __mmask8 k, __m128 a);
VSQRTPS __m128 _mm_maskz_sqrt_ps( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions


Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions,” additionally:
#UD If EVEX.vvvv != 1111B.

SQRTSD—Compute Square Root of Scalar Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 51 /r  SQRTSD xmm1, xmm2/m64 | A | V/V | SSE2 | Computes square root of the low double precision floating-point value in xmm2/m64 and stores the results in xmm1.
VEX.LIG.F2.0F.WIG 51 /r  VSQRTSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Computes square root of the low double precision floating-point value in xmm3/m64 and stores the results in xmm1. Also, upper double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
EVEX.LLIG.F2.0F.W1 51 /r  VSQRTSD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | C | V/V | AVX512F OR AVX10.1 (1) | Computes square root of the low double precision floating-point value in xmm3/m64 and stores the results in xmm1 under writemask k1. Also, upper double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Computes the square root of the low double precision floating-point value in the second source operand and stores
the double precision floating-point result in the destination operand. The second source operand can be an XMM
register or a 64-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:64)
of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits 127:64 of the destination operand are copied from the corresponding
bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VSQRTSD is encoded with VEX.L=0. Encoding VSQRTSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
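An editorial sketch of the scalar merge behavior through the intrinsic listed below:

#include <immintrin.h>

/* Square root of b's low element; the upper element is copied from a,
   matching the VEX/EVEX forms described above. */
__m128d sqrt_low(__m128d a, __m128d b)
{
    return _mm_sqrt_sd(a, b);
}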

Operation
VSQRTSD (EVEX Encoded Version)
IF (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SQRT(SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VSQRTSD (VEX.128 Encoded Version)


DEST[63:0] := SQRT(SRC2[63:0])
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

SQRTSD (128-bit Legacy SSE Version)


DEST[63:0] := SQRT(SRC[63:0])
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSQRTSD __m128d _mm_sqrt_round_sd(__m128d a, __m128d b, int r);
VSQRTSD __m128d _mm_mask_sqrt_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int r);
VSQRTSD __m128d _mm_maskz_sqrt_round_sd(__mmask8 k, __m128d a, __m128d b, int r);
SQRTSD __m128d _mm_sqrt_sd (__m128d a, __m128d b)

SIMD Floating-Point Exceptions


Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

SQRTSS—Compute Square Root of Scalar Single Precision Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 51 /r  SQRTSS xmm1, xmm2/m32 | A | V/V | SSE | Computes square root of the low single precision floating-point value in xmm2/m32 and stores the results in xmm1.
VEX.LIG.F3.0F.WIG 51 /r  VSQRTSS xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Computes square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
EVEX.LLIG.F3.0F.W0 51 /r  VSQRTSS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | C | V/V | AVX512F OR AVX10.1 (1) | Computes square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1 under writemask k1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Computes the square root of the low single precision floating-point value in the second source operand and stores
the single precision floating-point result in the destination operand. The second source operand can be an XMM
register or a 32-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:32) of the corresponding YMM destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits 127:32 of the destination operand are copied from the corresponding
bits of the first source operand. Bits (MAXVL-1:128) of the destination ZMM register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VSQRTSS is encoded with VEX.L=0. Encoding VSQRTSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
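An editorial sketch of the EVEX {er} scalar form through intrinsics:

#include <immintrin.h>

/* Round toward zero with exceptions suppressed, independent of MXCSR.RC. */
__m128 sqrt_ss_rz(__m128 a, __m128 b)
{
    return _mm_sqrt_round_ss(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}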

Operation
VSQRTSS (EVEX Encoded Version)
IF (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := SQRT(SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VSQRTSS (VEX.128 Encoded Version)


DEST[31:0] := SQRT(SRC2[31:0])
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

SQRTSS (128-bit Legacy SSE Version)


DEST[31:0] := SQRT(SRC2[31:0])
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSQRTSS __m128 _mm_sqrt_round_ss(__m128 a, __m128 b, int r);
VSQRTSS __m128 _mm_mask_sqrt_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int r);
VSQRTSS __m128 _mm_maskz_sqrt_round_ss( __mmask8 k, __m128 a, __m128 b, int r);
SQRTSS __m128 _mm_sqrt_ss(__m128 a)

SIMD Floating-Point Exceptions


Invalid, Precision, Denormal.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”

SUBPD—Subtract Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5C /r  SUBPD xmm1, xmm2/m128 | A | V/V | SSE2 | Subtract packed double precision floating-point values in xmm2/m128 from xmm1 and store result in xmm1.
VEX.128.66.0F.WIG 5C /r  VSUBPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Subtract packed double precision floating-point values in xmm3/m128 from xmm2 and store result in xmm1.
VEX.256.66.0F.WIG 5C /r  VSUBPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Subtract packed double precision floating-point values in ymm3/m256 from ymm2 and store result in ymm1.
EVEX.128.66.0F.W1 5C /r  VSUBPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Subtract packed double precision floating-point values in xmm3/m128/m64bcst from xmm2 and store result in xmm1 with writemask k1.
EVEX.256.66.0F.W1 5C /r  VSUBPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Subtract packed double precision floating-point values in ymm3/m256/m64bcst from ymm2 and store result in ymm1 with writemask k1.
EVEX.512.66.0F.W1 5C /r  VSUBPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | C | V/V | AVX512F OR AVX10.1 (1) | Subtract packed double precision floating-point values in zmm3/m512/m64bcst from zmm2 and store result in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD subtract of the two, four or eight packed double precision floating-point values of the second
Source operand from the first Source operand, and stores the packed double precision floating-point results in the
destination operand.
VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or an 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand is an YMM register or an 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 64-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper Bits (MAXVL-1:128) of the corresponding
register destination are unmodified.



Operation
VSUBPD (EVEX Encoded Versions When SRC2 Operand is a Vector Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC1[i+63:i] - SRC2[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSUBPD (EVEX Encoded Versions When SRC2 Operand is a Memory Source)


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1)
THEN DEST[i+63:i] := SRC1[i+63:i] - SRC2[63:0];
ELSE DEST[i+63:i] := SRC1[i+63:i] - SRC2[i+63:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSUBPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] - SRC2[63:0]
DEST[127:64] := SRC1[127:64] - SRC2[127:64]
DEST[191:128] := SRC1[191:128] - SRC2[191:128]
DEST[255:192] := SRC1[255:192] - SRC2[255:192]
DEST[MAXVL-1:256] := 0



VSUBPD (VEX.128 Encoded Version)
DEST[63:0] := SRC1[63:0] - SRC2[63:0]
DEST[127:64] := SRC1[127:64] - SRC2[127:64]
DEST[MAXVL-1:128] := 0

SUBPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] - SRC[63:0]
DEST[127:64] := DEST[127:64] - SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSUBPD __m512d _mm512_sub_pd (__m512d a, __m512d b);
VSUBPD __m512d _mm512_mask_sub_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
VSUBPD __m512d _mm512_maskz_sub_pd (__mmask8 k, __m512d a, __m512d b);
VSUBPD __m512d _mm512_sub_round_pd (__m512d a, __m512d b, int);
VSUBPD __m512d _mm512_mask_sub_round_pd (__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VSUBPD __m512d _mm512_maskz_sub_round_pd (__mmask8 k, __m512d a, __m512d b, int);
VSUBPD __m256d _mm256_sub_pd (__m256d a, __m256d b);
VSUBPD __m256d _mm256_mask_sub_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VSUBPD __m256d _mm256_maskz_sub_pd (__mmask8 k, __m256d a, __m256d b);
SUBPD __m128d _mm_sub_pd (__m128d a, __m128d b);
VSUBPD __m128d _mm_mask_sub_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VSUBPD __m128d _mm_maskz_sub_pd (__mmask8 k, __m128d a, __m128d b);
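
Example (informative): the fragment below exercises the legacy _mm_sub_pd intrinsic listed above. It is a minimal sketch assuming <immintrin.h> and SSE2 support, not part of the specification.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(10.0, 1.5);      /* element 1 = 10.0, element 0 = 1.5 */
    __m128d b = _mm_set_pd(4.0, 0.5);
    double out[2];
    _mm_storeu_pd(out, _mm_sub_pd(a, b));   /* per-element subtraction */
    printf("%f %f\n", out[0], out[1]);      /* prints 1.000000 6.000000 */
    return 0;
}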

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



SUBPS—Subtract Packed Single Precision Floating-Point Values
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
NP 0F 5C /r A V/V SSE Subtract packed single precision floating-point values
SUBPS xmm1, xmm2/m128 in xmm2/mem from xmm1 and store result in xmm1.
VEX.128.0F.WIG 5C /r B V/V AVX Subtract packed single precision floating-point values
VSUBPS xmm1,xmm2, xmm3/m128 in xmm3/mem from xmm2 and stores result in xmm1.
VEX.256.0F.WIG 5C /r B V/V AVX Subtract packed single precision floating-point values
VSUBPS ymm1, ymm2, ymm3/m256 in ymm3/mem from ymm2 and stores result in ymm1.
EVEX.128.0F.W0 5C /r C V/V (AVX512VL AND Subtract packed single precision floating-point values
VSUBPS xmm1 {k1}{z}, xmm2, AVX512F) OR from xmm3/m128/m32bcst to xmm2 and stores
xmm3/m128/m32bcst AVX10.1 result in xmm1 with writemask k1.
EVEX.256.0F.W0 5C /r C V/V (AVX512VL AND Subtract packed single precision floating-point values
VSUBPS ymm1 {k1}{z}, ymm2, AVX512F) OR from ymm3/m256/m32bcst to ymm2 and stores
ymm3/m256/m32bcst AVX10.1 result in ymm1 with writemask k1.
EVEX.512.0F.W0 5C /r C V/V AVX512F Subtract packed single precision floating-point values
VSUBPS zmm1 {k1}{z}, zmm2, OR AVX10.1 in zmm3/m512/m32bcst from zmm2 and stores result
zmm3/m512/m32bcst{er} in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD subtract of the packed single precision floating-point values in the second Source operand from
the First Source operand, and stores the packed single precision floating-point results in the destination operand.
VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or an 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand is an YMM register or an 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 32-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper Bits (MAXVL-1:128) of the corresponding
register destination are unmodified.



Operation
VSUBPS (EVEX Encoded Versions When SRC2 Operand is a Vector Register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC1[i+31:i] - SRC2[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VSUBPS (EVEX Encoded Versions When SRC2 Operand is a Memory Source)


(KL, VL) = (4, 128), (8, 256),(16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1)
THEN DEST[i+31:i] := SRC1[i+31:i] - SRC2[31:0];
ELSE DEST[i+31:i] := SRC1[i+31:i] - SRC2[i+31:i];
FI;

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VSUBPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] - SRC2[31:0]
DEST[63:32] := SRC1[63:32] - SRC2[63:32]
DEST[95:64] := SRC1[95:64] - SRC2[95:64]
DEST[127:96] := SRC1[127:96] - SRC2[127:96]
DEST[159:128] := SRC1[159:128] - SRC2[159:128]
DEST[191:160] := SRC1[191:160] - SRC2[191:160]
DEST[223:192] := SRC1[223:192] - SRC2[223:192]
DEST[255:224] := SRC1[255:224] - SRC2[255:224]
DEST[MAXVL-1:256] := 0



VSUBPS (VEX.128 Encoded Version)
DEST[31:0] := SRC1[31:0] - SRC2[31:0]
DEST[63:32] := SRC1[63:32] - SRC2[63:32]
DEST[95:64] := SRC1[95:64] - SRC2[95:64]
DEST[127:96] := SRC1[127:96] - SRC2[127:96]
DEST[MAXVL-1:128] := 0

SUBPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[31:0] - SRC2[31:0]
DEST[63:32] := SRC1[63:32] - SRC2[63:32]
DEST[95:64] := SRC1[95:64] - SRC2[95:64]
DEST[127:96] := SRC1[127:96] - SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSUBPS __m512 _mm512_sub_ps (__m512 a, __m512 b);
VSUBPS __m512 _mm512_mask_sub_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VSUBPS __m512 _mm512_maskz_sub_ps (__mmask16 k, __m512 a, __m512 b);
VSUBPS __m512 _mm512_sub_round_ps (__m512 a, __m512 b, int);
VSUBPS __m512 _mm512_mask_sub_round_ps (__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VSUBPS __m512 _mm512_maskz_sub_round_ps (__mmask16 k, __m512 a, __m512 b, int);
VSUBPS __m256 _mm256_sub_ps (__m256 a, __m256 b);
VSUBPS __m256 _mm256_mask_sub_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VSUBPS __m256 _mm256_maskz_sub_ps (__mmask8 k, __m256 a, __m256 b);
SUBPS __m128 _mm_sub_ps (__m128 a, __m128 b);
VSUBPS __m128 _mm_mask_sub_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VSUBPS __m128 _mm_maskz_sub_ps (__mmask8 k, __m128 a, __m128 b);
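
Example (informative): a minimal C sketch of the packed single precision subtraction using the legacy _mm_sub_ps intrinsic listed above; it assumes <immintrin.h> and SSE support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(8.0f, 6.0f, 4.0f, 2.0f);  /* elements 0..3 = 2, 4, 6, 8 */
    __m128 b = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_sub_ps(a, b));           /* per-element subtraction */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* prints 1 3 5 7 */
    return 0;
}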

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



SUBSD—Subtract Scalar Double Precision Floating-Point Value
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
F2 0F 5C /r A V/V SSE2 Subtract the low double precision floating-point value in
SUBSD xmm1, xmm2/m64 xmm2/m64 from xmm1 and store the result in xmm1.
VEX.LIG.F2.0F.WIG 5C /r B V/V AVX Subtract the low double precision floating-point value in
VSUBSD xmm1,xmm2, xmm3/m64 xmm3/m64 from xmm2 and store the result in xmm1.
EVEX.LLIG.F2.0F.W1 5C /r C V/V AVX512F Subtract the low double precision floating-point value in
VSUBSD xmm1 {k1}{z}, xmm2, OR AVX10.1 xmm3/m64 from xmm2 and store the result in xmm1
xmm3/m64{er} under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Subtracts the low double precision floating-point value in the second source operand from the first source operand
and stores the double precision floating-point result in the low quadword of the destination operand.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VSUBSD is encoded with VEX.L=0. Encoding VSUBSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.



Operation
VSUBSD (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC1[63:0] - SRC2[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VSUBSD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] - SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

SUBSD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] - SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSUBSD __m128d _mm_mask_sub_sd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VSUBSD __m128d _mm_maskz_sub_sd (__mmask8 k, __m128d a, __m128d b);
VSUBSD __m128d _mm_sub_round_sd (__m128d a, __m128d b, int);
VSUBSD __m128d _mm_mask_sub_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VSUBSD __m128d _mm_maskz_sub_round_sd (__mmask8 k, __m128d a, __m128d b, int);
SUBSD __m128d _mm_sub_sd (__m128d a, __m128d b);
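
Example (informative): the sketch below uses the legacy _mm_sub_sd intrinsic listed above to show the scalar merge behavior; it assumes <immintrin.h> and SSE2 support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(8.0, 5.0);     /* element 1 = 8.0, element 0 = 5.0 */
    __m128d b = _mm_set_pd(99.0, 3.0);    /* element 1 of b is ignored */
    double out[2];
    _mm_storeu_pd(out, _mm_sub_sd(a, b)); /* low: 5.0 - 3.0; high copied from a */
    printf("%f %f\n", out[0], out[1]);    /* prints 2.000000 8.000000 */
    return 0;
}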

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



SUBSS—Subtract Scalar Single Precision Floating-Point Value
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
F3 0F 5C /r A V/V SSE Subtract the low single precision floating-point value in
SUBSS xmm1, xmm2/m32 xmm2/m32 from xmm1 and store the result in xmm1.
VEX.LIG.F3.0F.WIG 5C /r B V/V AVX Subtract the low single precision floating-point value in
VSUBSS xmm1,xmm2, xmm3/m32 xmm3/m32 from xmm2 and store the result in xmm1.
EVEX.LLIG.F3.0F.W0 5C /r C V/V AVX512F Subtract the low single precision floating-point value in
VSUBSS xmm1 {k1}{z}, xmm2, OR AVX10.1 xmm3/m32 from xmm2 and store the result in xmm1
xmm3/m32{er} under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Subtracts the low single precision floating-point value in the second source operand from the first source operand
and stores the single precision floating-point result in the low doubleword of the destination operand.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits (127:32) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VSUBSS is encoded with VEX.L=0. Encoding VSUBSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.



Operation
VSUBSS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC1[31:0] - SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

VSUBSS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] - SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

SUBSS (128-bit Legacy SSE Version)


DEST[31:0] := DEST[31:0] - SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VSUBSS __m128 _mm_mask_sub_ss (__m128 s, __mmask8 k, __m128 a, __m128 b);
VSUBSS __m128 _mm_maskz_sub_ss (__mmask8 k, __m128 a, __m128 b);
VSUBSS __m128 _mm_sub_round_ss (__m128 a, __m128 b, int);
VSUBSS __m128 _mm_mask_sub_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VSUBSS __m128 _mm_maskz_sub_round_ss (__mmask8 k, __m128 a, __m128 b, int);
SUBSS __m128 _mm_sub_ss (__m128 a, __m128 b);
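
Example (informative): a minimal C sketch of the scalar single precision subtraction using the legacy _mm_sub_ss intrinsic listed above; it assumes <immintrin.h> and SSE support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);  /* element 0 = 5.0f */
    __m128 b = _mm_set_ss(3.0f);                    /* only element 0 of b is used */
    float out[4];
    _mm_storeu_ps(out, _mm_sub_ss(a, b));           /* low: 5 - 3; upper from a */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* prints 2 6 7 8 */
    return 0;
}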

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



TDPFP16PS—Dot Product of FP16 Tiles Accumulated into Packed Single Precision Tile
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
VEX.128.F2.0F38.W0 5C 11:rrr:bbb A V/N.E. AMX-FP16 Matrix multiply FP16 elements from tmm2 and
TDPFP16PS tmm1, tmm2, tmm3 tmm3, and accumulate the packed single precision
elements in tmm1.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) VEX.vvvv (r) N/A

Description
This instruction performs a set of SIMD dot-products of two FP16 elements and accumulates the results into a
packed single precision tile. Each dword element in input tiles tmm2 and tmm3 is interpreted as a FP16 pair. For
each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD dot-products
on all corresponding FP16 pairs (one pair from tmm2 and one pair from tmm3), adds the results of those dot-prod-
ucts, and then accumulates the result into the corresponding row and column of tmm1.
“Round to nearest even” rounding mode is used when doing each accumulation of the Fused Multiply-Add (FMA).
Output FP32 denormals are always flushed to zero. Input FP16 denormals are always handled and not treated as
zero.
MXCSR is not consulted nor updated.
Any attempt to execute the TDPFP16PS instruction inside an Intel TSX transaction will result in a transaction abort.

Operation
TDPFP16PS tsrcdest, tsrc1, tsrc2
// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)

# src1 and src2 elements are pairs of fp16


elements_src1 := tsrc1.colsb / 4
elements_src2 := tsrc2.colsb / 4
elements_dest := tsrcdest.colsb / 4
elements_temp := tsrcdest.colsb / 2 // Count is in fp16 prior to horizontal

for m in 0 ... tsrcdest.rows-1:


temp1[ 0 ... elements_temp-1 ] := 0
for k in 0 ... elements_src1-1:
for n in 0 ... elements_dest-1:

// For this operation:


// Handle FP16 denorms. Not forcing input FP16 denorms to 0.
// FP32 FMA with DAZ=FTZ=1, RNE rounding.
// MXCSR is neither consulted nor updated.
// No exceptions raised or denoted.

temp1.fp32[2*n+0] += cvt_fp16_to_fp32(tsrc1.row[m].fp16[2*k+0]) * cvt_fp16_to_fp32(tsrc2.row[k].fp16[2*n+0])
temp1.fp32[2*n+1] += cvt_fp16_to_fp32(tsrc1.row[m].fp16[2*k+1]) * cvt_fp16_to_fp32(tsrc2.row[k].fp16[2*n+1])

for n in 0 ... elements_dest-1:


// DAZ=FTZ=1, RNE rounding.
// MXCSR is neither consulted nor updated.

// No exceptions raised or denoted.
tmpf32 := temp1.fp32[2*n] + temp1.fp32[2*n+1]
srcdest.row[m].fp32[n] := srcdest.row[m].fp32[n] + tmpf32
write_row_and_zero(tsrcdest, m, tmp, tsrcdest.colsb)
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tileconfig_start()
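
For reference, the per-element arithmetic above can be condensed into a scalar C model. This is an informative sketch only: it assumes the FP16 inputs have already been widened to float (the cvt_fp16_to_fp32 steps in the pseudocode), and it does not model the per-FMA round-to-nearest-even rounding or the flushing of output denormals; all identifiers are illustrative.

#include <stddef.h>

/* Models one destination element: temp1.fp32[2n] and temp1.fp32[2n+1] are
   accumulated over k, then added together and accumulated into C[m][n]. */
static float tdpfp16ps_element(const float *a_pairs, /* 2*K values: row m of tsrc1 */
                               const float *b_pairs, /* 2*K values: column n of tsrc2 */
                               size_t K,             /* number of FP16 pairs */
                               float c)              /* tsrcdest.row[m].fp32[n] */
{
    float sum0 = 0.0f, sum1 = 0.0f;
    for (size_t k = 0; k < K; k++) {
        sum0 += a_pairs[2*k + 0] * b_pairs[2*k + 0];
        sum1 += a_pairs[2*k + 1] * b_pairs[2*k + 1];
    }
    return c + (sum0 + sum1);
}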

Flags Affected
None.

Exceptions
AMX-E4; see Section 3.6, “Exception Classes” for details.

UCOMISD—Unordered Compare Scalar Double Precision Floating-Point Values and Set EFLAGS
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
66 0F 2E /r A V/V SSE2 Compare low double precision floating-point values in
UCOMISD xmm1, xmm2/m64 xmm1 and xmm2/mem64 and set the EFLAGS flags
accordingly.
VEX.LIG.66.0F.WIG 2E /r A V/V AVX Compare low double precision floating-point values in
VUCOMISD xmm1, xmm2/m64 xmm1 and xmm2/mem64 and set the EFLAGS flags
accordingly.
EVEX.LLIG.66.0F.W1 2E /r B V/V AVX512F Compare low double precision floating-point values in
VUCOMISD xmm1, xmm2/m64{sae} OR AVX10.1 xmm1 and xmm2/m64 and set the EFLAGS flags
accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (r) ModRM:r/m (r) N/A N/A

Description
Performs an unordered compare of the double precision floating-point values in the low quadwords of operand 1
(first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according
to the result (unordered, greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set
to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 64-bit memory location.
The UCOMISD instruction differs from the COMISD instruction in that it signals a SIMD floating-point invalid oper-
ation exception (#I) only when a source operand is an SNaN. The COMISD instruction signals an invalid operation
exception when a source operand is either an SNaN or a QNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VUCOMISD is encoded with VEX.L=0. Encoding VUCOMISD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation
(V)UCOMISD (All Versions)
RESULT := UnorderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }

Intel C/C++ Compiler Intrinsic Equivalent
VUCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae);
UCOMISD int _mm_ucomieq_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomilt_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomile_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomigt_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomige_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomineq_sd(__m128d a, __m128d b);
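
Example (informative): the predicates above return values derived from the EFLAGS settings produced by UCOMISD. A minimal sketch assuming <immintrin.h> and SSE2 support; note that with QNaN operands these unordered predicates do not raise #I.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_sd(1.0);
    __m128d b = _mm_set_sd(2.0);
    printf("%d\n", _mm_ucomilt_sd(a, b));  /* prints 1: 1.0 < 2.0 */
    printf("%d\n", _mm_ucomieq_sd(a, b));  /* prints 0: not equal */
    return 0;
}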

SIMD Floating-Point Exceptions


Invalid (if SNaN operands), Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

UCOMISS—Unordered Compare Scalar Single Precision Floating-Point Values and Set EFLAGS
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
NP 0F 2E /r A V/V SSE Compare low single precision floating-point values in
UCOMISS xmm1, xmm2/m32 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.
VEX.LIG.0F.WIG 2E /r A V/V AVX Compare low single precision floating-point values in
VUCOMISS xmm1, xmm2/m32 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.
EVEX.LLIG.0F.W0 2E /r B V/V AVX512F Compare low single precision floating-point values in
VUCOMISS xmm1, xmm2/m32{sae} OR AVX10.1 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (r) ModRM:r/m (r) N/A N/A

Description
Compares the single precision floating-point values in the low doublewords of operand 1 (first operand) and
operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unor-
dered, greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unor-
dered result is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location.
The UCOMISS instruction differs from the COMISS instruction in that it signals a SIMD floating-point invalid opera-
tion exception (#I) only if a source operand is an SNaN. The COMISS instruction signals an invalid operation excep-
tion when a source operand is either a QNaN or SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VUCOMISS is encoded with VEX.L=0. Encoding VUCOMISS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation
(V)UCOMISS (All Versions)
RESULT := UnorderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }

Intel C/C++ Compiler Intrinsic Equivalent
VUCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae);
UCOMISS int _mm_ucomieq_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomilt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomile_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomigt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomige_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomineq_ss(__m128 a, __m128 b);
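
Example (informative): the predicates above return values derived from the EFLAGS settings produced by UCOMISS. A minimal sketch assuming <immintrin.h> and SSE support; with QNaN operands these unordered predicates do not raise #I.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ss(1.0f);
    __m128 b = _mm_set_ss(2.0f);
    printf("%d\n", _mm_ucomilt_ss(a, b));  /* prints 1: 1.0f < 2.0f */
    printf("%d\n", _mm_ucomieq_ss(a, b));  /* prints 0: not equal */
    return 0;
}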

SIMD Floating-Point Exceptions


Invalid (if SNaN Operands), Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

UIRET—User-Interrupt Return
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
F3 0F 01 EC ZO V/I UINTR Return from handling a user interrupt.
UIRET

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A N/A

Description
UIRET returns from the handling of a user interrupt. It can be executed regardless of CPL.
Execution of UIRET inside a transactional region causes a transactional abort; the abort loads EAX as it would
have been loaded had the abort been due to an execution of IRET.
UIRET can be tracked by Architectural Last Branch Records (LBRs), Intel Processor Trace (Intel PT), and Perfor-
mance Monitoring. For both Intel PT and LBRs, UIRET is recorded in precisely the same manner as IRET. Hence for
LBRs, UIRETs fall into the OTHER_BRANCH category, which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22]
must be set to record user-interrupt delivery, and that the IA32_LBR_x_INFO.BR_TYPE field will indicate
OTHER_BRANCH for any recorded user interrupt. For Intel PT, control flow tracing must be enabled by setting
IA32_RTIT_CTL.BranchEn[bit 13].
UIRET will also increment performance counters for which counting BR_INST_RETIRED.FAR_BRANCH is enabled.

Operation
Pop tempRIP;
Pop tempRFLAGS; // see below for how this is used to load RFLAGS
Pop tempRSP;
IF tempRIP is not canonical in current paging mode
THEN #GP(0);
FI;
IF ShadowStackEnabled(CPL)
THEN
PopShadowStack SSRIP;
IF SSRIP ≠ tempRIP
THEN #CP (FAR-RET/IRET);
FI;
FI;
RIP := tempRIP;
// update in RFLAGS only CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID
RFLAGS := (RFLAGS & ~254DD5H) | (tempRFLAGS & 254DD5H);
RSP := tempRSP;
IF CPUID.(EAX=07H, ECX=01H):EDX.UIRET_UIF[bit 17] = 1
THEN UIF := tempRFLAGS[1];
ELSE UIF := 1;
FI;
Clear any cache-line monitoring established by MONITOR or UMONITOR;
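
For reference, the RFLAGS merge in the pseudocode above can be expressed as a small C model. This is an informative sketch of the bit manipulation only (the mask value 254DD5H is taken from the pseudocode); the function and type names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* 254DD5H selects CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID
   (bits 0, 2, 4, 6, 7, 8, 10, 11, 14, 16, 18, and 21). */
#define UIRET_RFLAGS_MASK 0x254DD5ULL

static uint64_t uiret_merge_rflags(uint64_t rflags, uint64_t popped) {
    return (rflags & ~UIRET_RFLAGS_MASK) | (popped & UIRET_RFLAGS_MASK);
}

int main(void) {
    /* A popped image of all ones changes only the masked flag bits. */
    printf("%llx\n", (unsigned long long)uiret_merge_rflags(0x2, ~0ULL));
    return 0; /* prints 254dd7 */
}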

Flags Affected
See the Operation section.



Protected Mode Exceptions
#UD The UIRET instruction is not recognized in protected mode.

Real-Address Mode Exceptions


#UD The UIRET instruction is not recognized in real-address mode.

Virtual-8086 Mode Exceptions


#UD The UIRET instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions


#UD The UIRET instruction is not recognized in compatibility mode.

64-Bit Mode Exceptions


#GP(0) If the return instruction pointer is non-canonical.
#SS(0) If an attempt to pop a value off the stack causes a non-canonical address to be referenced.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
#CP If return instruction pointer from stack and shadow stack do not match.
#UD If the LOCK prefix is used.
If executed inside an enclave.
If CR4.UINTR = 0.
If CPUID.07H.0H:EDX.UINTR[bit 5] = 0.



UNPCKHPD—Unpack and Interleave High Packed Double Precision Floating-Point Values
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
66 0F 15 /r A V/V SSE2 Unpacks and Interleaves double precision floating-
UNPCKHPD xmm1, xmm2/m128 point values from high quadwords of xmm1 and
xmm2/m128.
VEX.128.66.0F.WIG 15 /r B V/V AVX Unpacks and Interleaves double precision floating-
VUNPCKHPD xmm1,xmm2, point values from high quadwords of xmm2 and
xmm3/m128 xmm3/m128.
VEX.256.66.0F.WIG 15 /r B V/V AVX Unpacks and Interleaves double precision floating-
VUNPCKHPD ymm1,ymm2, point values from high quadwords of ymm2 and
ymm3/m256 ymm3/m256.
EVEX.128.66.0F.W1 15 /r C V/V (AVX512VL AND Unpacks and Interleaves double precision floating-
VUNPCKHPD xmm1 {k1}{z}, xmm2, AVX512F) OR point values from high quadwords of xmm2 and
xmm3/m128/m64bcst AVX10.1 xmm3/m128/m64bcst subject to writemask k1.
EVEX.256.66.0F.W1 15 /r C V/V (AVX512VL AND Unpacks and Interleaves double precision floating-
VUNPCKHPD ymm1 {k1}{z}, ymm2, AVX512F) OR point values from high quadwords of ymm2 and
ymm3/m256/m64bcst AVX10.1 ymm3/m256/m64bcst subject to writemask k1.
EVEX.512.66.0F.W1 15 /r C V/V AVX512F Unpacks and Interleaves double precision floating-
VUNPCKHPD zmm1 {k1}{z}, zmm2, OR AVX10.1 point values from high quadwords of zmm2 and
zmm3/m512/m64bcst zmm3/m512/m64bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an interleaved unpack of the high double precision floating-point values from the first source operand and
the second source operand. See Figure 4-15 in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2B.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch
only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be
enforced.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.

Operation
VUNPCKHPD (EVEX Encoded Versions When SRC2 is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
TMP_DEST[63:0] := SRC1[127:64]
TMP_DEST[127:64] := SRC2[127:64]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[255:192]
TMP_DEST[255:192] := SRC2[255:192]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[383:320]
TMP_DEST[383:320] := SRC2[383:320]
TMP_DEST[447:384] := SRC1[511:448]
TMP_DEST[511:448] := SRC2[511:448]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKHPD (EVEX Encoded Version When SRC2 is Memory)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[63:0] := SRC1[127:64]
TMP_DEST[127:64] := TMP_SRC2[127:64]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[255:192]
TMP_DEST[255:192] := TMP_SRC2[255:192]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[383:320]
TMP_DEST[383:320] := TMP_SRC2[383:320]
TMP_DEST[447:384] := SRC1[511:448]
TMP_DEST[511:448] := TMP_SRC2[511:448]
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKHPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
DEST[191:128] := SRC1[255:192]
DEST[255:192] := SRC2[255:192]
DEST[MAXVL-1:256] := 0

VUNPCKHPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
DEST[MAXVL-1:128] := 0

UNPCKHPD (128-bit Legacy SSE Version)


DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VUNPCKHPD __m512d _mm512_unpackhi_pd( __m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_mask_unpackhi_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_maskz_unpackhi_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m256d _mm256_unpackhi_pd(__m256d a, __m256d b);
VUNPCKHPD __m256d _mm256_mask_unpackhi_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKHPD __m256d _mm256_maskz_unpackhi_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKHPD __m128d _mm_unpackhi_pd(__m128d a, __m128d b);
VUNPCKHPD __m128d _mm_mask_unpackhi_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKHPD __m128d _mm_maskz_unpackhi_pd(__mmask8 k, __m128d a, __m128d b);
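
Example (informative): a minimal C sketch of the high-quadword interleave using the legacy _mm_unpackhi_pd intrinsic listed above; it assumes <immintrin.h> and SSE2 support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(2.0, 1.0);   /* element 1 = 2.0, element 0 = 1.0 */
    __m128d b = _mm_set_pd(4.0, 3.0);
    double out[2];
    _mm_storeu_pd(out, _mm_unpackhi_pd(a, b)); /* {high of a, high of b} */
    printf("%f %f\n", out[0], out[1]);         /* prints 2.000000 4.000000 */
    return 0;
}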

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”

UNPCKHPS—Unpack and Interleave High Packed Single Precision Floating-Point Values
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
NP 0F 15 /r A V/V SSE Unpacks and Interleaves single precision floating-point
UNPCKHPS xmm1, xmm2/m128 values from high quadwords of xmm1 and xmm2/m128.
VEX.128.0F.WIG 15 /r B V/V AVX Unpacks and Interleaves single precision floating-point
VUNPCKHPS xmm1, xmm2, values from high quadwords of xmm2 and xmm3/m128.
xmm3/m128
VEX.256.0F.WIG 15 /r B V/V AVX Unpacks and Interleaves single precision floating-point
VUNPCKHPS ymm1, ymm2, values from high quadwords of ymm2 and ymm3/m256.
ymm3/m256
EVEX.128.0F.W0 15 /r C V/V (AVX512VL AND Unpacks and Interleaves single precision floating-point
VUNPCKHPS xmm1 {k1}{z}, xmm2, AVX512F) OR values from high quadwords of xmm2 and
xmm3/m128/m32bcst AVX10.1 xmm3/m128/m32bcst and write result to xmm1
subject to writemask k1.
EVEX.256.0F.W0 15 /r C V/V (AVX512VL AND Unpacks and Interleaves single precision floating-point
VUNPCKHPS ymm1 {k1}{z}, ymm2, AVX512F) OR values from high quadwords of ymm2 and
ymm3/m256/m32bcst AVX10.1 ymm3/m256/m32bcst and write result to ymm1
subject to writemask k1.
EVEX.512.0F.W0 15 /r C V/V AVX512F Unpacks and Interleaves single precision floating-point
VUNPCKHPS zmm1 {k1}{z}, zmm2, OR AVX10.1 values from high quadwords of zmm2 and
zmm3/m512/m32bcst zmm3/m512/m32bcst and write result to zmm1 subject
to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an interleaved unpack of the high single precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch
only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be
enforced.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The second source operand is an YMM register or an 256-bit memory location. The first
source operand and destination operands are YMM registers.

SRC1 X7 X6 X5 X4 X3 X2 X1 X0

SRC2 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0

DEST Y7 X7 Y6 X6 Y3 X3 Y2 X2

Figure 4-27. VUNPCKHPS Operation

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.

Operation
VUNPCKHPS (EVEX Encoded Version When SRC2 is a Register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
TMP_DEST[31:0] := SRC1[95:64]
TMP_DEST[63:32] := SRC2[95:64]
TMP_DEST[95:64] := SRC1[127:96]
TMP_DEST[127:96] := SRC2[127:96]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[223:192]
TMP_DEST[191:160] := SRC2[223:192]
TMP_DEST[223:192] := SRC1[255:224]
TMP_DEST[255:224] := SRC2[255:224]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[351:320]
TMP_DEST[319:288] := SRC2[351:320]
TMP_DEST[351:320] := SRC1[383:352]
TMP_DEST[383:352] := SRC2[383:352]
TMP_DEST[415:384] := SRC1[479:448]
TMP_DEST[447:416] := SRC2[479:448]
TMP_DEST[479:448] := SRC1[511:480]
TMP_DEST[511:480] := SRC2[511:480]
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKHPS (EVEX Encoded Version When SRC2 is Memory)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[31:0] := SRC1[95:64]
TMP_DEST[63:32] := TMP_SRC2[95:64]
TMP_DEST[95:64] := SRC1[127:96]
TMP_DEST[127:96] := TMP_SRC2[127:96]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[223:192]
TMP_DEST[191:160] := TMP_SRC2[223:192]
TMP_DEST[223:192] := SRC1[255:224]
TMP_DEST[255:224] := TMP_SRC2[255:224]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[351:320]
TMP_DEST[319:288] := TMP_SRC2[351:320]
TMP_DEST[351:320] := SRC1[383:352]
TMP_DEST[383:352] := TMP_SRC2[383:352]
TMP_DEST[415:384] := SRC1[479:448]
TMP_DEST[447:416] := TMP_SRC2[479:448]
TMP_DEST[479:448] := SRC1[511:480]
TMP_DEST[511:480] := TMP_SRC2[511:480]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0

FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKHPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
DEST[159:128] := SRC1[223:192]
DEST[191:160] := SRC2[223:192]
DEST[223:192] := SRC1[255:224]
DEST[255:224] := SRC2[255:224]
DEST[MAXVL-1:256] := 0

VUNPCKHPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
DEST[MAXVL-1:128] := 0

UNPCKHPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VUNPCKHPS __m512 _mm512_unpackhi_ps( __m512 a, __m512 b);
VUNPCKHPS __m512 _mm512_mask_unpackhi_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VUNPCKHPS __m512 _mm512_maskz_unpackhi_ps(__mmask16 k, __m512 a, __m512 b);
VUNPCKHPS __m256 _mm256_unpackhi_ps (__m256 a, __m256 b);
VUNPCKHPS __m256 _mm256_mask_unpackhi_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VUNPCKHPS __m256 _mm256_maskz_unpackhi_ps(__mmask8 k, __m256 a, __m256 b);
UNPCKHPS __m128 _mm_unpackhi_ps (__m128 a, __m128 b);
VUNPCKHPS __m128 _mm_mask_unpackhi_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VUNPCKHPS __m128 _mm_maskz_unpackhi_ps(__mmask8 k, __m128 a, __m128 b);
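
Example (informative): the fragment below demonstrates the interleave pattern of Figure 4-27 with the legacy _mm_unpackhi_ps intrinsic listed above; a minimal sketch assuming <immintrin.h> and SSE support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* elements 0..3 = 1, 2, 3, 4 */
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);  /* elements 0..3 = 5, 6, 7, 8 */
    float out[4];
    _mm_storeu_ps(out, _mm_unpackhi_ps(a, b));      /* high halves interleaved */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* prints 3 7 4 8 */
    return 0;
}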

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”

UNPCKLPD—Unpack and Interleave Low Packed Double Precision Floating-Point Values
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
66 0F 14 /r A V/V SSE2 Unpacks and Interleaves double precision floating-point
UNPCKLPD xmm1, xmm2/m128 values from low quadwords of xmm1 and xmm2/m128.
VEX.128.66.0F.WIG 14 /r B V/V AVX Unpacks and Interleaves double precision floating-point
VUNPCKLPD xmm1,xmm2, values from low quadwords of xmm2 and xmm3/m128.
xmm3/m128
VEX.256.66.0F.WIG 14 /r B V/V AVX Unpacks and Interleaves double precision floating-point
VUNPCKLPD ymm1,ymm2, values from low quadwords of ymm2 and ymm3/m256.
ymm3/m256
EVEX.128.66.0F.W1 14 /r C V/V (AVX512VL AND Unpacks and Interleaves double precision floating-point
VUNPCKLPD xmm1 {k1}{z}, xmm2, AVX512F) OR values from low quadwords of xmm2 and
xmm3/m128/m64bcst AVX10.1 xmm3/m128/m64bcst subject to write mask k1.
EVEX.256.66.0F.W1 14 /r C V/V (AVX512VL AND Unpacks and Interleaves double precision floating-point
VUNPCKLPD ymm1 {k1}{z}, ymm2, AVX512F) OR values from low quadwords of ymm2 and
ymm3/m256/m64bcst AVX10.1 ymm3/m256/m64bcst subject to write mask k1.
EVEX.512.66.0F.W1 14 /r C V/V AVX512F Unpacks and Interleaves double precision floating-point
VUNPCKLPD zmm1 {k1}{z}, zmm2, OR AVX10.1 values from low quadwords of zmm2 and
zmm3/m512/m64bcst zmm3/m512/m64bcst subject to write mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an interleaved unpack of the low double precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch
only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be
enforced.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.

Operation
VUNPCKLPD (EVEX Encoded Versions When SRC2 is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
TMP_DEST[63:0] := SRC1[63:0]
TMP_DEST[127:64] := SRC2[63:0]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[191:128]
TMP_DEST[255:192] := SRC2[191:128]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[319:256]
TMP_DEST[383:320] := SRC2[319:256]
TMP_DEST[447:384] := SRC1[447:384]
TMP_DEST[511:448] := SRC2[447:384]
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKLPD (EVEX Encoded Version When SRC2 is Memory)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[63:0] := SRC1[63:0]
TMP_DEST[127:64] := TMP_SRC2[63:0]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[191:128]
TMP_DEST[255:192] := TMP_SRC2[191:128]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[319:256]
TMP_DEST[383:320] := TMP_SRC2[319:256]
TMP_DEST[447:384] := SRC1[447:384]
TMP_DEST[511:448] := TMP_SRC2[447:384]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKLPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[191:128] := SRC1[191:128]
DEST[255:192] := SRC2[191:128]
DEST[MAXVL-1:256] := 0

VUNPCKLPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[MAXVL-1:128] := 0

UNPCKLPD (128-bit Legacy SSE Version)


DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VUNPCKLPD __m512d _mm512_unpacklo_pd( __m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_mask_unpacklo_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_maskz_unpacklo_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m256d _mm256_unpacklo_pd(__m256d a, __m256d b);
VUNPCKLPD __m256d _mm256_mask_unpacklo_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKLPD __m256d _mm256_maskz_unpacklo_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKLPD __m128d _mm_unpacklo_pd(__m128d a, __m128d b);
VUNPCKLPD __m128d _mm_mask_unpacklo_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKLPD __m128d _mm_maskz_unpacklo_pd(__mmask8 k, __m128d a, __m128d b);
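
Example (informative): a minimal C sketch of the low-quadword interleave using the legacy _mm_unpacklo_pd intrinsic listed above; it assumes <immintrin.h> and SSE2 support.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(2.0, 1.0);   /* element 1 = 2.0, element 0 = 1.0 */
    __m128d b = _mm_set_pd(4.0, 3.0);
    double out[2];
    _mm_storeu_pd(out, _mm_unpacklo_pd(a, b)); /* {low of a, low of b} */
    printf("%f %f\n", out[0], out[1]);         /* prints 1.000000 3.000000 */
    return 0;
}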

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”

UNPCKLPS—Unpack and Interleave Low Packed Single Precision Floating-Point Values
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
NP 0F 14 /r A V/V SSE Unpacks and Interleaves single precision floating-point
UNPCKLPS xmm1, xmm2/m128 values from low quadwords of xmm1 and xmm2/m128.
VEX.128.0F.WIG 14 /r B V/V AVX Unpacks and Interleaves single precision floating-point
VUNPCKLPS xmm1,xmm2, values from low quadwords of xmm2 and xmm3/m128.
xmm3/m128
VEX.256.0F.WIG 14 /r B V/V AVX Unpacks and Interleaves single precision floating-point
VUNPCKLPS values from low quadwords of ymm2 and ymm3/m256.
ymm1,ymm2,ymm3/m256
EVEX.128.0F.W0 14 /r C V/V (AVX512VL AND Unpacks and Interleaves single precision floating-point
VUNPCKLPS xmm1 {k1}{z}, xmm2, AVX512F) OR values from low quadwords of xmm2 and xmm3/mem
xmm3/m128/m32bcst AVX10.1 and write result to xmm1 subject to write mask k1.
EVEX.256.0F.W0 14 /r C V/V (AVX512VL AND Unpacks and Interleaves single precision floating-point
VUNPCKLPS ymm1 {k1}{z}, ymm2, AVX512F) OR values from low quadwords of ymm2 and ymm3/mem
ymm3/m256/m32bcst AVX10.1 and write result to ymm1 subject to write mask k1.
EVEX.512.0F.W0 14 /r C V/V AVX512F Unpacks and Interleaves single precision floating-point
VUNPCKLPS zmm1 {k1}{z}, zmm2, OR AVX10.1 values from low quadwords of zmm2 and
zmm3/m512/m32bcst zmm3/m512/m32bcst and write result to zmm1
subject to write mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an interleaved unpack of the low single precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the
corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an
implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal
segment checking will still be enforced.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128)
of the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.

SRC1 X7 X6 X5 X4 X3 X2 X1 X0

SRC2 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0

DEST Y5 X5 Y4 X4 Y1 X1 Y0 X0

Figure 4-28. VUNPCKLPS Operation

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.

Operation
VUNPCKLPS (EVEX Encoded Version When SRC2 is a Register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
TMP_DEST[31:0] := SRC1[31:0]
TMP_DEST[63:32] := SRC2[31:0]
TMP_DEST[95:64] := SRC1[63:32]
TMP_DEST[127:96] := SRC2[63:32]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[159:128]
TMP_DEST[191:160] := SRC2[159:128]
TMP_DEST[223:192] := SRC1[191:160]
TMP_DEST[255:224] := SRC2[191:160]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[287:256]
TMP_DEST[319:288] := SRC2[287:256]
TMP_DEST[351:320] := SRC1[319:288]
TMP_DEST[383:352] := SRC2[319:288]
TMP_DEST[415:384] := SRC1[415:384]
TMP_DEST[447:416] := SRC2[415:384]
TMP_DEST[479:448] := SRC1[447:416]
TMP_DEST[511:480] := SRC2[447:416]
FI;
FOR j := 0 TO KL-1

i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKLPS (EVEX Encoded Version When SRC2 is Memory)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[31:0] := SRC1[31:0]
TMP_DEST[63:32] := TMP_SRC2[31:0]
TMP_DEST[95:64] := SRC1[63:32]
TMP_DEST[127:96] := TMP_SRC2[63:32]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[159:128]
TMP_DEST[191:160] := TMP_SRC2[159:128]
TMP_DEST[223:192] := SRC1[191:160]
TMP_DEST[255:224] := TMP_SRC2[191:160]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[287:256]
TMP_DEST[319:288] := TMP_SRC2[287:256]
TMP_DEST[351:320] := SRC1[319:288]
TMP_DEST[383:352] := TMP_SRC2[319:288]
TMP_DEST[415:384] := SRC1[415:384]
TMP_DEST[447:416] := TMP_SRC2[415:384]
TMP_DEST[479:448] := SRC1[447:416]
TMP_DEST[511:480] := TMP_SRC2[447:416]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VUNPCKLPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
DEST[159:128] := SRC1[159:128]
DEST[191:160] := SRC2[159:128]
DEST[223:192] := SRC1[191:160]
DEST[255:224] := SRC2[191:160]
DEST[MAXVL-1:256] := 0

VUNPCKLPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
DEST[MAXVL-1:128] := 0

UNPCKLPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VUNPCKLPS __m512 _mm512_unpacklo_ps(__m512 a, __m512 b);
VUNPCKLPS __m512 _mm512_mask_unpacklo_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VUNPCKLPS __m512 _mm512_maskz_unpacklo_ps(__mmask16 k, __m512 a, __m512 b);
VUNPCKLPS __m256 _mm256_unpacklo_ps (__m256 a, __m256 b);
VUNPCKLPS __m256 _mm256_mask_unpacklo_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VUNPCKLPS __m256 _mm256_maskz_unpacklo_ps(__mmask8 k, __m256 a, __m256 b);
UNPCKLPS __m128 _mm_unpacklo_ps (__m128 a, __m128 b);
VUNPCKLPS __m128 _mm_mask_unpacklo_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VUNPCKLPS __m128 _mm_maskz_unpacklo_ps(__mmask8 k, __m128 a, __m128 b);
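
A minimal usage sketch (illustrative, not from the manual; assumes SSE support and <immintrin.h>, with a hypothetical helper name). This low-quadword interleave is the building block of the classic 4x4 matrix transpose:

#include <immintrin.h>

/* result = { a[0], b[0], a[1], b[1] }; matches the low 128 bits of
   Figure 4-28. */
__m128 interleave_lo(__m128 a, __m128 b)
{
    return _mm_unpacklo_ps(a, b);
}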

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”

8. Updates to Chapter 5, Volume 2C
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2C: Instruction Set Reference, V.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Removed erroneous instruction listing.
• Updated the following instructions to add VEX-encoded forms: VCVTNEPS2BF16, VPMADD52HUQ, and
VPMADD52LUQ.
• Added the following instructions:
— VBCSTNEBF162PS
— VBCSTNESH2PS
— VCVTNEEBF162PS
— VCVTNEEPH2PS
— VCVTNEOBF162PS
— VCVTNEOPH2PS
— VSHA512MSG1
— VSHA512MSG2
— VSHA512RNDS2
— VSM3MSG1
— VSM3MSG2
— VSM3RNDS2
— VSM4KEY4
— VSM4RNDS4
— VPDPB[SU,UU,SS]D[,S]
— VPDPW[SU,US,UU]D[,S]
• Updated the VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD instructions with corrections.
• Added Intel® AVX10.1 information to the following instructions:
— VADDPH
— VADDSH
— VALIGND/VALIGNQ
— VBLENDMPD/VBLENDMPS
— VBROADCAST
— VCMPPH
— VCMPSH
— VCOMISH
— VCOMPRESSPD
— VCOMPRESSPS
— VCVTDQ2PH
— VCVTNE2PS2BF16
— VCVTNEPS2BF16
— VCVTPD2PH
— VCVTPD2QQ
— VCVTPD2UDQ
— VCVTPD2UQQ
— VCVTPH2DQ



— VCVTPH2PD
— VCVTPH2PS/VCVTPH2PSX
— VCVTPH2QQ
— VCVTPH2UDQ
— VCVTPH2UQQ
— VCVTPH2UW
— VCVTPH2W
— VCVTPS2PH
— VCVTPS2PHX
— VCVTPS2QQ
— VCVTPS2UDQ
— VCVTPS2UQQ
— VCVTQQ2PD
— VCVTQQ2PH
— VCVTQQ2PS
— VCVTSD2SH
— VCVTSD2USI
— VCVTSH2SD
— VCVTSH2SI
— VCVTSH2SS
— VCVTSH2USI
— VCVTSI2SH
— VCVTSS2SH
— VCVTSS2USI
— VCVTTPD2QQ
— VCVTTPD2UDQ
— VCVTTPD2UQQ
— VCVTTPH2DQ
— VCVTTPH2QQ
— VCVTTPH2UDQ
— VCVTTPH2UQQ
— VCVTTPH2UW
— VCVTTPH2W
— VCVTTPS2QQ
— VCVTTPS2UDQ
— VCVTTPS2UQQ
— VCVTTSD2USI
— VCVTTSH2SI
— VCVTTSH2USI
— VCVTTSS2USI
— VCVTUDQ2PD
— VCVTUDQ2PH
— VCVTUDQ2PS
— VCVTUQQ2PD
— VCVTUQQ2PH
— VCVTUQQ2PS



— VCVTUSI2SD
— VCVTUSI2SS
— VCVTUW2PH
— VCVTW2PH
— VDBPSADBW
— VDIVPH
— VDIVSH
— VDPBF16PS
— VEXPANDPD
— VEXPANDPS
— VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4
— VEXTRACTI128/VEXTRACTI32x4/VEXTRACTI64x2/VEXTRACTI32x8/VEXTRACTI64x4
— VFCMADDCPH/VFMADDCPH
— VFCMADDCSH/VFMADDCSH
— VFCMULCPH/VFMULCPH
— VFCMULCSH/VFMULCSH
— VFIXUPIMMPD
— VFIXUPIMMPS
— VFIXUPIMMSD
— VFIXUPIMMSS
— VFMADD132PD/VFMADD213PD/VFMADD231PD
— VF[,N]MADD[132,213,231]PH
— VFMADD132PS/VFMADD213PS/VFMADD231PS
— VFMADD132SD/VFMADD213SD/VFMADD231SD
— VF[,N]MADD[132,213,231]SH
— VFMADD132SS/VFMADD213SS/VFMADD231SS
— VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD
— VFMADDSUB132PH/VFMADDSUB213PH/VFMADDSUB231PH
— VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS
— VFMSUB132PD/VFMSUB213PD/VFMSUB231PD
— VF[,N]MSUB[132,213,231]PH
— VFMSUB132PS/VFMSUB213PS/VFMSUB231PS
— VFMSUB132SD/VFMSUB213SD/VFMSUB231SD
— VF[,N]MSUB[132,213,231]SH
— VFMSUB132SS/VFMSUB213SS/VFMSUB231SS
— VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD
— VFMSUBADD132PH/VFMSUBADD213PH/VFMSUBADD231PH
— VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS
— VFNMADD132PD/VFNMADD213PD/VFNMADD231PD
— VFNMADD132PS/VFNMADD213PS/VFNMADD231PS
— VFNMADD132SD/VFNMADD213SD/VFNMADD231SD
— VFNMADD132SS/VFNMADD213SS/VFNMADD231SS
— VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD
— VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS
— VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD



— VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS
— VFPCLASSPD
— VFPCLASSPH
— VFPCLASSPS
— VFPCLASSSD
— VFPCLASSSH
— VFPCLASSSS
— VGATHERDPS/VGATHERDPD
— VGATHERQPS/VGATHERQPD
— VGETEXPPD
— VGETEXPPH
— VGETEXPPS
— VGETEXPSD
— VGETEXPSH
— VGETEXPSS
— VGETMANTPD
— VGETMANTPH
— VGETMANTPS
— VGETMANTSD
— VGETMANTSH
— VGETMANTSS
— VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4
— VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4
— VMAXPH
— VMAXSH
— VMINPH
— VMINSH
— VMOVSH
— VMOVW
— VMULPH
— VMULSH
— VPBLENDMB/VPBLENDMW
— VPBLENDMD/VPBLENDMQ
— VPBROADCASTB/W/D/Q
— VPBROADCAST
— VPBROADCASTM
— VPCMPB/VPCMPUB
— VPCMPD/VPCMPUD
— VPCMPQ/VPCMPUQ
— VPCMPW/VPCMPUW
— VPCOMPRESSB/VPCOMPRESSW
— VPCOMPRESSD
— VPCOMPRESSQ
— VPCONFLICTD/Q
— VPDPBUSD
— VPDPBUSDS



— VPDPWSSD
— VPDPWSSDS
— VPERMB
— VPERMD/VPERMW
— VPERMI2B
— VPERMI2W/D/Q/PS/PD
— VPERMILPD
— VPERMILPS
— VPERMPD
— VPERMPS
— VPERMQ
— VPERMT2B
— VPERMT2W/D/Q/PS/PD
— VPEXPANDB/VPEXPANDW
— VPEXPANDD
— VPEXPANDQ
— VPGATHERDD/VPGATHERDQ
— VPGATHERQD/VPGATHERQQ
— VPLZCNTD/Q
— VPMADD52HUQ
— VPMADD52LUQ
— VPMOVB2M/VPMOVW2M/VPMOVD2M/VPMOVQ2M
— VPMOVDB/VPMOVSDB/VPMOVUSDB
— VPMOVDW/VPMOVSDW/VPMOVUSDW
— VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q
— VPMOVQB/VPMOVSQB/VPMOVUSQB
— VPMOVQD/VPMOVSQD/VPMOVUSQD
— VPMOVQW/VPMOVSQW/VPMOVUSQW
— VPMOVWB/VPMOVSWB/VPMOVUSWB
— VPMULTISHIFTQB
— VPOPCNT
— VPROLD/VPROLVD/VPROLQ/VPROLVQ
— VPRORD/VPRORVD/VPRORQ/VPRORVQ
— VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ
— VPSHLD
— VPSHLDV
— VPSHRD
— VPSHRDV
— VPSHUFBITQMB
— VPSLLVW/VPSLLVD/VPSLLVQ
— VPSRAVW/VPSRAVD/VPSRAVQ
— VPSRLVW/VPSRLVD/VPSRLVQ
— VPTERNLOGD/VPTERNLOGQ
— VPTESTMB/VPTESTMW/VPTESTMD/VPTESTMQ
— VPTESTNMB/W/D/Q
— VRANGEPD



— VRANGEPS
— VRANGESD
— VRANGESS
— VRCP14PD
— VRCP14PS
— VRCP14SD
— VRCP14SS
— VRCPPH
— VRCPSH
— VREDUCEPD
— VREDUCEPH
— VREDUCEPS
— VREDUCESD
— VREDUCESH
— VREDUCESS
— VRNDSCALEPD
— VRNDSCALEPH
— VRNDSCALEPS
— VRNDSCALESD
— VRNDSCALESH
— VRNDSCALESS
— VRSQRT14PD
— VRSQRT14PS
— VRSQRT14SD
— VRSQRT14SS
— VRSQRTPH
— VRSQRTSH
— VSCALEFPD
— VSCALEFPH
— VSCALEFPS
— VSCALEFSD
— VSCALEFSH
— VSCALEFSS
— VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD
— VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2
— VSQRTPH
— VSQRTSH
— VSUBPH
— VSUBSH
— VUCOMISH



VADDPH—Add Packed FP16 Values
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.128.NP.MAP5.W0 58 /r VADDPH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1)
    Add packed FP16 values from xmm3/m128/m16bcst to xmm2, and store the result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 58 /r VADDPH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1)
    Add packed FP16 values from ymm3/m256/m16bcst to ymm2, and store the result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 58 /r VADDPH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1 (see Note 1)
    Add packed FP16 values from zmm3/m512/m16bcst to zmm2, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction adds packed FP16 values from the source operands and stores the packed FP16 result in the
destination operand. The destination elements are updated according to the writemask.

Operation
VADDPH (EVEX Encoded Versions) When SRC2 Operand is a Register
VL = 128, 256 or 512
KL := VL/16
IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] + SRC2.fp16[j]
ELSEIF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0



VADDPH (EVEX Encoded Versions) When SRC2 Operand is a Memory Source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] + SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] + SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VADDPH __m128h _mm_add_ph (__m128h a, __m128h b);
VADDPH __m128h _mm_mask_add_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VADDPH __m128h _mm_maskz_add_ph (__mmask8 k, __m128h a, __m128h b);
VADDPH __m256h _mm256_add_ph (__m256h a, __m256h b);
VADDPH __m256h _mm256_mask_add_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VADDPH __m256h _mm256_maskz_add_ph (__mmask16 k, __m256h a, __m256h b);
VADDPH __m512h _mm512_add_ph (__m512h a, __m512h b);
VADDPH __m512h _mm512_mask_add_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VADDPH __m512h _mm512_maskz_add_ph (__mmask32 k, __m512h a, __m512h b);
VADDPH __m512h _mm512_add_round_ph (__m512h a, __m512h b, int rounding);
VADDPH __m512h _mm512_mask_add_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int rounding);
VADDPH __m512h _mm512_maskz_add_round_ph (__mmask32 k, __m512h a, __m512h b, int rounding);
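
A usage sketch (illustrative only; assumes a toolchain with AVX512-FP16 intrinsic support, e.g., compiling with -mavx512fp16; the helper name is hypothetical):

#include <immintrin.h>

/* Merging-masked FP16 add: lanes with k=1 receive a+b, lanes with
   k=0 keep the value from src, as in the Operation section above. */
__m512h masked_add(__m512h src, __mmask32 k, __m512h a, __m512h b)
{
    return _mm512_mask_add_ph(src, k, a, b);
}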

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VADDSH—Add Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.LLIG.F3.MAP5.W0 58 /r VADDSH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1 (see Note 1)
    Add the low FP16 value from xmm3/m16 to xmm2, and store the result in xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction adds the low FP16 value from the source operands and stores the FP16 result in the destination
operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.

Operation
VADDSH (EVEX Encoded Versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
IF k1[0] OR *no writemask*:
DEST.fp16[0] := SRC1.fp16[0] + SRC2.fp16[0]
ELSEIF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VADDSH __m128h _mm_add_round_sh (__m128h a, __m128h b, int rounding);
VADDSH __m128h _mm_mask_add_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int rounding);
VADDSH __m128h _mm_maskz_add_round_sh (__mmask8 k, __m128h a, __m128h b, int rounding);
VADDSH __m128h _mm_add_sh (__m128h a, __m128h b);
VADDSH __m128h _mm_mask_add_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VADDSH __m128h _mm_maskz_add_sh (__mmask8 k, __m128h a, __m128h b);
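
A usage sketch (illustrative only; assumes AVX512-FP16 intrinsic support; the helper name is hypothetical):

#include <immintrin.h>

/* Scalar FP16 add with an explicit rounding override (the {er} form);
   bits 127:16 of the result come from a, upper bits are zeroed. */
__m128h add_low_rne(__m128h a, __m128h b)
{
    return _mm_add_round_sh(a, b, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}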

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.



Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VALIGND/VALIGNQ—Align Doubleword/Quadword Vectors
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.128.66.0F3A.W0 03 /r ib VALIGND xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Shift right and merge vectors xmm2 and xmm3/m128/m32bcst with doubleword granularity using imm8 as number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.128.66.0F3A.W1 03 /r ib VALIGNQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Shift right and merge vectors xmm2 and xmm3/m128/m64bcst with quadword granularity using imm8 as number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.256.66.0F3A.W0 03 /r ib VALIGND ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Shift right and merge vectors ymm2 and ymm3/m256/m32bcst with doubleword granularity using imm8 as number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.256.66.0F3A.W1 03 /r ib VALIGNQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Shift right and merge vectors ymm2 and ymm3/m256/m64bcst with quadword granularity using imm8 as number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.512.66.0F3A.W0 03 /r ib VALIGND zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F OR AVX10.1 (see Note 1)
    Shift right and merge vectors zmm2 and zmm3/m512/m32bcst with doubleword granularity using imm8 as number of elements to shift, and store the final result in zmm1, under writemask.
EVEX.512.66.0F3A.W1 03 /r ib VALIGNQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F OR AVX10.1 (see Note 1)
    Shift right and merge vectors zmm2 and zmm3/m512/m64bcst with quadword granularity using imm8 as number of elements to shift, and store the final result in zmm1, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Concatenates and shifts right doubleword/quadword elements of the first source operand (the second operand)
and the second source operand (the third operand) into a 1024/512/256-bit intermediate vector. The low
512/256/128-bit of the intermediate vector is written to the destination operand (the first operand) using the
writemask k1. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand
can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted
from a 32/64-bit memory location.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous
values (merging-masking) or are set to 0 (zeroing-masking).



Operation
VALIGND (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)

IF (SRC2 *is memory*) AND (EVEX.b = 1)


THEN
FOR j := 0 TO KL-1
i := j * 32
src[i+31:i] := SRC2[31:0]
ENDFOR;
ELSE src := SRC2
FI
; Concatenate sources
tmp[VL-1:0] := src[VL-1:0]
tmp[2VL-1:VL] := SRC1[VL-1:0]
; Shift right doubleword elements
IF VL = 128
THEN SHIFT := imm8[1:0]
ELSE
IF VL = 256
THEN SHIFT := imm8[2:0]
ELSE SHIFT := imm8[3:0]
FI
FI;
tmp[2VL-1:0] := tmp[2VL-1:0] >> (32*SHIFT)
; Apply writemask
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := tmp[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VALIGNQ (EVEX Encoded Versions)


(KL, VL) = (2, 128), (4, 256),(8, 512)
IF (SRC2 *is memory*) AND (EVEX.b = 1)
THEN
FOR j := 0 TO KL-1
i := j * 64
src[i+63:i] := SRC2[63:0]
ENDFOR;
ELSE src := SRC2
FI
; Concatenate sources
tmp[VL-1:0] := src[VL-1:0]
tmp[2VL-1:VL] := SRC1[VL-1:0]
; Shift right quadword elements



IF VL = 128
THEN SHIFT := imm8[0]
ELSE
IF VL = 256
THEN SHIFT := imm8[1:0]
ELSE SHIFT := imm8[2:0]
FI
FI;
tmp[2VL-1:0] := tmp[2VL-1:0] >> (64*SHIFT)
; Apply writemask
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := tmp[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VALIGND __m512i _mm512_alignr_epi32( __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_mask_alignr_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_maskz_alignr_epi32( __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m256i _mm256_mask_alignr_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m256i _mm256_maskz_alignr_epi32( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m128i _mm_mask_alignr_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGND __m128i _mm_maskz_alignr_epi32( __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m512i _mm512_alignr_epi64( __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_mask_alignr_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_maskz_alignr_epi64( __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m256i _mm256_mask_alignr_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m256i _mm256_maskz_alignr_epi64( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m128i _mm_mask_alignr_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m128i _mm_maskz_alignr_epi64( __mmask8 k, __m128i a, __m128i b, int cnt);
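
A usage sketch (illustrative only; assumes AVX512F intrinsic support; the helper name is hypothetical). The two sources are treated as one double-width vector that is shifted right by whole elements:

#include <immintrin.h>

/* result[j] = concat(a:b)[j + 3] for j = 0..15, where b supplies the
   low 16 dwords and a the high 16, matching the Operation section. */
__m512i align_by_3(__m512i a, __m512i b)
{
    return _mm512_alignr_epi32(a, b, 3);
}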

Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”



VBLENDMPD/VBLENDMPS—Blend Float64/Float32 Vectors Using an OpMask Control
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.128.66.0F38.W1 65 /r VBLENDMPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Blend double precision vector xmm2 and double precision vector xmm3/m128/m64bcst and store the result in xmm1, under control mask.
EVEX.256.66.0F38.W1 65 /r VBLENDMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Blend double precision vector ymm2 and double precision vector ymm3/m256/m64bcst and store the result in ymm1, under control mask.
EVEX.512.66.0F38.W1 65 /r VBLENDMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | A | V/V | AVX512F OR AVX10.1 (see Note 1)
    Blend double precision vector zmm2 and double precision vector zmm3/m512/m64bcst and store the result in zmm1, under control mask.
EVEX.128.66.0F38.W0 65 /r VBLENDMPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Blend single precision vector xmm2 and single precision vector xmm3/m128/m32bcst and store the result in xmm1, under control mask.
EVEX.256.66.0F38.W0 65 /r VBLENDMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Blend single precision vector ymm2 and single precision vector ymm3/m256/m32bcst and store the result in ymm1, under control mask.
EVEX.512.66.0F38.W0 65 /r VBLENDMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F OR AVX10.1 (see Note 1)
    Blend single precision vector zmm2 and single precision vector zmm3/m512/m32bcst using k1 as select control and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an element-by-element blending between float64/float32 elements in the first source operand (the
second operand) with the elements in the second source operand (the third operand) using an opmask register as
select control. The blended result is written to the destination register.
The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location.
The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element
selector: every element of the destination is conditionally selected between first source or second source using the
value of the related mask bit (0 for first source operand, 1 for second source operand).
If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.



Operation
VBLENDMPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no controlmask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC2[63:0]
ELSE
DEST[i+63:i] := SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+63:i] := SRC1[i+63:i]
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBLENDMPS (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no controlmask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC2[31:0]
ELSE
DEST[i+31:i] := SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+31:i] := SRC1[i+31:i]
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VBLENDMPD __m512d _mm512_mask_blend_pd(__mmask8 k, __m512d a, __m512d b);
VBLENDMPD __m256d _mm256_mask_blend_pd(__mmask8 k, __m256d a, __m256d b);
VBLENDMPD __m128d _mm_mask_blend_pd(__mmask8 k, __m128d a, __m128d b);
VBLENDMPS __m512 _mm512_mask_blend_ps(__mmask16 k, __m512 a, __m512 b);
VBLENDMPS __m256 _mm256_mask_blend_ps(__mmask8 k, __m256 a, __m256 b);
VBLENDMPS __m128 _mm_mask_blend_ps(__mmask8 k, __m128 a, __m128 b);
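
A usage sketch (illustrative only; assumes AVX512F intrinsic support; the helper name is hypothetical):

#include <immintrin.h>

/* Per-element select: mask bit 0 picks the element of a, mask bit 1
   picks the element of b. For example, select_pd(0xF0, a, b) returns
   { a[0..3], b[4..7] }. */
__m512d select_pd(__mmask8 k, __m512d a, __m512d b)
{
    return _mm512_mask_blend_pd(k, a, b);
}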



SIMD Floating-Point Exceptions
None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VBROADCAST—Load with Broadcast Floating-Point Data
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
VEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1, m32 | A | V/V | AVX
    Broadcast single precision floating-point element in mem to four locations in xmm1.
VEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1, m32 | A | V/V | AVX
    Broadcast single precision floating-point element in mem to eight locations in ymm1.
VEX.256.66.0F38.W0 19 /r VBROADCASTSD ymm1, m64 | A | V/V | AVX
    Broadcast double precision floating-point element in mem to four locations in ymm1.
VEX.256.66.0F38.W0 1A /r VBROADCASTF128 ymm1, m128 | A | V/V | AVX
    Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1.
VEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1, xmm2 | A | V/V | AVX2
    Broadcast the low single precision floating-point element in the source operand to four locations in xmm1.
VEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1, xmm2 | A | V/V | AVX2
    Broadcast low single precision floating-point element in the source operand to eight locations in ymm1.
VEX.256.66.0F38.W0 19 /r VBROADCASTSD ymm1, xmm2 | A | V/V | AVX2
    Broadcast low double precision floating-point element in the source operand to four locations in ymm1.
EVEX.256.66.0F38.W1 19 /r VBROADCASTSD ymm1 {k1}{z}, xmm2/m64 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Broadcast low double precision floating-point element in xmm2/m64 to four locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 19 /r VBROADCASTSD zmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512F OR AVX10.1 (see Note 1)
    Broadcast low double precision floating-point element in xmm2/m64 to eight locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 19 /r VBROADCASTF32X2 ymm1 {k1}{z}, xmm2/m64 | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1)
    Broadcast two single precision floating-point elements in xmm2/m64 to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 19 /r VBROADCASTF32X2 zmm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512DQ OR AVX10.1 (see Note 1)
    Broadcast two single precision floating-point elements in xmm2/m64 to locations in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1 {k1}{z}, xmm2/m32 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Broadcast low single precision floating-point element in xmm2/m32 to all locations in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1 {k1}{z}, xmm2/m32 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Broadcast low single precision floating-point element in xmm2/m32 to all locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 18 /r VBROADCASTSS zmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512F OR AVX10.1 (see Note 1)
    Broadcast low single precision floating-point element in xmm2/m32 to all locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 1A /r VBROADCASTF32X4 ymm1 {k1}{z}, m128 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Broadcast 128 bits of 4 single precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 1A /r VBROADCASTF32X4 zmm1 {k1}{z}, m128 | D | V/V | AVX512F OR AVX10.1 (see Note 1)
    Broadcast 128 bits of 4 single precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W1 1A /r VBROADCASTF64X2 ymm1 {k1}{z}, m128 | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1)
    Broadcast 128 bits of 2 double precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 1A /r VBROADCASTF64X2 zmm1 {k1}{z}, m128 | C | V/V | AVX512DQ OR AVX10.1 (see Note 1)
    Broadcast 128 bits of 2 double precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W0 1B /r VBROADCASTF32X8 zmm1 {k1}{z}, m256 | E | V/V | AVX512DQ OR AVX10.1 (see Note 1)
    Broadcast 256 bits of 8 single precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W1 1B /r VBROADCASTF64X4 zmm1 {k1}{z}, m256 | D | V/V | AVX512F OR AVX10.1 (see Note 1)
    Broadcast 256 bits of 4 double precision floating-point data in mem to locations in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Tuple2 ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Tuple4 ModRM:reg (w) ModRM:r/m (r) N/A N/A
E Tuple8 ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
VBROADCASTSD/VBROADCASTSS/VBROADCASTF128 load floating-point values as one tuple from the source
operand (second operand) in memory and broadcast to all elements of the destination operand (first operand).
VEX256-encoded versions: The destination operand is a YMM register. The source operand is either a 32-bit, 64-
bit, or 128-bit memory location. Register source encodings are reserved and will #UD. Bits (MAXVL-1:256) of the
destination register are zeroed.
EVEX-encoded versions: The destination operand is a ZMM/YMM/XMM register and updated according to the
writemask k1. The source operand is either a 32-bit or 64-bit memory location or the low doubleword/quadword
element of an XMM register.
VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2/VBROADCASTF32X8/VBROADCASTF64X4 load
floating-point values as tuples from the source operand (the second operand) in memory or register and broadcast
to all elements of the destination operand (the first operand). The destination operand is a YMM/ZMM register
updated according to the writemask k1. The source operand is either a register or a 64-bit/128-bit/256-bit memory
location.
VBROADCASTSD, VBROADCASTF128, VBROADCASTF32X4, and VBROADCASTF64X2 are supported only in 256-bit
and 512-bit wide versions. VBROADCASTSS is supported in 128-bit, 256-bit, and 512-bit wide versions.
VBROADCASTF32X8 and VBROADCASTF64X4 are supported only in 512-bit wide versions.
VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF32X8 have 32-bit granularity. VBROADCASTF64X2 and
VBROADCASTF64X4 have 64-bit granularity.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
An attempt to execute VBROADCASTSD or VBROADCASTF128 encoded with VEX.L = 0 causes a #UD exception.



m32 X0

DEST X0 X0 X0 X0 X0 X0 X0 X0

Figure 5-1. VBROADCASTSS Operation (VEX.256 encoded version)

m32 X0

DEST 0 0 0 0 X0 X0 X0 X0

Figure 5-2. VBROADCASTSS Operation (VEX.128-bit version)

m64 X0

DEST X0 X0 X0 X0

Figure 5-3. VBROADCASTSD Operation (VEX.256-bit version)

m128 X0

DEST X0 X0

Figure 5-4. VBROADCASTF128 Operation (VEX.256-bit version)



m256 X0

DEST X0 X0

Figure 5-5. VBROADCASTF64X4 Operation (512-bit version with writemask all 1s)

Operation
VBROADCASTSS (128-bit Version VEX and Legacy)
temp := SRC[31:0]
DEST[31:0] := temp
DEST[63:32] := temp
DEST[95:64] := temp
DEST[127:96] := temp
DEST[MAXVL-1:128] := 0

VBROADCASTSS (VEX.256 Encoded Version)


temp := SRC[31:0]
DEST[31:0] := temp
DEST[63:32] := temp
DEST[95:64] := temp
DEST[127:96] := temp
DEST[159:128] := temp
DEST[191:160] := temp
DEST[223:192] := temp
DEST[255:224] := temp
DEST[MAXVL-1:256] := 0

VBROADCASTSS (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VBROADCASTSD (VEX.256 Encoded Version)
temp := SRC[63:0]
DEST[63:0] := temp
DEST[127:64] := temp
DEST[191:128] := temp
DEST[255:192] := temp
DEST[MAXVL-1:256] := 0

VBROADCASTSD (EVEX Encoded Versions)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTF32x2 (EVEX Encoded Versions)


(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
n := (j mod 2) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTF128 (VEX.256 Encoded Version)


temp := SRC[127:0]
DEST[127:0] := temp
DEST[255:128] := temp
DEST[MAXVL-1:256] := 0



VBROADCASTF32X4 (EVEX Encoded Versions)
(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
n := (j modulo 4) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTF64X2 (EVEX Encoded Versions)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
n := (j modulo 2) * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[n+63:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

VBROADCASTF32X8 (EVEX.U1.512 Encoded Version)


FOR j := 0 TO 15
i := j * 32
n := (j modulo 8) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VBROADCASTF64X4 (EVEX.512 Encoded Version)
FOR j := 0 TO 7
i := j * 64
n := (j modulo 4) * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[n+63:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VBROADCASTF32x2 __m512 _mm512_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m512 _mm512_mask_broadcast_f32x2(__m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x2 __m512 _mm512_maskz_broadcast_f32x2( __mmask16 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m256 _mm256_mask_broadcast_f32x2(__m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_maskz_broadcast_f32x2( __mmask8 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m512 _mm512_mask_broadcast_f32x4(__m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_maskz_broadcast_f32x4( __mmask16 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m256 _mm256_mask_broadcast_f32x4(__m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_maskz_broadcast_f32x4( __mmask8 k, __m128 a);
VBROADCASTF32x8 __m512 _mm512_broadcast_f32x8( __m256 a);
VBROADCASTF32x8 __m512 _mm512_mask_broadcast_f32x8(__m512 s, __mmask16 k, __m256 a);
VBROADCASTF32x8 __m512 _mm512_maskz_broadcast_f32x8( __mmask16 k, __m256 a);
VBROADCASTF64x2 __m512d _mm512_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m512d _mm512_mask_broadcast_f64x2(__m512d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m512d _mm512_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m256d _mm256_mask_broadcast_f64x2(__m256d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x4 __m512d _mm512_broadcast_f64x4( __m256d a);
VBROADCASTF64x4 __m512d _mm512_mask_broadcast_f64x4(__m512d s, __mmask8 k, __m256d a);
VBROADCASTF64x4 __m512d _mm512_maskz_broadcast_f64x4( __mmask8 k, __m256d a);
VBROADCASTSD __m512d _mm512_broadcastsd_pd( __m128d a);
VBROADCASTSD __m512d _mm512_mask_broadcastsd_pd(__m512d s, __mmask8 k, __m128d a);
VBROADCASTSD __m512d _mm512_maskz_broadcastsd_pd(__mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcastsd_pd(__m128d a);
VBROADCASTSD __m256d _mm256_mask_broadcastsd_pd(__m256d s, __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_maskz_broadcastsd_pd( __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcast_sd(double *a);
VBROADCASTSS __m512 _mm512_broadcastss_ps( __m128 a);
VBROADCASTSS __m512 _mm512_mask_broadcastss_ps(__m512 s, __mmask16 k, __m128 a);
VBROADCASTSS __m512 _mm512_maskz_broadcastss_ps( __mmask16 k, __m128 a);
VBROADCASTSS __m256 _mm256_broadcastss_ps(__m128 a);
VBROADCASTSS __m256 _mm256_mask_broadcastss_ps(__m256 s, __mmask8 k, __m128 a);
VBROADCASTSS __m256 _mm256_maskz_broadcastss_ps( __mmask8 k, __m128 a);



VBROADCASTSS __m128 _mm_broadcastss_ps(__m128 a);
VBROADCASTSS __m128 _mm_mask_broadcastss_ps(__m128 s, __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_maskz_broadcastss_ps( __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_broadcast_ss(float *a);
VBROADCASTSS __m256 _mm256_broadcast_ss(float *a);
VBROADCASTF128 __m256 _mm256_broadcast_ps(__m128 * a);
VBROADCASTF128 __m256d _mm256_broadcast_pd(__m128d * a);
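
A usage sketch (illustrative only; assumes AVX and AVX512F intrinsic support; the helper names are hypothetical):

#include <immintrin.h>

/* Scalar broadcast: one float from memory replicated to all 8 lanes. */
__m256 splat8(const float *p)
{
    return _mm256_broadcast_ss(p);
}

/* Tuple broadcast: a 128-bit group of 4 floats replicated to all four
   128-bit positions of a ZMM register (VBROADCASTF32X4). */
__m512 splat_x4(__m128 t)
{
    return _mm512_broadcast_f32x4(t);
}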

Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0 for VBROADCASTSD or VBROADCASTF128.
If EVEX.L’L = 0 for VBROADCASTSD/VBROADCASTF32X2/VBROADCASTF32X4/VBROAD-
CASTF64X2.
If EVEX.L’L < 10b for VBROADCASTF32X8/VBROADCASTF64X4.



VCMPPH—Compare Packed FP16 Values
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.128.NP.0F3A.W0 C2 /r /ib VCMPPH k1{k2}, xmm2, xmm3/m128/m16bcst, imm8 | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1)
    Compare packed FP16 values in xmm3/m128/m16bcst and xmm2 using bits 4:0 of imm8 as a comparison predicate subject to writemask k2, and store the result in mask register k1.
EVEX.256.NP.0F3A.W0 C2 /r /ib VCMPPH k1{k2}, ymm2, ymm3/m256/m16bcst, imm8 | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1)
    Compare packed FP16 values in ymm3/m256/m16bcst and ymm2 using bits 4:0 of imm8 as a comparison predicate subject to writemask k2, and store the result in mask register k1.
EVEX.512.NP.0F3A.W0 C2 /r /ib VCMPPH k1{k2}, zmm2, zmm3/m512/m16bcst {sae}, imm8 | A | V/V | AVX512-FP16 OR AVX10.1 (see Note 1)
    Compare packed FP16 values in zmm3/m512/m16bcst and zmm2 using bits 4:0 of imm8 as a comparison predicate subject to writemask k2, and store the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
This instruction compares packed FP16 values from source operands and stores the result in the destination mask
operand. The comparison predicate operand (immediate byte bits 4:0) specifies the type of comparison performed
on each of the pairs of packed values. The destination elements are updated according to the writemask.

Operation
CASE (imm8 & 0x1F) OF
0: CMP_OPERATOR := EQ_OQ;
1: CMP_OPERATOR := LT_OS;
2: CMP_OPERATOR := LE_OS;
3: CMP_OPERATOR := UNORD_Q;
4: CMP_OPERATOR := NEQ_UQ;
5: CMP_OPERATOR := NLT_US;
6: CMP_OPERATOR := NLE_US;
7: CMP_OPERATOR := ORD_Q;
8: CMP_OPERATOR := EQ_UQ;
9: CMP_OPERATOR := NGE_US;
10: CMP_OPERATOR := NGT_US;
11: CMP_OPERATOR := FALSE_OQ;
12: CMP_OPERATOR := NEQ_OQ;
13: CMP_OPERATOR := GE_OS;
14: CMP_OPERATOR := GT_OS;
15: CMP_OPERATOR := TRUE_UQ;
16: CMP_OPERATOR := EQ_OS;



17: CMP_OPERATOR := LT_OQ;
18: CMP_OPERATOR := LE_OQ;
19: CMP_OPERATOR := UNORD_S;
20: CMP_OPERATOR := NEQ_US;
21: CMP_OPERATOR := NLT_UQ;
22: CMP_OPERATOR := NLE_UQ;
23: CMP_OPERATOR := ORD_S;
24: CMP_OPERATOR := EQ_US;
25: CMP_OPERATOR := NGE_UQ;
26: CMP_OPERATOR := NGT_UQ;
27: CMP_OPERATOR := FALSE_OS;
28: CMP_OPERATOR := NEQ_OS;
29: CMP_OPERATOR := GE_OQ;
30: CMP_OPERATOR := GT_OQ;
31: CMP_OPERATOR := TRUE_US;
ESAC

VCMPPH (EVEX Encoded Versions)


VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k2[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.bit[j] := SRC1.fp16[j] CMP_OPERATOR tsrc2
ELSE
DEST.bit[j] := 0

DEST[MAXKL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCMPPH __mmask8 _mm_cmp_ph_mask (__m128h a, __m128h b, const int imm8);
VCMPPH __mmask8 _mm_mask_cmp_ph_mask (__mmask8 k1, __m128h a, __m128h b, const int imm8);
VCMPPH __mmask16 _mm256_cmp_ph_mask (__m256h a, __m256h b, const int imm8);
VCMPPH __mmask16 _mm256_mask_cmp_ph_mask (__mmask16 k1, __m256h a, __m256h b, const int imm8);
VCMPPH __mmask32 _mm512_cmp_ph_mask (__m512h a, __m512h b, const int imm8);
VCMPPH __mmask32 _mm512_mask_cmp_ph_mask (__mmask32 k1, __m512h a, __m512h b, const int imm8);
VCMPPH __mmask32 _mm512_cmp_round_ph_mask (__m512h a, __m512h b, const int imm8, const int sae);
VCMPPH __mmask32 _mm512_mask_cmp_round_ph_mask (__mmask32 k1, __m512h a, __m512h b, const int imm8, const int sae);
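
A usage sketch (illustrative only; assumes AVX512-FP16 intrinsic support; the helper name is hypothetical). The predicate constants (_CMP_EQ_OQ, _CMP_LT_OS, and so on) encode the imm8 values 0 through 31 listed in the Operation section:

#include <immintrin.h>

/* Bit j of the result is 1 when a.fp16[j] < b.fp16[j] (LT_OS, imm8 = 1). */
__mmask32 less_than(__m512h a, __m512h b)
{
    return _mm512_cmp_ph_mask(a, b, _CMP_LT_OS);
}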

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCMPSH—Compare Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.LLIG.F3.0F3A.W0 C2 /r /ib VCMPSH k1{k2}, xmm2, xmm3/m16 {sae}, imm8 | A | V/V | AVX512-FP16 OR AVX10.1 (see Note 1)
    Compare low FP16 values in xmm3/m16 and xmm2 using bits 4:0 of imm8 as a comparison predicate subject to writemask k2, and store the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
This instruction compares the FP16 values from the lowest element of the source operands and stores the result in
the destination mask operand. The comparison predicate operand (immediate byte bits 4:0) specifies the type of
comparison performed on the pair of packed FP16 values. The low destination bit is updated according to the
writemask. Bits MAXKL-1:1 of the destination operand are zeroed.

Operation
CASE (imm8 & 0x1F) OF
0: CMP_OPERATOR := EQ_OQ;
1: CMP_OPERATOR := LT_OS;
2: CMP_OPERATOR := LE_OS;
3: CMP_OPERATOR := UNORD_Q;
4: CMP_OPERATOR := NEQ_UQ;
5: CMP_OPERATOR := NLT_US;
6: CMP_OPERATOR := NLE_US;
7: CMP_OPERATOR := ORD_Q;
8: CMP_OPERATOR := EQ_UQ;
9: CMP_OPERATOR := NGE_US;
10: CMP_OPERATOR := NGT_US;
11: CMP_OPERATOR := FALSE_OQ;
12: CMP_OPERATOR := NEQ_OQ;
13: CMP_OPERATOR := GE_OS;
14: CMP_OPERATOR := GT_OS;
15: CMP_OPERATOR := TRUE_UQ;
16: CMP_OPERATOR := EQ_OS;
17: CMP_OPERATOR := LT_OQ;
18: CMP_OPERATOR := LE_OQ;
19: CMP_OPERATOR := UNORD_S;
20: CMP_OPERATOR := NEQ_US;
21: CMP_OPERATOR := NLT_UQ;
22: CMP_OPERATOR := NLE_UQ;
23: CMP_OPERATOR := ORD_S;
24: CMP_OPERATOR := EQ_US;
25: CMP_OPERATOR := NGE_UQ;



26: CMP_OPERATOR := NGT_UQ;
27: CMP_OPERATOR := FALSE_OS;
28: CMP_OPERATOR := NEQ_OS;
29: CMP_OPERATOR := GE_OQ;
30: CMP_OPERATOR := GT_OQ;
31: CMP_OPERATOR := TRUE_US;
ESAC

VCMPSH (EVEX Encoded Versions)


IF k2[0] OR *no writemask*:
DEST.bit[0] := SRC1.fp16[0] CMP_OPERATOR SRC2.fp16[0]
ELSE
DEST.bit[0] := 0

DEST[MAXKL-1:1] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCMPSH __mmask8 _mm_cmp_round_sh_mask (__m128h a, __m128h b, const int imm8, const int sae);
VCMPSH __mmask8 _mm_mask_cmp_round_sh_mask (__mmask8 k1, __m128h a, __m128h b, const int imm8, const int sae);
VCMPSH __mmask8 _mm_cmp_sh_mask (__m128h a, __m128h b, const int imm8);
VCMPSH __mmask8 _mm_mask_cmp_sh_mask (__mmask8 k1, __m128h a, __m128h b, const int imm8);
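
A usage sketch (illustrative only; assumes AVX512-FP16 intrinsic support; the helper name is hypothetical):

#include <immintrin.h>

/* Masked scalar compare: bit 0 is the EQ_OQ result of the low elements
   when k2[0] = 1, and 0 otherwise; bits 7:1 are zeroed. */
__mmask8 low_equal(__mmask8 k2, __m128h a, __m128h b)
{
    return _mm_mask_cmp_sh_mask(k2, a, b, _CMP_EQ_OQ);
}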

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VCOMISH—Compare Scalar Ordered FP16 Values and Set EFLAGS
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.LLIG.NP.MAP5.W0 2F /r VCOMISH xmm1, xmm2/m16 {sae} | A | V/V | AVX512-FP16 OR AVX10.1 (see Note 1)
    Compare low FP16 values in xmm1 and xmm2/m16, and set the EFLAGS flags accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (r) ModRM:r/m (r) N/A N/A

Description
This instruction compares the FP16 values in the low word of operand 1 (first operand) and operand 2 (second
operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than,
less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned
if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 16-bit memory location.
The VCOMISH instruction differs from the VUCOMISH instruction in that it signals a SIMD floating-point invalid
operation exception (#I) when a source operand is either a QNaN or SNaN. The VUCOMISH instruction signals an
invalid numeric exception only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. EVEX.vvvv is
reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCOMISH SRC1, SRC2
RESULT := OrderedCompare(SRC1.fp16[0],SRC2.fp16[0])
IF RESULT is UNORDERED:
ZF, PF, CF := 1, 1, 1
ELSE IF RESULT is GREATER_THAN:
ZF, PF, CF := 0, 0, 0
ELSE IF RESULT is LESS_THAN:
ZF, PF, CF := 0, 0, 1
ELSE: // RESULT is EQUALS
ZF, PF, CF := 1, 0, 0

OF, AF, SF := 0, 0, 0

Intel C/C++ Compiler Intrinsic Equivalent


VCOMISH int _mm_comi_round_sh (__m128h a, __m128h b, const int imm8, const int sae);
VCOMISH int _mm_comi_sh (__m128h a, __m128h b, const int imm8);
VCOMISH int _mm_comieq_sh (__m128h a, __m128h b);
VCOMISH int _mm_comige_sh (__m128h a, __m128h b);
VCOMISH int _mm_comigt_sh (__m128h a, __m128h b);
VCOMISH int _mm_comile_sh (__m128h a, __m128h b);
VCOMISH int _mm_comilt_sh (__m128h a, __m128h b);
VCOMISH int _mm_comineq_sh (__m128h a, __m128h b);
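
A usage sketch (illustrative only; assumes AVX512-FP16 intrinsic support; the helper name is hypothetical). The EFLAGS result is consumed directly as a branch condition:

#include <immintrin.h>

/* Returns 1 when the low FP16 element of a compares less than or equal
   to the low element of b (via the ZF/CF settings produced by VCOMISH). */
int low_le(__m128h a, __m128h b)
{
    return _mm_comile_sh(a, b);
}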

SIMD Floating-Point Exceptions
Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCOMPRESSPD—Store Sparse Packed Double Precision Floating-Point Values Into Dense
Memory
Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag
EVEX.128.66.0F38.W1 8A /r VCOMPRESSPD xmm1/m128 {k1}{z}, xmm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Compress packed double precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W1 8A /r VCOMPRESSPD ymm1/m256 {k1}{z}, ymm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1)
    Compress packed double precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W1 8A /r VCOMPRESSPD zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F OR AVX10.1 (see Note 1)
    Compress packed double precision floating-point values from zmm2 using control mask k1 to zmm1/m512.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Compress (store) up to 8 double precision floating-point values from the source operand (the second operand) as
a contiguous vector to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register;
the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.

Operation
VCOMPRESSPD (EVEX Encoded Versions) Store Form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR

VCOMPRESSPD (EVEX Encoded Versions) Reg-Reg Form


(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR
IF *merging-masking*
THEN *DEST[VL-1:k] remains unchanged*
ELSE DEST[VL-1:k] := 0
FI
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCOMPRESSPD __m512d _mm512_mask_compress_pd( __m512d s, __mmask8 k, __m512d a);
VCOMPRESSPD __m512d _mm512_maskz_compress_pd( __mmask8 k, __m512d a);
VCOMPRESSPD void _mm512_mask_compressstoreu_pd( void * d, __mmask8 k, __m512d a);
VCOMPRESSPD __m256d _mm256_mask_compress_pd( __m256d s, __mmask8 k, __m256d a);
VCOMPRESSPD __m256d _mm256_maskz_compress_pd( __mmask8 k, __m256d a);
VCOMPRESSPD void _mm256_mask_compressstoreu_pd( void * d, __mmask8 k, __m256d a);
VCOMPRESSPD __m128d _mm_mask_compress_pd( __m128d s, __mmask8 k, __m128d a);
VCOMPRESSPD __m128d _mm_maskz_compress_pd( __mmask8 k, __m128d a);
VCOMPRESSPD void _mm_mask_compressstoreu_pd( void * d, __mmask8 k, __m128d a);
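As a usage illustration, the following minimal C sketch (assuming a compiler and CPU with AVX-512F support, e.g., built with -mavx512f; array contents and variable names are illustrative) uses the memory-destination form to pack the mask-selected doubles into a contiguous buffer:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    double src[8] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0};
    double out[8] = {0};
    __m512d v = _mm512_loadu_pd(src);
    __mmask8 keep = 0xAA;                 /* select elements 1, 3, 5, 7 */
    /* Memory-destination form: only the four selected doubles are written,
       contiguously, starting at out[0]; out[4..7] are left untouched. */
    _mm512_mask_compressstoreu_pd(out, keep, v);
    for (int i = 0; i < 4; i++)
        printf("%.1f ", out[i]);          /* prints: 1.0 3.0 5.0 7.0 */
    return 0;
}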

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCOMPRESSPS—Store Sparse Packed Single Precision Floating-Point Values Into Dense Memory
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.0F38.W0 8A /r
VCOMPRESSPS xmm1/m128 {k1}{z}, xmm2
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compress packed single precision floating-point values from xmm2 to xmm1/m128 using writemask k1.

EVEX.256.66.0F38.W0 8A /r
VCOMPRESSPS ymm1/m256 {k1}{z}, ymm2
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Compress packed single precision floating-point values from ymm2 to ymm1/m256 using writemask k1.

EVEX.512.66.0F38.W0 8A /r
VCOMPRESSPS zmm1/m512 {k1}{z}, zmm2
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1)
Compress packed single precision floating-point values from zmm2 using control mask k1 to zmm1/m512.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Compress (store) up to 16 single precision floating-point values from the source operand (the second operand) to
the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register; the destination
operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (a partial vector or possibly non-contiguous if less than 16
active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to
the destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.

Operation
VCOMPRESSPS (EVEX Encoded Versions) Store Form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR;

VCOMPRESSPS (EVEX Encoded Versions) Reg-Reg Form


(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR
IF *merging-masking*
THEN *DEST[VL-1:k] remains unchanged*
ELSE DEST[VL-1:k] := 0
FI
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCOMPRESSPS __m512 _mm512_mask_compress_ps( __m512 s, __mmask16 k, __m512 a);
VCOMPRESSPS __m512 _mm512_maskz_compress_ps( __mmask16 k, __m512 a);
VCOMPRESSPS void _mm512_mask_compressstoreu_ps( void * d, __mmask16 k, __m512 a);
VCOMPRESSPS __m256 _mm256_mask_compress_ps( __m256 s, __mmask8 k, __m256 a);
VCOMPRESSPS __m256 _mm256_maskz_compress_ps( __mmask8 k, __m256 a);
VCOMPRESSPS void _mm256_mask_compressstoreu_ps( void * d, __mmask8 k, __m256 a);
VCOMPRESSPS __m128 _mm_mask_compress_ps( __m128 s, __mmask8 k, __m128 a);
VCOMPRESSPS __m128 _mm_maskz_compress_ps( __mmask8 k, __m128 a);
VCOMPRESSPS void _mm_mask_compressstoreu_ps( void * d, __mmask8 k, __m128 a);
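For the register-destination, zeroing form, a minimal sketch (assuming AVX-512F support as above; the mask and data are illustrative):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float src[16];
    for (int i = 0; i < 16; i++) src[i] = (float)i;
    __m512 v = _mm512_loadu_ps(src);
    /* Zeroing form: the selected elements are packed into the low lanes of
       the destination and the remaining upper lanes are zeroed. */
    __m512 packed = _mm512_maskz_compress_ps((__mmask16)0x00F0, v); /* keep 4..7 */
    float out[16];
    _mm512_storeu_ps(out, packed);
    printf("%.0f %.0f %.0f %.0f %.0f\n",
           out[0], out[1], out[2], out[3], out[4]);  /* prints: 4 5 6 7 0 */
    return 0;
}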

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTDQ2PH—Convert Packed Signed Doubleword Integers to Packed FP16 Values
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.NP.MAP5.W0 5B /r
VCVTDQ2PH xmm1{k1}{z}, xmm2/m128/m32bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed signed doubleword integers from xmm2/m128/m32bcst to four packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.NP.MAP5.W0 5B /r
VCVTDQ2PH xmm1{k1}{z}, ymm2/m256/m32bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed signed doubleword integers from ymm2/m256/m32bcst to eight packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.NP.MAP5.W0 5B /r
VCVTDQ2PH ymm1{k1}{z}, zmm2/m512/m32bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert sixteen packed signed doubleword integers from zmm2/m512/m32bcst to sixteen packed FP16 values, and store the result in ymm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts four, eight, or sixteen packed signed doubleword integers in the source operand to four,
eight, or sixteen packed FP16 values in the destination operand.
EVEX encoded versions: The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcast from a 32-bit memory location. The destination operand is a YMM/XMM
register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
If the result of the conversion overflows and MXCSR.OM=0, then a SIMD exception will be raised with OE=1,
PE=1.

Operation
VCVTDQ2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 32

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.dword[0]
ELSE
tsrc := SRC.dword[j]
DEST.fp16[j] := Convert_integer32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTDQ2PH __m256h _mm512_cvt_roundepi32_ph (__m512i a, int rounding);
VCVTDQ2PH __m256h _mm512_mask_cvt_roundepi32_ph (__m256h src, __mmask16 k, __m512i a, int rounding);
VCVTDQ2PH __m256h _mm512_maskz_cvt_roundepi32_ph (__mmask16 k, __m512i a, int rounding);
VCVTDQ2PH __m128h _mm_cvtepi32_ph (__m128i a);
VCVTDQ2PH __m128h _mm_mask_cvtepi32_ph (__m128h src, __mmask8 k, __m128i a);
VCVTDQ2PH __m128h _mm_maskz_cvtepi32_ph (__mmask8 k, __m128i a);
VCVTDQ2PH __m128h _mm256_cvtepi32_ph (__m256i a);
VCVTDQ2PH __m128h _mm256_mask_cvtepi32_ph (__m128h src, __mmask8 k, __m256i a);
VCVTDQ2PH __m128h _mm256_maskz_cvtepi32_ph (__mmask8 k, __m256i a);
VCVTDQ2PH __m256h _mm512_cvtepi32_ph (__m512i a);
VCVTDQ2PH __m256h _mm512_mask_cvtepi32_ph (__m256h src, __mmask16 k, __m512i a);
VCVTDQ2PH __m256h _mm512_maskz_cvtepi32_ph (__mmask16 k, __m512i a);
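A minimal sketch of the unmasked 512-bit form (assuming a compiler with AVX512-FP16 support, e.g., -mavx512fp16; the VCVTPH2PSX intrinsic is used only to widen the FP16 results so they can be printed):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i ints = _mm512_set1_epi32(3);
    __m256h h = _mm512_cvtepi32_ph(ints);  /* 16 int32 -> 16 FP16 (VCVTDQ2PH) */
    __m512 f = _mm512_cvtxph_ps(h);        /* widen for printing (VCVTPH2PSX) */
    float out[16];
    _mm512_storeu_ps(out, f);
    printf("%.1f\n", out[0]);              /* prints: 3.0 */
    return 0;
}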

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F2.0F38.W0 72 /r
VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed single data from xmm2 and xmm3/m128/m32bcst to packed BF16 data in xmm1 with writemask k1.

EVEX.256.F2.0F38.W0 72 /r
VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed single data from ymm2 and ymm3/m256/m32bcst to packed BF16 data in ymm1 with writemask k1.

EVEX.512.F2.0F38.W0 72 /r
VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512F) OR AVX10.1 (see note 1)
Convert packed single data from zmm2 and zmm3/m512/m32bcst to packed BF16 data in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts two SIMD registers of packed single data into a single register of packed BF16 data.
This instruction does not support memory fault suppression.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated. No floating-point exceptions are
generated.

Operation
VCVTNE2PS2BF16 dest, src1, src2
VL = (128, 256, 512)
KL = VL/16

origdest := dest
FOR i := 0 to KL-1:
IF k1[ i ] or *no writemask*:
IF i < KL/2:
IF src2 is memory and evex.b == 1:
t := src2.fp32[0]
ELSE:
t := src2.fp32[ i ]
ELSE:
t := src1.fp32[ i-KL/2]

// See VCVTNEPS2BF16 for definition of convert helper function


dest.word[i] := convert_fp32_to_bfloat16(t)

ELSE IF *zeroing*:
dest.word[ i ] := 0
ELSE: // Merge masking, dest element unchanged

dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTNE2PS2BF16 __m128bh _mm_cvtne2ps_pbh (__m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_mask_cvtne2ps_pbh (__m128bh, __mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_maskz_cvtne2ps_pbh (__mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m256bh _mm256_cvtne2ps_pbh (__m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_mask_cvtne2ps_pbh (__m256bh, __mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m512bh _mm512_cvtne2ps_pbh (__m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_mask_cvtne2ps_pbh (__m512bh, __mmask32, __m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32, __m512, __m512);
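A minimal sketch of the two-source form (assuming AVX512_BF16 compiler support, e.g., -mavx512bf16; per the Operation section above, the second source supplies the low half of the BF16 result, so it is the second intrinsic argument):

#include <immintrin.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    __m512 a = _mm512_set1_ps(2.0f);   /* src1: becomes BF16 elements 16..31 */
    __m512 b = _mm512_set1_ps(1.0f);   /* src2: becomes BF16 elements 0..15 */
    __m512bh packed = _mm512_cvtne2ps_pbh(a, b);
    uint16_t raw[32];
    memcpy(raw, &packed, sizeof raw);  /* reinterpret the BF16 bits for inspection */
    printf("0x%04X 0x%04X\n", raw[0], raw[16]);  /* 0x3F80 (1.0), 0x4000 (2.0) */
    return 0;
}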

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”

VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

VEX.128.F3.0F38.W0 72 /r
VCVTNEPS2BF16 xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX-NE-CONVERT
Convert packed single precision floating-point values from xmm2/m128 to packed BF16 values and store in xmm1.

VEX.256.F3.0F38.W0 72 /r
VCVTNEPS2BF16 xmm1, ymm2/m256
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX-NE-CONVERT
Convert packed single precision floating-point values from ymm2/m256 to packed BF16 values and store in xmm1.

EVEX.128.F3.0F38.W0 72 /r
VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst
Op/En: B; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed single data from xmm2/m128 to packed BF16 data in xmm1 with writemask k1.

EVEX.256.F3.0F38.W0 72 /r
VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst
Op/En: B; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed single data from ymm2/m256 to packed BF16 data in xmm1 with writemask k1.

EVEX.512.F3.0F38.W0 72 /r
VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst
Op/En: B; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512F) OR AVX10.1 (see note 1)
Convert packed single data from zmm2/m512 to packed BF16 data in ymm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction loads packed FP32 elements from a SIMD register or memory, converts the elements to BF16, and
writes the result to the destination SIMD register.
The upper bits of the destination register beyond the down-converted BF16 elements are zeroed.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated.
As the instruction operand encoding table shows, the EVEX.vvvv field is not used for encoding an operand.
EVEX.vvvv is reserved and must be 0b1111 otherwise instructions will #UD.

Operation
Define convert_fp32_to_bfloat16(x):
IF x is zero or denormal:
dest[15] := x[31] // sign preserving zero (denormals go to zero)
dest[14:0] := 0
ELSE IF x is infinity:
dest[15:0] := x[31:16]
ELSE IF x is NAN:
dest[15:0] := x[31:16] // truncate and set MSB of the mantissa to force QNAN
dest[6] := 1
ELSE // normal number

LSB := x[16]
rounding_bias := 0x00007FFF + LSB
temp[31:0] := x[31:0] + rounding_bias // integer add
dest[15:0] := temp[31:16]
RETURN dest

VCVTNEPS2BF16 dest, src (VEX encoded version)


VL = (128, 256)
KL = VL/16

FOR i := 0 to KL/2-1:
t := src.fp32[i]
dest.word[i] := convert_fp32_to_bfloat16(t)

DEST[MAXVL-1:VL/2] := 0

VCVTNEPS2BF16 dest, src (EVEX encoded version)


VL = (128, 256, 512)
KL = VL/16

origdest := dest
FOR i := 0 to KL/2-1:
IF k1[ i ] or *no writemask*:
IF src is memory and evex.b == 1:
t := src.fp32[0]
ELSE:
t := src.fp32[ i ]

dest.word[i] := convert_fp32_to_bfloat16(t)

ELSE IF *zeroing*:
dest.word[ i ] := 0
ELSE: // Merge masking, dest element unchanged
dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTNEPS2BF16 __m128bh _mm_cvtneps_avx_pbh (__m128 __A);
VCVTNEPS2BF16 __m128bh _mm256_cvtneps_avx_pbh (__m256 __A);
VCVTNEPS2BF16 __m128bh _mm_cvtneps_pbh (__m128 a);
VCVTNEPS2BF16 __m128bh _mm_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m128 a);
VCVTNEPS2BF16 __m128bh _mm_maskz_cvtneps_pbh (__mmask8 k, __m128 a);
VCVTNEPS2BF16 __m128bh _mm256_cvtneps_pbh (__m256 a);
VCVTNEPS2BF16 __m128bh _mm256_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m256 a);
VCVTNEPS2BF16 __m128bh _mm256_maskz_cvtneps_pbh (__mmask8 k, __m256 a);
VCVTNEPS2BF16 __m256bh _mm512_cvtneps_pbh (__m512 a);
VCVTNEPS2BF16 __m256bh _mm512_mask_cvtneps_pbh (__m256bh src, __mmask16 k, __m512 a);
VCVTNEPS2BF16 __m256bh _mm512_maskz_cvtneps_pbh (__mmask16 k, __m512 a);
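The rounding-bias step in the convert_fp32_to_bfloat16 helper above can be mirrored in scalar C; a minimal sketch covering only the normal-number path (the zero, denormal, infinity, and NaN cases in the pseudocode are assumed away here):

#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 with round-to-nearest-even, for normal inputs only,
   following the integer-add bias trick in the Operation section. */
static uint16_t fp32_to_bf16_normal(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                    /* reinterpret the FP32 bits */
    uint32_t lsb = (x >> 16) & 1;                /* LSB of the surviving mantissa */
    uint32_t rounding_bias = 0x00007FFFu + lsb;  /* ties round to even */
    x += rounding_bias;                          /* integer add, as specified */
    return (uint16_t)(x >> 16);                  /* sign, exponent, top 7 mantissa bits */
}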

SIMD Floating-Point Exceptions


None.



Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”



VCVTPD2PH—Convert Packed Double Precision FP Values to Packed FP16 Values
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.MAP5.W1 5A /r
VCVTPD2PH xmm1{k1}{z}, xmm2/m128/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.66.MAP5.W1 5A /r
VCVTPD2PH xmm1{k1}{z}, ymm2/m256/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.66.MAP5.W1 5A /r
VCVTPD2PH xmm1{k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight packed FP16 values, and store the result in xmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts two, four, or eight packed double precision floating-point values in the source operand
(second operand) to two, four, or eight packed FP16 values in the destination operand (first operand). When a
conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or
the embedded rounding control bits.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcast from a 64-bit memory location. The destination operand is an XMM register
conditionally updated with writemask k1. The upper bits (MAXVL-1:128/64/32) of the corresponding destination
are zeroed.
EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
This instruction uses MXCSR.DAZ for handling FP64 inputs. FP16 outputs can be normal or denormal, and are not
conditionally flushed to zero.

Operation
VCVTPD2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 64

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.double[0]
ELSE
tsrc := SRC.double[j]
DEST.fp16[j] := Convert_fp64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL/4] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPD2PH __m128h _mm512_cvt_roundpd_ph (__m512d a, int rounding);
VCVTPD2PH __m128h _mm512_mask_cvt_roundpd_ph (__m128h src, __mmask8 k, __m512d a, int rounding);
VCVTPD2PH __m128h _mm512_maskz_cvt_roundpd_ph (__mmask8 k, __m512d a, int rounding);
VCVTPD2PH __m128h _mm_cvtpd_ph (__m128d a);
VCVTPD2PH __m128h _mm_mask_cvtpd_ph (__m128h src, __mmask8 k, __m128d a);
VCVTPD2PH __m128h _mm_maskz_cvtpd_ph (__mmask8 k, __m128d a);
VCVTPD2PH __m128h _mm256_cvtpd_ph (__m256d a);
VCVTPD2PH __m128h _mm256_mask_cvtpd_ph (__m128h src, __mmask8 k, __m256d a);
VCVTPD2PH __m128h _mm256_maskz_cvtpd_ph (__mmask8 k, __m256d a);
VCVTPD2PH __m128h _mm512_cvtpd_ph (__m512d a);
VCVTPD2PH __m128h _mm512_mask_cvtpd_ph (__m128h src, __mmask8 k, __m512d a);
VCVTPD2PH __m128h _mm512_maskz_cvtpd_ph (__mmask8 k, __m512d a);
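A minimal sketch of the 512-bit form (assuming -mavx512fp16; the VCVTPH2PD intrinsic is used only to widen the eight FP16 results back so they can be printed):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d d = _mm512_set1_pd(0.25);
    __m128h h = _mm512_cvtpd_ph(d);     /* eight FP64 -> eight FP16 in an XMM */
    __m512d back = _mm512_cvtph_pd(h);  /* widen again for printing */
    double out[8];
    _mm512_storeu_pd(out, back);
    printf("%.2f\n", out[0]);           /* prints: 0.25 */
    return 0;
}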

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTPD2QQ—Convert Packed Double Precision Floating-Point Values to Packed Quadword
Integers
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.0F.W1 7B /r
VCVTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 with writemask k1.

EVEX.256.66.0F.W1 7B /r
VCVTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 with writemask k1.

EVEX.512.66.0F.W1 7B /r
VCVTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512DQ OR AVX10.1 (see note 1)
Convert eight packed double precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed double precision floating-point values in the source operand (second operand) to packed quad-
word integers in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPD2QQ (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
ELSE

IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPD2QQ (EVEX Encoded Version) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[63:0])
ELSE
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPD2QQ __m512i _mm512_cvtpd_epi64( __m512d a);
VCVTPD2QQ __m512i _mm512_mask_cvtpd_epi64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_maskz_cvtpd_epi64( __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_cvt_roundpd_epi64( __m512d a, int r);
VCVTPD2QQ __m512i _mm512_mask_cvt_roundpd_epi64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m512i _mm512_maskz_cvt_roundpd_epi64( __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m256i _mm256_mask_cvtpd_epi64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2QQ __m256i _mm256_maskz_cvtpd_epi64( __mmask8 k, __m256d a);
VCVTPD2QQ __m128i _mm_mask_cvtpd_epi64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2QQ __m128i _mm_maskz_cvtpd_epi64( __mmask8 k, __m128d a);
VCVTPD2QQ __m256i _mm256_cvtpd_epi64 (__m256d src)
VCVTPD2QQ __m128i _mm_cvtpd_epi64 (__m128d src)
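A minimal sketch of the embedded-rounding form (assuming AVX512DQ support, e.g., -mavx512dq; the {er} override applies only to this instruction and does not change MXCSR):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d v = _mm512_set1_pd(2.7);
    /* Embedded rounding: round toward negative infinity, suppress exceptions. */
    __m512i q = _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    long long out[8];
    _mm512_storeu_si512(out, q);
    printf("%lld\n", out[0]);   /* prints: 2 */
    return 0;
}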

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTPD2UDQ—Convert Packed Double Precision Floating-Point Values to Packed Unsigned
Doubleword Integers
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.0F.W1 79 /r
VCVTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 subject to writemask k1.

EVEX.256.0F.W1 79 /r
VCVTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 subject to writemask k1.

EVEX.512.0F.W1 79 /r
VCVTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1)
Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed double precision floating-point values in the source operand (the second operand) to packed
unsigned doubleword integers in the destination operand (the first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is
returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcast from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPD2UDQ (EVEX Encoded Versions) When SRC2 Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
k := j * 64

IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPD2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPD2UDQ __m256i _mm512_cvtpd_epu32( __m512d a);
VCVTPD2UDQ __m256i _mm512_mask_cvtpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_maskz_cvtpd_epu32( __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_cvt_roundpd_epu32( __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_mask_cvt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_maskz_cvt_roundpd_epu32( __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m128i _mm256_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm256_maskz_cvtpd_epu32( __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UDQ __m128i _mm_maskz_cvtpd_epu32( __mmask8 k, __m128d a);
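A minimal sketch of the narrowing 512-bit form (assuming -mavx512f; note that the eight 32-bit results occupy a YMM register):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    __m512d v = _mm512_set1_pd(7.5);
    __m256i u = _mm512_cvtpd_epu32(v);  /* eight FP64 -> eight uint32 */
    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out, u);
    printf("%u\n", (unsigned)out[0]);   /* prints: 8 (7.5 rounds to even under default RNE) */
    return 0;
}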

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTPD2UQQ—Convert Packed Double Precision Floating-Point Values to Packed Unsigned
Quadword Integers
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.0F.W1 79 /r
VCVTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert two packed double precision floating-point values from xmm2/mem to two packed unsigned quadword integers in xmm1 with writemask k1.

EVEX.256.66.0F.W1 79 /r
VCVTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert four packed double precision floating-point values from ymm2/mem to four packed unsigned quadword integers in ymm1 with writemask k1.

EVEX.512.66.0F.W1 79 /r
VCVTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512DQ OR AVX10.1 (see note 1)
Convert eight packed double precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed double precision floating-point values in the source operand (second operand) to packed unsigned
quadword integers in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is
returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand
is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
ELSE

IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPD2UQQ __m512i _mm512_cvtpd_epu64( __m512d a);
VCVTPD2UQQ __m512i _mm512_mask_cvtpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_maskz_cvtpd_epu64( __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_cvt_roundpd_epu64( __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_mask_cvt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_maskz_cvt_roundpd_epu64( __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m256i _mm256_mask_cvtpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2UQQ __m256i _mm256_maskz_cvtpd_epu64( __mmask8 k, __m256d a);
VCVTPD2UQQ __m128i _mm_mask_cvtpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UQQ __m128i _mm_maskz_cvtpd_epu64( __mmask8 k, __m128d a);
VCVTPD2UQQ __m256i _mm256_cvtpd_epu64 (__m256d src)
VCVTPD2UQQ __m128i _mm_cvtpd_epu64 (__m128d src)
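A minimal sketch of the 128-bit form showing the masked-invalid result described above (assuming AVX512DQ and AVX512VL support, e.g., -mavx512dq -mavx512vl, and the MXCSR default of masked exceptions):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    __m128d v = _mm_set_pd(-1.0, 2.5);  /* element 0 = 2.5, element 1 = -1.0 */
    __m128i u = _mm_cvtpd_epu64(v);
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, u);
    /* 2.5 rounds to 2 (nearest even); -1.0 is unrepresentable as unsigned,
       so with the invalid exception masked the result is 2^64 - 1. */
    printf("%llu %llu\n", (unsigned long long)out[0], (unsigned long long)out[1]);
    return 0;                           /* prints: 2 18446744073709551615 */
}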

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTPH2DQ—Convert Packed FP16 Values to Signed Doubleword Integers
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.MAP5.W0 5B /r
VCVTPH2DQ xmm1{k1}{z}, xmm2/m64/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed FP16 values in xmm2/m64/m16bcst to four signed doubleword integers, and store the result in xmm1 subject to writemask k1.

EVEX.256.66.MAP5.W0 5B /r
VCVTPH2DQ ymm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed doubleword integers, and store the result in ymm1 subject to writemask k1.

EVEX.512.66.MAP5.W0 5B /r
VCVTPH2DQ zmm1{k1}{z}, ymm2/m256/m16bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen signed doubleword integers, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed doubleword integers in the
destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.



Operation
VCVTPH2DQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 32

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_integer32(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2DQ __m512i _mm512_cvt_roundph_epi32 (__m256h a, int rounding);
VCVTPH2DQ __m512i _mm512_mask_cvt_roundph_epi32 (__m512i src, __mmask16 k, __m256h a, int rounding);
VCVTPH2DQ __m512i _mm512_maskz_cvt_roundph_epi32 (__mmask16 k, __m256h a, int rounding);
VCVTPH2DQ __m128i _mm_cvtph_epi32 (__m128h a);
VCVTPH2DQ __m128i _mm_mask_cvtph_epi32 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2DQ __m128i _mm_maskz_cvtph_epi32 (__mmask8 k, __m128h a);
VCVTPH2DQ __m256i _mm256_cvtph_epi32 (__m128h a);
VCVTPH2DQ __m256i _mm256_mask_cvtph_epi32 (__m256i src, __mmask8 k, __m128h a);
VCVTPH2DQ __m256i _mm256_maskz_cvtph_epi32 (__mmask8 k, __m128h a);
VCVTPH2DQ __m512i _mm512_cvtph_epi32 (__m256h a);
VCVTPH2DQ __m512i _mm512_mask_cvtph_epi32 (__m512i src, __mmask16 k, __m256h a);
VCVTPH2DQ __m512i _mm512_maskz_cvtph_epi32 (__mmask16 k, __m256h a);
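A minimal sketch of the 128-bit form (assuming -mavx512fp16 -mavx512vl and compiler support for the _Float16 type):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set1_ph((_Float16)2.5f);
    __m128i d = _mm_cvtph_epi32(h);  /* low four FP16 -> four int32 */
    int out[4];
    _mm_storeu_si128((__m128i *)out, d);
    printf("%d\n", out[0]);          /* prints: 2 (ties round to even by default) */
    return 0;
}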

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPH2PD—Convert Packed FP16 Values to FP64 Values
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.NP.MAP5.W0 5A /r
VCVTPH2PD xmm1{k1}{z}, xmm2/m32/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed FP16 values in xmm2/m32/m16bcst to FP64 values, and store result in xmm1 subject to writemask k1.

EVEX.256.NP.MAP5.W0 5A /r
VCVTPH2PD ymm1{k1}{z}, xmm2/m64/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert packed FP16 values in xmm2/m64/m16bcst to FP64 values, and store result in ymm1 subject to writemask k1.

EVEX.512.NP.MAP5.W0 5A /r
VCVTPH2PD zmm1{k1}{z}, xmm2/m128/m16bcst {sae}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert packed FP16 values in xmm2/m128/m16bcst to FP64 values, and store result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Quarter ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values to FP64 values in the destination register. The destination elements
are updated according to the writemask.
This instruction handles both normal and denormal FP16 inputs.

Operation
VCVTPH2PD DEST, SRC
VL = 128, 256, or 512
KL := VL/64

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.fp64[j] := Convert_fp16_to_fp64(tsrc)
ELSE IF *zeroing*:
DEST.fp64[j] := 0
// else dest.fp64[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VCVTPH2PD __m512d _mm512_cvt_roundph_pd (__m128h a, int sae);
VCVTPH2PD __m512d _mm512_mask_cvt_roundph_pd (__m512d src, __mmask8 k, __m128h a, int sae);
VCVTPH2PD __m512d _mm512_maskz_cvt_roundph_pd (__mmask8 k, __m128h a, int sae);
VCVTPH2PD __m128d _mm_cvtph_pd (__m128h a);
VCVTPH2PD __m128d _mm_mask_cvtph_pd (__m128d src, __mmask8 k, __m128h a);
VCVTPH2PD __m128d _mm_maskz_cvtph_pd (__mmask8 k, __m128h a);
VCVTPH2PD __m256d _mm256_cvtph_pd (__m128h a);
VCVTPH2PD __m256d _mm256_mask_cvtph_pd (__m256d src, __mmask8 k, __m128h a);
VCVTPH2PD __m256d _mm256_maskz_cvtph_pd (__mmask8 k, __m128h a);
VCVTPH2PD __m512d _mm512_cvtph_pd (__m128h a);
VCVTPH2PD __m512d _mm512_mask_cvtph_pd (__m512d src, __mmask8 k, __m128h a);
VCVTPH2PD __m512d _mm512_maskz_cvtph_pd (__mmask8 k, __m128h a);
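A minimal sketch of the 512-bit form (assuming -mavx512fp16 and _Float16 support, as in the other AVX512-FP16 examples):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set1_ph((_Float16)1.5f);
    __m512d d = _mm512_cvtph_pd(h);  /* low eight FP16 -> eight FP64 */
    double out[8];
    _mm512_storeu_pd(out, d);
    printf("%.1f\n", out[0]);        /* prints: 1.5 */
    return 0;
}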

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPH2PS/VCVTPH2PSX—Convert Packed FP16 Values to Single Precision Floating-Point
Values
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

VEX.128.66.0F38.W0 13 /r
VCVTPH2PS xmm1, xmm2/m64
Op/En: A; 64/32-bit Mode: V/V; CPUID: F16C
Convert four packed FP16 values in xmm2/m64 to packed single precision floating-point values in xmm1.

VEX.256.66.0F38.W0 13 /r
VCVTPH2PS ymm1, xmm2/m128
Op/En: A; 64/32-bit Mode: V/V; CPUID: F16C
Convert eight packed FP16 values in xmm2/m128 to packed single precision floating-point values in ymm1.

EVEX.128.66.0F38.W0 13 /r
VCVTPH2PS xmm1 {k1}{z}, xmm2/m64
Op/En: B; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert four packed FP16 values in xmm2/m64 to packed single precision floating-point values in xmm1 subject to writemask k1.

EVEX.256.66.0F38.W0 13 /r
VCVTPH2PS ymm1 {k1}{z}, xmm2/m128
Op/En: B; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert eight packed FP16 values in xmm2/m128 to packed single precision floating-point values in ymm1 subject to writemask k1.

EVEX.512.66.0F38.W0 13 /r
VCVTPH2PS zmm1 {k1}{z}, ymm2/m256 {sae}
Op/En: B; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1)
Convert sixteen packed FP16 values in ymm2/m256 to packed single precision floating-point values in zmm1 subject to writemask k1.

EVEX.128.66.MAP6.W0 13 /r
VCVTPH2PSX xmm1{k1}{z}, xmm2/m64/m16bcst
Op/En: C; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed FP16 values in xmm2/m64/m16bcst to four packed single precision floating-point values, and store result in xmm1 subject to writemask k1.

EVEX.256.66.MAP6.W0 13 /r
VCVTPH2PSX ymm1{k1}{z}, xmm2/m128/m16bcst
Op/En: C; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed FP16 values in xmm2/m128/m16bcst to eight packed single precision floating-point values, and store result in ymm1 subject to writemask k1.

EVEX.512.66.MAP6.W0 13 /r
VCVTPH2PSX zmm1{k1}{z}, ymm2/m256/m16bcst {sae}
Op/En: C; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen packed single precision floating-point values, and store result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Half Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed half precision (16-bits) floating-point values in the low-order bits of the source
operand (the second operand) to packed single precision floating-point values and writes the converted values into
the destination operand (the first operand).
In case of a denormal operand, the correct normal result is returned. MXCSR.DAZ is ignored and is treated as if it
were 0. No denormal exception is reported on MXCSR.
VEX.128 version: The source operand is a XMM register or 64-bit memory location. The destination operand is a
XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX.256 version: The source operand is a XMM register or 128-bit memory location. The destination operand is a
YMM register. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64-bits) register or a 256/128/64-bit
memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
The diagram below illustrates how data is converted from four packed half precision (in 64 bits) to four single preci-
sion (in 128 bits) floating-point values.
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).

[Figure 5-6. VCVTPH2PS (128-bit Version): each of the four FP16 values VH0..VH3 in the low 64 bits of xmm2/mem64 is converted to the corresponding single precision value VS0..VS3 in xmm1.]

The VCVTPH2PSX instruction is a new form of the PH to PS conversion instruction, encoded in map 6. The previous
version of the instruction, VCVTPH2PS, which is present in AVX512F (encoded in map 2, 0F38), does not support
embedded broadcasting. The VCVTPH2PSX instruction has the embedded broadcasting option available.
The instructions associated with AVX512_FP16 always handle FP16 denormal number inputs; denormal inputs are
not treated as zero.

Operation
vCvt_h2s(SRC1[15:0])
{
RETURN Cvt_Half_Precision_To_Single_Precision(SRC1[15:0]);
}

VCVTPH2PS (EVEX Encoded Versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
vCvt_h2s(SRC[k+15:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPH2PS (VEX.256 Encoded Version)
DEST[31:0] := vCvt_h2s(SRC1[15:0]);
DEST[63:32] := vCvt_h2s(SRC1[31:16]);
DEST[95:64] := vCvt_h2s(SRC1[47:32]);
DEST[127:96] := vCvt_h2s(SRC1[63:48]);
DEST[159:128] := vCvt_h2s(SRC1[79:64]);
DEST[191:160] := vCvt_h2s(SRC1[95:80]);
DEST[223:192] := vCvt_h2s(SRC1[111:96]);
DEST[255:224] := vCvt_h2s(SRC1[127:112]);
DEST[MAXVL-1:256] := 0

VCVTPH2PS (VEX.128 Encoded Version)


DEST[31:0] := vCvt_h2s(SRC1[15:0]);
DEST[63:32] := vCvt_h2s(SRC1[31:16]);
DEST[95:64] := vCvt_h2s(SRC1[47:32]);
DEST[127:96] := vCvt_h2s(SRC1[63:48]);
DEST[MAXVL-1:128] := 0

VCVTPH2PSX DEST, SRC


VL = 128, 256, or 512
KL := VL/32

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.fp32[j] := Convert_fp16_to_fp32(tsrc)
ELSE IF *zeroing*:
DEST.fp32[j] := 0
// else dest.fp32[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2PS __m512 _mm512_cvtph_ps( __m256i a);
VCVTPH2PS __m512 _mm512_mask_cvtph_ps(__m512 s, __mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_maskz_cvtph_ps(__mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_cvt_roundph_ps( __m256i a, int sae);
VCVTPH2PS __m512 _mm512_mask_cvt_roundph_ps(__m512 s, __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m512 _mm512_maskz_cvt_roundph_ps( __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m256 _mm256_mask_cvtph_ps(__m256 s, __mmask8 k, __m128i a);
VCVTPH2PS __m256 _mm256_maskz_cvtph_ps(__mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_mask_cvtph_ps(__m128 s, __mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_maskz_cvtph_ps(__mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_cvtph_ps ( __m128i m1);
VCVTPH2PS __m256 _mm256_cvtph_ps ( __m128i m1)

VCVTPH2PSX __m512 _mm512_cvtx_roundph_ps (__m256h a, int sae);
VCVTPH2PSX __m512 _mm512_mask_cvtx_roundph_ps (__m512 src, __mmask16 k, __m256h a, int sae);
VCVTPH2PSX __m512 _mm512_maskz_cvtx_roundph_ps (__mmask16 k, __m256h a, int sae);
VCVTPH2PSX __m128 _mm_cvtxph_ps (__m128h a);
VCVTPH2PSX __m128 _mm_mask_cvtxph_ps (__m128 src, __mmask8 k, __m128h a);
VCVTPH2PSX __m128 _mm_maskz_cvtxph_ps (__mmask8 k, __m128h a);
VCVTPH2PSX __m256 _mm256_cvtxph_ps (__m128h a);
VCVTPH2PSX __m256 _mm256_mask_cvtxph_ps (__m256 src, __mmask8 k, __m128h a);
VCVTPH2PSX __m256 _mm256_maskz_cvtxph_ps (__mmask8 k, __m128h a);
VCVTPH2PSX __m512 _mm512_cvtxph_ps (__m256h a);
VCVTPH2PSX __m512 _mm512_mask_cvtxph_ps (__m512 src, __mmask16 k, __m256h a);
VCVTPH2PSX __m512 _mm512_maskz_cvtxph_ps (__mmask16 k, __m256h a);
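A minimal sketch of the legacy VEX (F16C) form, which takes the raw FP16 bits in an integer vector (assuming -mf16c; the constants are the FP16 encodings of 1.0, 2.0, -2.0, and 65504):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned short half[8] = {0x3C00, 0x4000, 0xC000, 0x7BFF, 0, 0, 0, 0};
    __m128i raw = _mm_loadu_si128((const __m128i *)half);
    __m256 f = _mm256_cvtph_ps(raw);      /* eight FP16 -> eight FP32 */
    float out[8];
    _mm256_storeu_ps(out, f);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;                             /* prints: 1 2 -2 65504 */
}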

SIMD Floating-Point Exceptions


VEX-encoded instructions: Invalid.
EVEX-encoded instructions: Invalid.
EVEX-encoded instructions with broadcast (VCVTPH2PSX): Invalid, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-26, “Type 11 Class Exception Conditions” (do not report #AC).
EVEX-encoded instructions, see Table 2-62, “Type E11 Class Exception Conditions.”
EVEX-encoded instructions with broadcast (VCVTPH2PSX), see Table 2-46, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

VCVTPH2QQ—Convert Packed FP16 Values to Signed Quadword Integer Values
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.66.MAP5.W0 7B /r
VCVTPH2QQ xmm1{k1}{z}, xmm2/m32/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert two packed FP16 values in xmm2/m32/m16bcst to two signed quadword integers, and store the result in xmm1 subject to writemask k1.

EVEX.256.66.MAP5.W0 7B /r
VCVTPH2QQ ymm1{k1}{z}, xmm2/m64/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed FP16 values in xmm2/m64/m16bcst to four signed quadword integers, and store the result in ymm1 subject to writemask k1.

EVEX.512.66.MAP5.W0 7B /r
VCVTPH2QQ zmm1{k1}{z}, xmm2/m128/m16bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed quadword integers, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Quarter ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed quadword integers in the
destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.

Operation
VCVTPH2QQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 64

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_integer64(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2QQ __m512i _mm512_cvt_roundph_epi64 (__m128h a, int rounding);
VCVTPH2QQ __m512i _mm512_mask_cvt_roundph_epi64 (__m512i src, __mmask8 k, __m128h a, int rounding);
VCVTPH2QQ __m512i _mm512_maskz_cvt_roundph_epi64 (__mmask8 k, __m128h a, int rounding);
VCVTPH2QQ __m128i _mm_cvtph_epi64 (__m128h a);
VCVTPH2QQ __m128i _mm_mask_cvtph_epi64 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2QQ __m128i _mm_maskz_cvtph_epi64 (__mmask8 k, __m128h a);
VCVTPH2QQ __m256i _mm256_cvtph_epi64 (__m128h a);
VCVTPH2QQ __m256i _mm256_mask_cvtph_epi64 (__m256i src, __mmask8 k, __m128h a);
VCVTPH2QQ __m256i _mm256_maskz_cvtph_epi64 (__mmask8 k, __m128h a);
VCVTPH2QQ __m512i _mm512_cvtph_epi64 (__m128h a);
VCVTPH2QQ __m512i _mm512_mask_cvtph_epi64 (__m512i src, __mmask8 k, __m128h a);
VCVTPH2QQ __m512i _mm512_maskz_cvtph_epi64 (__mmask8 k, __m128h a);
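A minimal sketch of the 512-bit form (assumptions as for the other AVX512-FP16 examples):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set1_ph((_Float16)-3.0f);
    __m512i q = _mm512_cvtph_epi64(h);  /* low eight FP16 -> eight int64 */
    long long out[8];
    _mm512_storeu_si512(out, q);
    printf("%lld\n", out[0]);           /* prints: -3 */
    return 0;
}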

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTPH2UDQ—Convert Packed FP16 Values to Unsigned Doubleword Integers
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.NP.MAP5.W0 79 /r
VCVTPH2UDQ xmm1{k1}{z}, xmm2/m64/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned doubleword integers, and store the result in xmm1 subject to writemask k1.

EVEX.256.NP.MAP5.W0 79 /r
VCVTPH2UDQ ymm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned doubleword integers, and store the result in ymm1 subject to writemask k1.

EVEX.512.NP.MAP5.W0 79 /r
VCVTPH2UDQ zmm1{k1}{z}, ymm2/m256/m16bcst {er}
Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 (see note 1)
Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen unsigned doubleword integers, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned doubleword integers in the
destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.



Operation
VCVTPH2UDQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 32

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_unsigned_integer32(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2UDQ __m512i _mm512_cvt_roundph_epu32 (__m256h a, int rounding);
VCVTPH2UDQ __m512i _mm512_mask_cvt_roundph_epu32 (__m512i src, __mmask16 k, __m256h a, int rounding);
VCVTPH2UDQ __m512i _mm512_maskz_cvt_roundph_epu32 (__mmask16 k, __m256h a, int rounding);
VCVTPH2UDQ __m128i _mm_cvtph_epu32 (__m128h a);
VCVTPH2UDQ __m128i _mm_mask_cvtph_epu32 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2UDQ __m128i _mm_maskz_cvtph_epu32 (__mmask8 k, __m128h a);
VCVTPH2UDQ __m256i _mm256_cvtph_epu32 (__m128h a);
VCVTPH2UDQ __m256i _mm256_mask_cvtph_epu32 (__m256i src, __mmask8 k, __m128h a);
VCVTPH2UDQ __m256i _mm256_maskz_cvtph_epu32 (__mmask8 k, __m128h a);
VCVTPH2UDQ __m512i _mm512_cvtph_epu32 (__m256h a);
VCVTPH2UDQ __m512i _mm512_mask_cvtph_epu32 (__m512i src, __mmask16 k, __m256h a);
VCVTPH2UDQ __m512i _mm512_maskz_cvtph_epu32 (__mmask16 k, __m256h a);
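
For illustration only (this sketch is not part of the manual text; it assumes the <immintrin.h> header and a compiler targeting AVX512-FP16, and the helper name fp16_to_u32 is hypothetical), the 512-bit intrinsic above can be used as follows. Rounding follows MXCSR.RC, which is round-to-nearest-even by default:

#include <immintrin.h>

/* Convert sixteen FP16 values to sixteen unsigned doubleword integers. */
__m512i fp16_to_u32(__m256h h)
{
    return _mm512_cvtph_epu32(h);   /* VCVTPH2UDQ zmm, ymm */
}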

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPH2UQQ—Convert Packed FP16 Values to Unsigned Quadword Integers
EVEX.128.66.MAP5.W0 79 /r  VCVTPH2UQQ xmm1{k1}{z}, xmm2/m32/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert two packed FP16 values in xmm2/m32/m16bcst to two unsigned quadword integers, and store the result in xmm1 subject to writemask k1.

EVEX.256.66.MAP5.W0 79 /r  VCVTPH2UQQ ymm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned quadword integers, and store the result in ymm1 subject to writemask k1.

EVEX.512.66.MAP5.W0 79 /r  VCVTPH2UQQ zmm1{k1}{z}, xmm2/m128/m16bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned quadword integers, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Quarter ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned quadword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.



Operation
VCVTPH2UQQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 64

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_unsigned_integer64(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2UQQ __m512i _mm512_cvt_roundph_epu64 (__m128h a, int rounding);
VCVTPH2UQQ __m512i _mm512_mask_cvt_roundph_epu64 (__m512i src, __mmask8 k, __m128h a, int rounding);
VCVTPH2UQQ __m512i _mm512_maskz_cvt_roundph_epu64 (__mmask8 k, __m128h a, int rounding);
VCVTPH2UQQ __m128i _mm_cvtph_epu64 (__m128h a);
VCVTPH2UQQ __m128i _mm_mask_cvtph_epu64 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2UQQ __m128i _mm_maskz_cvtph_epu64 (__mmask8 k, __m128h a);
VCVTPH2UQQ __m256i _mm256_cvtph_epu64 (__m128h a);
VCVTPH2UQQ __m256i _mm256_mask_cvtph_epu64 (__m256i src, __mmask8 k, __m128h a);
VCVTPH2UQQ __m256i _mm256_maskz_cvtph_epu64 (__mmask8 k, __m128h a);
VCVTPH2UQQ __m512i _mm512_cvtph_epu64 (__m128h a);
VCVTPH2UQQ __m512i _mm512_mask_cvtph_epu64 (__m512i src, __mmask8 k, __m128h a);
VCVTPH2UQQ __m512i _mm512_maskz_cvtph_epu64 (__mmask8 k, __m128h a);
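
A zero-masking sketch (not from the manual; it assumes <immintrin.h> and AVX512-FP16, and even_fp16_to_u64 is a name chosen here for illustration): elements whose mask bit is clear are zeroed in the destination, matching the *zeroing* branch of the Operation section above.

#include <immintrin.h>

/* Convert FP16 elements 0, 2, 4, and 6 to unsigned quadwords;
   the other four destination lanes are zeroed. */
__m512i even_fp16_to_u64(__m128h h)
{
    return _mm512_maskz_cvtph_epu64((__mmask8)0x55, h);   /* VCVTPH2UQQ zmm{k1}{z}, xmm */
}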

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPH2UW—Convert Packed FP16 Values to Unsigned Word Integers
EVEX.128.NP.MAP5.W0 7D /r  VCVTPH2UW xmm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert packed FP16 values in xmm2/m128/m16bcst to unsigned word integers, and store the result in xmm1.

EVEX.256.NP.MAP5.W0 7D /r  VCVTPH2UW ymm1{k1}{z}, ymm2/m256/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert packed FP16 values in ymm2/m256/m16bcst to unsigned word integers, and store the result in ymm1.

EVEX.512.NP.MAP5.W0 7D /r  VCVTPH2UW zmm1{k1}{z}, zmm2/m512/m16bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert packed FP16 values in zmm2/m512/m16bcst to unsigned word integers, and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned word integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.

Operation
VCVTPH2UW DEST, SRC
VL = 128, 256 or 512
KL := VL / 16

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_unsigned_integer16(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged



DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2UW __m512i _mm512_cvt_roundph_epu16 (__m512h a, int rounding);
VCVTPH2UW __m512i _mm512_mask_cvt_roundph_epu16 (__m512i src, __mmask32 k, __m512h a, int rounding);
VCVTPH2UW __m512i _mm512_maskz_cvt_roundph_epu16 (__mmask32 k, __m512h a, int rounding);
VCVTPH2UW __m128i _mm_cvtph_epu16 (__m128h a);
VCVTPH2UW __m128i _mm_mask_cvtph_epu16 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2UW __m128i _mm_maskz_cvtph_epu16 (__mmask8 k, __m128h a);
VCVTPH2UW __m256i _mm256_cvtph_epu16 (__m256h a);
VCVTPH2UW __m256i _mm256_mask_cvtph_epu16 (__m256i src, __mmask16 k, __m256h a);
VCVTPH2UW __m256i _mm256_maskz_cvtph_epu16 (__mmask16 k, __m256h a);
VCVTPH2UW __m512i _mm512_cvtph_epu16 (__m512h a);
VCVTPH2UW __m512i _mm512_mask_cvtph_epu16 (__m512i src, __mmask32 k, __m512h a);
VCVTPH2UW __m512i _mm512_maskz_cvtph_epu16 (__mmask32 k, __m512h a);
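
A sketch of the embedded-rounding ({er}) form (not manual text; it assumes <immintrin.h> and AVX512-FP16, and fp16_to_u16_trunc is a hypothetical name): an explicit rounding mode overrides MXCSR.RC for this one instruction, and _MM_FROUND_NO_EXC must accompany it.

#include <immintrin.h>

/* Convert thirty-two FP16 values toward zero, independent of MXCSR.RC. */
__m512i fp16_to_u16_trunc(__m512h h)
{
    return _mm512_cvt_roundph_epu16(h, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}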

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPH2W—Convert Packed FP16 Values to Signed Word Integers
EVEX.128.66.MAP5.W0 7D /r  VCVTPH2W xmm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert packed FP16 values in xmm2/m128/m16bcst to signed word integers, and store the result in xmm1.

EVEX.256.66.MAP5.W0 7D /r  VCVTPH2W ymm1{k1}{z}, ymm2/m256/m16bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert packed FP16 values in ymm2/m256/m16bcst to signed word integers, and store the result in ymm1.

EVEX.512.66.MAP5.W0 7D /r  VCVTPH2W zmm1{k1}{z}, zmm2/m512/m16bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert packed FP16 values in zmm2/m512/m16bcst to signed word integers, and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed word integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.

Operation
VCVTPH2W DEST, SRC
VL = 128, 256 or 512
KL := VL / 16

IF *SRC is a register* and (VL = 512) and (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_integer16(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged



DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPH2W __m512i _mm512_cvt_roundph_epi16 (__m512h a, int rounding);
VCVTPH2W __m512i _mm512_mask_cvt_roundph_epi16 (__m512i src, __mmask32 k, __m512h a, int rounding);
VCVTPH2W __m512i _mm512_maskz_cvt_roundph_epi16 (__mmask32 k, __m512h a, int rounding);
VCVTPH2W __m128i _mm_cvtph_epi16 (__m128h a);
VCVTPH2W __m128i _mm_mask_cvtph_epi16 (__m128i src, __mmask8 k, __m128h a);
VCVTPH2W __m128i _mm_maskz_cvtph_epi16 (__mmask8 k, __m128h a);
VCVTPH2W __m256i _mm256_cvtph_epi16 (__m256h a);
VCVTPH2W __m256i _mm256_mask_cvtph_epi16 (__m256i src, __mmask16 k, __m256h a);
VCVTPH2W __m256i _mm256_maskz_cvtph_epi16 (__mmask16 k, __m256h a);
VCVTPH2W __m512i _mm512_cvtph_epi16 (__m512h a);
VCVTPH2W __m512i _mm512_mask_cvtph_epi16 (__m512i src, __mmask32 k, __m512h a);
VCVTPH2W __m512i _mm512_maskz_cvtph_epi16 (__mmask32 k, __m512h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTPS2PH—Convert Single Precision FP Value to 16-bit FP Value
VEX.128.66.0F3A.W0 1D /r ib  VCVTPS2PH xmm1/m64, xmm2, imm8
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: F16C.
  Convert four packed single precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.

VEX.256.66.0F3A.W0 1D /r ib  VCVTPS2PH xmm1/m128, ymm2, imm8
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: F16C.
  Convert eight packed single precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.

EVEX.128.66.0F3A.W0 1D /r ib  VCVTPS2PH xmm1/m64 {k1}{z}, xmm2, imm8
  Op/En: B.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Convert four packed single precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.

EVEX.256.66.0F3A.W0 1D /r ib  VCVTPS2PH xmm1/m128 {k1}{z}, ymm2, imm8
  Op/En: B.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Convert eight packed single precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.

EVEX.512.66.0F3A.W0 1D /r ib  VCVTPS2PH ymm1/m256 {k1}{z}, zmm2 {sae}, imm8
  Op/En: B.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
  Convert sixteen packed single precision floating-point values in zmm2 to packed half-precision (16-bit) floating-point values in ymm1/m256. Imm8 provides rounding controls.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
B Half Mem ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
Convert packed single precision floating-point values in the source operand to half-precision (16-bit) floating-point values and store the results in the destination operand. The rounding mode is specified using the immediate field (imm8).
Underflow results (i.e., tiny results) are converted to denormals. MXCSR.FTZ is ignored. If a source element is denormal relative to the input format with DM masked and at least one of PM or UM unmasked, a SIMD exception will be raised with DE, UE, and PE set.



Figure 5-7. VCVTPS2PH (128-bit Version): VCVTPS2PH xmm1/mem64, xmm2, imm8. The four single precision values VS3..VS0 in xmm2 (bits 127:0) are each converted, and the resulting half-precision values VH3..VH0 are packed into bits 63:0 of xmm1/mem64.

The immediate byte defines several bit fields that control the rounding operation. The effect and encoding of the RC field are listed in Table 5-3.

Table 5-3. Immediate Byte Encoding for 16-bit Floating-Point Conversion Instructions

Bits      Field Name/Value  Description                 Comment
Imm[1:0]  RC=00B            Round to nearest even       If Imm[2] = 0
          RC=01B            Round down
          RC=10B            Round up
          RC=11B            Truncate
Imm[2]    MS1=0             Use imm[1:0] for rounding   Ignore MXCSR.RC
          MS1=1             Use MXCSR.RC for rounding
Imm[7:3]  Ignored           Ignored by processor

VEX.128 version: The source operand is an XMM register. The destination operand is an XMM register or a 64-bit memory location. If the destination operand is a register, the upper bits (MAXVL-1:64) of the corresponding destination register are zeroed.
VEX.256 version: The source operand is a YMM register. The destination operand is an XMM register or a 128-bit memory location. If the destination operand is a register, the upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location, conditionally updated with writemask k1. Bits (MAXVL-1:256/128/64) of the corresponding destination register are zeroed.

Operation
vCvt_s2h(SRC1[31:0])
{
IF Imm[2] = 0
THEN ; using Imm[1:0] for rounding control, see Table 5-3
RETURN Cvt_Single_Precision_To_Half_Precision_FP_Imm(SRC1[31:0]);
ELSE ; using MXCSR.RC for rounding control
RETURN Cvt_Single_Precision_To_Half_Precision_FP_Mxcsr(SRC1[31:0]);
FI;
}



VCVTPS2PH (EVEX Encoded Versions) When DEST is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] :=
vCvt_s2h(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTPS2PH (EVEX Encoded Versions) When DEST is Memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] :=
vCvt_s2h(SRC[k+31:k])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VCVTPS2PH (VEX.256 Encoded Version)


DEST[15:0] := vCvt_s2h(SRC1[31:0]);
DEST[31:16] := vCvt_s2h(SRC1[63:32]);
DEST[47:32] := vCvt_s2h(SRC1[95:64]);
DEST[63:48] := vCvt_s2h(SRC1[127:96]);
DEST[79:64] := vCvt_s2h(SRC1[159:128]);
DEST[95:80] := vCvt_s2h(SRC1[191:160]);
DEST[111:96] := vCvt_s2h(SRC1[223:192]);
DEST[127:112] := vCvt_s2h(SRC1[255:224]);
DEST[MAXVL-1:128] := 0

VCVTPS2PH (VEX.128 Encoded Version)


DEST[15:0] := vCvt_s2h(SRC1[31:0]);
DEST[31:16] := vCvt_s2h(SRC1[63:32]);
DEST[47:32] := vCvt_s2h(SRC1[95:64]);
DEST[63:48] := vCvt_s2h(SRC1[127:96]);
DEST[MAXVL-1:64] := 0

Flags Affected
None.



Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2PH __m256i _mm512_cvtps_ph(__m512 a);
VCVTPS2PH __m256i _mm512_mask_cvtps_ph(__m256i s, __mmask16 k,__m512 a);
VCVTPS2PH __m256i _mm512_maskz_cvtps_ph(__mmask16 k,__m512 a);
VCVTPS2PH __m256i _mm512_cvt_roundps_ph(__m512 a, const int imm);
VCVTPS2PH __m256i _mm512_mask_cvt_roundps_ph(__m256i s, __mmask16 k,__m512 a, const int imm);
VCVTPS2PH __m256i _mm512_maskz_cvt_roundps_ph(__mmask16 k,__m512 a, const int imm);
VCVTPS2PH __m128i _mm256_mask_cvtps_ph(__m128i s, __mmask8 k,__m256 a);
VCVTPS2PH __m128i _mm256_maskz_cvtps_ph(__mmask8 k,__m256 a);
VCVTPS2PH __m128i _mm_mask_cvtps_ph(__m128i s, __mmask8 k,__m128 a);
VCVTPS2PH __m128i _mm_maskz_cvtps_ph(__mmask8 k,__m128 a);
VCVTPS2PH __m128i _mm_cvtps_ph ( __m128 m1, const int imm);
VCVTPS2PH __m128i _mm256_cvtps_ph(__m256 m1, const int imm);
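
As a usage sketch (not part of the manual; it assumes <immintrin.h> and an F16C-capable target, and round_trip_f16 is a hypothetical name), the imm8 value _MM_FROUND_TO_NEAREST_INT (Imm[2]=0, Imm[1:0]=00B per Table 5-3) selects round-to-nearest-even; the companion VCVTPH2PS intrinsic widens the result back:

#include <immintrin.h>

/* Narrow four floats to FP16 and widen them back, demonstrating
   the precision loss of the 16-bit format. */
__m128 round_trip_f16(__m128 x)
{
    __m128i h = _mm_cvtps_ph(x, _MM_FROUND_TO_NEAREST_INT);  /* VCVTPS2PH */
    return _mm_cvtph_ps(h);                                  /* VCVTPH2PS */
}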

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal (if MXCSR.DAZ=0).

Other Exceptions
VEX-encoded instructions, see Table 2-26, “Type 11 Class Exception Conditions” (do not report #AC);
EVEX-encoded instructions, see Table 2-62, “Type E11 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



VCVTPS2PHX—Convert Packed Single Precision Floating-Point Values to Packed FP16 Values
EVEX.128.66.MAP5.W0 1D /r  VCVTPS2PHX xmm1{k1}{z}, xmm2/m128/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert four packed single precision floating-point values in xmm2/m128/m32bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.66.MAP5.W0 1D /r  VCVTPS2PHX xmm1{k1}{z}, ymm2/m256/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert eight packed single precision floating-point values in ymm2/m256/m32bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.66.MAP5.W0 1D /r  VCVTPS2PHX ymm1{k1}{z}, zmm2/m512/m32bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert sixteen packed single precision floating-point values in zmm2/m512/m32bcst to packed FP16 values, and store the result in ymm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed single precision floating-point values in the source operand to FP16 values and stores the results in the destination operand.
The VCVTPS2PHX instruction supports broadcasting.
This instruction uses MXCSR.DAZ for handling FP32 inputs. FP16 outputs can be normal or denormal numbers, and
are not conditionally flushed based on MXCSR settings.

Operation
VCVTPS2PHX DEST, SRC (AVX512_FP16 Load Version With Broadcast Support)
VL = 128, 256, or 512
KL := VL / 32

IF *SRC is a register* and (VL == 512) and (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp32[0]
ELSE
tsrc := SRC.fp32[j]
DEST.fp16[j] := Convert_fp32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0

// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL/2] := 0

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPS2PHX __m256h _mm512_cvtx_roundps_ph (__m512 a, int rounding);
VCVTPS2PHX __m256h _mm512_mask_cvtx_roundps_ph (__m256h src, __mmask16 k, __m512 a, int rounding);
VCVTPS2PHX __m256h _mm512_maskz_cvtx_roundps_ph (__mmask16 k, __m512 a, int rounding);
VCVTPS2PHX __m128h _mm_cvtxps_ph (__m128 a);
VCVTPS2PHX __m128h _mm_mask_cvtxps_ph (__m128h src, __mmask8 k, __m128 a);
VCVTPS2PHX __m128h _mm_maskz_cvtxps_ph (__mmask8 k, __m128 a);
VCVTPS2PHX __m128h _mm256_cvtxps_ph (__m256 a);
VCVTPS2PHX __m128h _mm256_mask_cvtxps_ph (__m128h src, __mmask8 k, __m256 a);
VCVTPS2PHX __m128h _mm256_maskz_cvtxps_ph (__mmask8 k, __m256 a);
VCVTPS2PHX __m256h _mm512_cvtxps_ph (__m512 a);
VCVTPS2PHX __m256h _mm512_mask_cvtxps_ph (__m256h src, __mmask16 k, __m512 a);
VCVTPS2PHX __m256h _mm512_maskz_cvtxps_ph (__mmask16 k, __m512 a);
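
Unlike VCVTPS2PH, this form takes no immediate; rounding comes from MXCSR.RC or the embedded {er} control. A minimal sketch follows (not from the manual; it assumes <immintrin.h> and AVX512-FP16, and narrow_ps is a name chosen here):

#include <immintrin.h>

/* Narrow sixteen floats to FP16 using the current MXCSR rounding mode. */
__m256h narrow_ps(__m512 x)
{
    return _mm512_cvtxps_ph(x);   /* VCVTPS2PHX ymm, zmm */
}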

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal (if MXCSR.DAZ=0).

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

VCVTPS2QQ—Convert Packed Single Precision Floating-Point Values to Packed Signed
Quadword Integer Values
EVEX.128.66.0F.W0 7B /r  VCVTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 subject to writemask k1.

EVEX.256.66.0F.W0 7B /r  VCVTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 subject to writemask k1.

EVEX.512.66.0F.W0 7B /r  VCVTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512DQ OR AVX10.1 (see Note 1).
  Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts eight packed single precision floating-point values in the source operand to eight signed quadword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
ELSE

IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPS2QQ __m512i _mm512_cvtps_epi64( __m256 a);
VCVTPS2QQ __m512i _mm512_mask_cvtps_epi64( __m512i s, __mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_maskz_cvtps_epi64( __mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_cvt_roundps_epi64( __m256 a, int r);
VCVTPS2QQ __m512i _mm512_mask_cvt_roundps_epi64( __m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2QQ __m512i _mm512_maskz_cvt_roundps_epi64( __mmask8 k, __m256 a, int r);
VCVTPS2QQ __m256i _mm256_cvtps_epi64( __m256 a);
VCVTPS2QQ __m256i _mm256_mask_cvtps_epi64( __m256i s, __mmask8 k, __m256 a);
VCVTPS2QQ __m256i _mm256_maskz_cvtps_epi64( __mmask8 k, __m256 a);
VCVTPS2QQ __m128i _mm_cvtps_epi64( __m128 a);
VCVTPS2QQ __m128i _mm_mask_cvtps_epi64( __m128i s, __mmask8 k, __m128 a);
VCVTPS2QQ __m128i _mm_maskz_cvtps_epi64( __mmask8 k, __m128 a);
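
A sketch of the embedded-rounding form (not manual text; it assumes <immintrin.h> and AVX512DQ, and floor_to_i64 is a hypothetical name). Rounding toward negative infinity per element gives a vectorized floor-and-convert while leaving MXCSR.RC untouched:

#include <immintrin.h>

/* Convert eight floats to signed quadwords, rounding toward -inf. */
__m512i floor_to_i64(__m256 x)
{
    return _mm512_cvt_roundps_epi64(x, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}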

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTPS2UDQ—Convert Packed Single Precision Floating-Point Values to Packed Unsigned
Doubleword Integer Values
EVEX.128.0F.W0 79 /r  VCVTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 subject to writemask k1.

EVEX.256.0F.W0 79 /r  VCVTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 subject to writemask k1.

EVEX.512.0F.W0 79 /r  VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
  Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts sixteen packed single precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2UDQ __m512i _mm512_cvtps_epu32( __m512 a);
VCVTPS2UDQ __m512i _mm512_mask_cvtps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_maskz_cvtps_epu32( __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_cvt_roundps_epu32( __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_mask_cvt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_maskz_cvt_roundps_epu32( __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m256i _mm256_cvtps_epu32( __m256 a);
VCVTPS2UDQ __m256i _mm256_mask_cvtps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTPS2UDQ __m256i _mm256_maskz_cvtps_epu32( __mmask8 k, __m256 a);
VCVTPS2UDQ __m128i _mm_cvtps_epu32( __m128 a);
VCVTPS2UDQ __m128i _mm_mask_cvtps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTPS2UDQ __m128i _mm_maskz_cvtps_epu32( __mmask8 k, __m128 a);
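
To illustrate the masked-invalid behavior described above (a sketch, not manual text; it assumes <immintrin.h>, AVX512F, and the default MXCSR state with all exceptions masked, and invalid_demo is a hypothetical name):

#include <immintrin.h>

/* A negative input cannot be represented as an unsigned integer, so with
   the invalid exception masked every lane returns 2^32 - 1 (0xFFFFFFFF)
   and MXCSR.IE is set. */
__m512i invalid_demo(void)
{
    return _mm512_cvtps_epu32(_mm512_set1_ps(-1.0f));
}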

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTPS2UQQ—Convert Packed Single Precision Floating-Point Values to Packed Unsigned
Quadword Integer Values
EVEX.128.66.0F.W0 79 /r  VCVTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 subject to writemask k1.

EVEX.256.66.0F.W0 79 /r  VCVTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 subject to writemask k1.

EVEX.512.66.0F.W0 79 /r  VCVTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512DQ OR AVX10.1 (see Note 1).
  Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts up to eight packed single precision floating-point values in the source operand to unsigned quadword
integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
ELSE

IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTPS2UQQ __m512i _mm512_cvtps_epu64( __m256 a);
VCVTPS2UQQ __m512i _mm512_mask_cvtps_epu64( __m512i s, __mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_maskz_cvtps_epu64( __mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_cvt_roundps_epu64( __m256 a, int r);
VCVTPS2UQQ __m512i _mm512_mask_cvt_roundps_epu64( __m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m512i _mm512_maskz_cvt_roundps_epu64( __mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m256i _mm256_cvtps_epu64( __m256 a);
VCVTPS2UQQ __m256i _mm256_mask_cvtps_epu64( __m256i s, __mmask8 k, __m256 a);
VCVTPS2UQQ __m256i _mm256_maskz_cvtps_epu64( __mmask8 k, __m256 a);
VCVTPS2UQQ __m128i _mm_cvtps_epu64( __m128 a);
VCVTPS2UQQ __m128i _mm_mask_cvtps_epu64( __m128i s, __mmask8 k, __m128 a);
VCVTPS2UQQ __m128i _mm_maskz_cvtps_epu64( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTQQ2PD—Convert Packed Quadword Integers to Packed Double Precision Floating-Point
Values
EVEX.128.F3.0F.W1 E6 /r  VCVTQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert two packed quadword integers from xmm2/m128/m64bcst to packed double precision floating-point values in xmm1 with writemask k1.

EVEX.256.F3.0F.W1 E6 /r  VCVTQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert four packed quadword integers from ymm2/m256/m64bcst to packed double precision floating-point values in ymm1 with writemask k1.

EVEX.512.F3.0F.W1 E6 /r  VCVTQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512DQ OR AVX10.1 (see Note 1).
  Convert eight packed quadword integers from zmm2/m512/m64bcst to eight packed double precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed quadword integers in the source operand (second operand) to packed double precision floating-
point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTQQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0

FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTQQ2PD (EVEX Encoded Versions) when SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTQQ2PD __m512d _mm512_cvtepi64_pd( __m512i a);
VCVTQQ2PD __m512d _mm512_mask_cvtepi64_pd( __m512d s, __mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_maskz_cvtepi64_pd( __mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_cvt_roundepi64_pd( __m512i a, int r);
VCVTQQ2PD __m512d _mm512_mask_cvt_roundepi64_pd( __m512d s, __mmask8 k, __m512i a, int r);
VCVTQQ2PD __m512d _mm512_maskz_cvt_roundepi64_pd( __mmask8 k, __m512i a, int r);
VCVTQQ2PD __m256d _mm256_mask_cvtepi64_pd( __m256d s, __mmask8 k, __m256i a);
VCVTQQ2PD __m256d _mm256_maskz_cvtepi64_pd( __mmask8 k, __m256i a);
VCVTQQ2PD __m128d _mm_mask_cvtepi64_pd( __m128d s, __mmask8 k, __m128i a);
VCVTQQ2PD __m128d _mm_maskz_cvtepi64_pd( __mmask8 k, __m128i a);
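
A small sketch of the inexact case (not from the manual; it assumes <immintrin.h> and AVX512DQ, and inexact_demo is a hypothetical name): a double has 53 significand bits, so 2^53 + 1 cannot be represented exactly; under round-to-nearest-even it converts to 9007199254740992.0 and the precision (PE) flag is set.

#include <immintrin.h>

/* Convert eight copies of 2^53 + 1; each lane rounds to 2^53. */
__m512d inexact_demo(void)
{
    return _mm512_cvtepi64_pd(_mm512_set1_epi64((1LL << 53) + 1));
}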

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTQQ2PH—Convert Packed Signed Quadword Integers to Packed FP16 Values
EVEX.128.NP.MAP5.W1 5B /r  VCVTQQ2PH xmm1{k1}{z}, xmm2/m128/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert two packed signed quadword integers in xmm2/m128/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.NP.MAP5.W1 5B /r  VCVTQQ2PH xmm1{k1}{z}, ymm2/m256/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Convert four packed signed quadword integers in ymm2/m256/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.NP.MAP5.W1 5B /r  VCVTQQ2PH xmm1{k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert eight packed signed quadword integers in zmm2/m512/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed signed quadword integers in the source operand to packed FP16 values in the desti-
nation operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
If the result of the convert operation is overflow and MXCSR.OM=0 then a SIMD exception will be raised with OE=1,
PE=1.

Operation
VCVTQQ2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 64

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.qword[0]
ELSE
tsrc := SRC.qword[j]
DEST.fp16[j] := Convert_integer64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0

// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL/4] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTQQ2PH __m128h _mm512_cvt_roundepi64_ph (__m512i a, int rounding);
VCVTQQ2PH __m128h _mm512_mask_cvt_roundepi64_ph (__m128h src, __mmask8 k, __m512i a, int rounding);
VCVTQQ2PH __m128h _mm512_maskz_cvt_roundepi64_ph (__mmask8 k, __m512i a, int rounding);
VCVTQQ2PH __m128h _mm_cvtepi64_ph (__m128i a);
VCVTQQ2PH __m128h _mm_mask_cvtepi64_ph (__m128h src, __mmask8 k, __m128i a);
VCVTQQ2PH __m128h _mm_maskz_cvtepi64_ph (__mmask8 k, __m128i a);
VCVTQQ2PH __m128h _mm256_cvtepi64_ph (__m256i a);
VCVTQQ2PH __m128h _mm256_mask_cvtepi64_ph (__m128h src, __mmask8 k, __m256i a);
VCVTQQ2PH __m128h _mm256_maskz_cvtepi64_ph (__mmask8 k, __m256i a);
VCVTQQ2PH __m128h _mm512_cvtepi64_ph (__m512i a);
VCVTQQ2PH __m128h _mm512_mask_cvtepi64_ph (__m128h src, __mmask8 k, __m512i a);
VCVTQQ2PH __m128h _mm512_maskz_cvtepi64_ph (__mmask8 k, __m512i a);
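
A sketch of the overflow behavior noted in the Description (not manual text; it assumes <immintrin.h> and AVX512-FP16, and overflow_demo is a hypothetical name): the largest finite FP16 value is 65504, so converting 65536 with OM masked produces +Inf and sets OE and PE.

#include <immintrin.h>

/* Each of the eight results overflows FP16 and becomes +Inf. */
__m128h overflow_demo(void)
{
    return _mm512_cvtepi64_ph(_mm512_set1_epi64(65536));
}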

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTQQ2PS—Convert Packed Quadword Integers to Packed Single Precision Floating-Point
Values
EVEX.128.0F.W1 5B /r  VCVTQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert two packed quadword integers from xmm2/mem to packed single precision floating-point values in xmm1 with writemask k1.

EVEX.256.0F.W1 5B /r  VCVTQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Convert four packed quadword integers from ymm2/mem to packed single precision floating-point values in xmm1 with writemask k1.

EVEX.512.0F.W1 5B /r  VCVTQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512DQ OR AVX10.1 (see Note 1).
  Convert eight packed quadword integers from zmm2/mem to eight packed single precision floating-point values in ymm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed quadword integers in the source operand (second operand) to packed single precision floating-
point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a YMM/XMM/XMM (lower 64 bits) register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTQQ2PS (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[k+31:k] remains unchanged*
ELSE ; zeroing-masking
DEST[k+31:k] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTQQ2PS (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[k+31:k] remains unchanged*
ELSE ; zeroing-masking
DEST[k+31:k] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTQQ2PS __m256 _mm512_cvtepi64_ps( __m512i a);
VCVTQQ2PS __m256 _mm512_mask_cvtepi64_ps( __m256 s, __mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_maskz_cvtepi64_ps( __mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_cvt_roundepi64_ps( __m512i a, int r);
VCVTQQ2PS __m256 _mm512_mask_cvt_roundepi64_ps( __m256 s, __mmask8 k, __m512i a, int r);
VCVTQQ2PS __m256 _mm512_maskz_cvt_roundepi64_ps( __mmask8 k, __m512i a, int r);
VCVTQQ2PS __m128 _mm256_cvtepi64_ps( __m256i a);
VCVTQQ2PS __m128 _mm256_mask_cvtepi64_ps( __m128 s, __mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm256_maskz_cvtepi64_ps( __mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm_cvtepi64_ps( __m128i a);
VCVTQQ2PS __m128 _mm_mask_cvtepi64_ps( __m128 s, __mmask8 k, __m128i a);
VCVTQQ2PS __m128 _mm_maskz_cvtepi64_ps( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTSD2SH—Convert Low FP64 Value to an FP16 Value
EVEX.LLIG.F2.MAP5.W1 5A /r  VCVTSD2SH xmm1{k1}{z}, xmm2, xmm3/m64 {er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert the low FP64 value in xmm3/m64 to an FP16 value and store the result in the low element of xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction converts the low FP64 value in the second source operand to an FP16 value, and stores the result in
the low element of the destination operand.
When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.

Operation
VCVTSD2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:
    DEST.fp16[0] := Convert_fp64_to_fp16(SRC2.fp64[0])
ELSE IF *zeroing*:
    DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSD2SH __m128h _mm_cvt_roundsd_sh (__m128h a, __m128d b, const int rounding);
VCVTSD2SH __m128h _mm_mask_cvt_roundsd_sh (__m128h src, __mmask8 k, __m128h a, __m128d b, const int rounding);
VCVTSD2SH __m128h _mm_maskz_cvt_roundsd_sh (__mmask8 k, __m128h a, __m128d b, const int rounding);
VCVTSD2SH __m128h _mm_cvtsd_sh (__m128h a, __m128d b);
VCVTSD2SH __m128h _mm_mask_cvtsd_sh (__m128h src, __mmask8 k, __m128h a, __m128d b);
VCVTSD2SH __m128h _mm_maskz_cvtsd_sh (__mmask8 k, __m128h a, __m128d b);



SIMD Floating-Point Exceptions
Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VCVTSD2USI—Convert Scalar Double Precision Floating-Point Value to Unsigned Doubleword
Integer
EVEX.LLIG.F2.0F.W0 79 /r  VCVTSD2USI r32, xmm1/m64{er}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
  Convert one double precision floating-point value from xmm1/m64 to one unsigned doubleword integer in r32.

EVEX.LLIG.F2.0F.W1 79 /r  VCVTSD2USI r64, xmm1/m64{er}
  Op/En: A.  64/32-bit Mode: V/N.E. (see Note 2).  CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
  Convert one double precision floating-point value from xmm1/m64 to one unsigned quadword integer zero-extended into r64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a double precision floating-point value in the source operand (the second operand) to an unsigned
doubleword integer in the destination operand (the first operand). The source operand can be an XMM register or
a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an
XMM register, the double precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

Operation
VCVTSD2USI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
FI

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSD2USI unsigned int _mm_cvtsd_u32(__m128d);
VCVTSD2USI unsigned int _mm_cvt_roundsd_u32(__m128d, int r);
VCVTSD2USI unsigned __int64 _mm_cvtsd_u64(__m128d);
VCVTSD2USI unsigned __int64 _mm_cvt_roundsd_u64(__m128d, int r);
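
A scalar usage sketch (not from the manual; it assumes <immintrin.h> and AVX512F, and d_to_u32 is a hypothetical name). The value is rounded per MXCSR.RC; an out-of-range input returns 2^32 - 1 when the invalid exception is masked:

#include <immintrin.h>

/* Convert one double to an unsigned 32-bit integer. */
unsigned int d_to_u32(double x)
{
    return _mm_cvtsd_u32(_mm_set_sd(x));   /* VCVTSD2USI r32, xmm */
}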

SIMD Floating-Point Exceptions
Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTSH2SD—Convert Low FP16 Value to an FP64 Value
EVEX.LLIG.F3.MAP5.W0 5A /r  VCVTSH2SD xmm1{k1}{z}, xmm2, xmm3/m16 {sae}
  Op/En: A.  64/32-bit Mode: V/V.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
  Convert the low FP16 value in xmm3/m16 to an FP64 value and store the result in the low element of xmm1 subject to writemask k1. Bits 127:64 of xmm2 are copied to xmm1[127:64].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction converts the low FP16 element in the second source operand to an FP64 element in the low element of the destination operand.
Bits 127:64 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP64 element of the destination is updated according
to the writemask.

Operation
VCVTSH2SD dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp64[0] := Convert_fp16_to_fp64(SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp64[0] := 0
// else dest.fp64[0] remains unchanged

DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSH2SD __m128d _mm_cvt_roundsh_sd (__m128d a, __m128h b, const int sae);
VCVTSH2SD __m128d _mm_mask_cvt_roundsh_sd (__m128d src, __mmask8 k, __m128d a, __m128h b, const int sae);
VCVTSH2SD __m128d _mm_maskz_cvt_roundsh_sd (__mmask8 k, __m128d a, __m128h b, const int sae);
VCVTSH2SD __m128d _mm_cvtsh_sd (__m128d a, __m128h b);
VCVTSH2SD __m128d _mm_mask_cvtsh_sd (__m128d src, __mmask8 k, __m128d a, __m128h b);
VCVTSH2SD __m128d _mm_maskz_cvtsh_sd (__mmask8 k, __m128d a, __m128h b);
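
A widening sketch (not manual text; it assumes <immintrin.h> and AVX512-FP16, and widen_sh is a name chosen here). Every FP16 value is exactly representable as FP64, so this conversion is exact and no rounding occurs; the upper element of the result comes from the first source (zero here):

#include <immintrin.h>

/* Widen the low FP16 element of b to a double in the low lane. */
__m128d widen_sh(__m128h b)
{
    return _mm_cvtsh_sd(_mm_setzero_pd(), b);   /* VCVTSH2SD */
}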

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VCVTSH2SI—Convert Low FP16 Value to Signed Integer
EVEX.LLIG.F3.MAP5.W0 2D /r  VCVTSH2SI r32, xmm1/m16 {er}
  Op/En: A.  64/32-bit Mode: V/V (see Note 1).  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 2).
  Convert the low FP16 element in xmm1/m16 to a signed integer and store the result in r32.

EVEX.LLIG.F3.MAP5.W1 2D /r  VCVTSH2SI r64, xmm1/m16 {er}
  Op/En: A.  64/32-bit Mode: V/N.E.  CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 2).
  Convert the low FP16 element in xmm1/m16 to a signed integer and store the result in r64.

NOTES:
1. Outside of 64-bit mode, the EVEX.W field is ignored; the instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts the low FP16 element in the source operand to a signed integer in the destination general
purpose register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer indefinite value is
returned.

Operation
VCVTSH2SI dest, src
IF *SRC is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_integer64(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_integer32(SRC.fp16[0])

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSH2SI int _mm_cvt_roundsh_i32 (__m128h a, int rounding);
VCVTSH2SI __int64 _mm_cvt_roundsh_i64 (__m128h a, int rounding);
VCVTSH2SI int _mm_cvtsh_i32 (__m128h a);
VCVTSH2SI __int64 _mm_cvtsh_i64 (__m128h a);
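
A minimal sketch of the embedded-rounding ({er}) form, assuming AVX512-FP16 support and a compiler with the _Float16 type (the rounding constants are the standard <immintrin.h> names):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set_sh((_Float16)-2.5f);
    /* Current rounding mode (nearest-even by default): -2.5 -> -2. */
    int a = _mm_cvtsh_i32(h);
    /* Embedded rounding toward negative infinity: -2.5 -> -3. */
    int b = _mm_cvt_roundsh_i32(h, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    printf("%d %d\n", a, b);
    return 0;
}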

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”



VCVTSH2SS—Convert Low FP16 Value to FP32 Value
EVEX.LLIG.NP.MAP6.W0 13 /r
VCVTSH2SS xmm1{k1}{z}, xmm2, xmm3/m16 {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert the low FP16 element in xmm3/m16 to an FP32 value and store in the low element
  of xmm1 subject to writemask k1. Bits 127:32 of xmm2 are copied to xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction converts the low FP16 element in the second source operand to the low FP32 element of the
destination operand.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP32 element of the destination is updated according
to the writemask.

Operation
VCVTSH2SS dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp32[0] := Convert_fp16_to_fp32(SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp32[0] := 0
// else dest.fp32[0] remains unchanged

DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSH2SS __m128 _mm_cvt_roundsh_ss (__m128 a, __m128h b, const int sae);
VCVTSH2SS __m128 _mm_mask_cvt_roundsh_ss (__m128 src, __mmask8 k, __m128 a, __m128h b, const int sae);
VCVTSH2SS __m128 _mm_maskz_cvt_roundsh_ss (__mmask8 k, __m128 a, __m128h b, const int sae);
VCVTSH2SS __m128 _mm_cvtsh_ss (__m128 a, __m128h b);
VCVTSH2SS __m128 _mm_mask_cvtsh_ss (__m128 src, __mmask8 k, __m128 a, __m128h b);
VCVTSH2SS __m128 _mm_maskz_cvtsh_ss (__mmask8 k, __m128 a, __m128h b);
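
A minimal sketch contrasting the plain and zeroing-masked forms, assuming AVX512-FP16 support:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set_sh((_Float16)0.5f);
    __m128 up = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* supplies bits 127:32 */
    __m128 r = _mm_cvtsh_ss(up, h);          /* low element = 0.5 */
    /* Zeroing form with mask bit 0 clear: the low element becomes 0.0. */
    __m128 z = _mm_maskz_cvtsh_ss(0x0, up, h);
    printf("%f %f\n", _mm_cvtss_f32(r), _mm_cvtss_f32(z));
    return 0;
}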

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VCVTSH2USI—Convert Low FP16 Value to Unsigned Integer
EVEX.LLIG.F3.MAP5.W0 79 /r
VCVTSH2USI r32, xmm1/m16 {er}
  Op/En: A; 64/32 Bit Mode Support: V/V (Note 1); CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 2)
  Description: Convert the low FP16 element in xmm1/m16 to an unsigned integer and store the result in r32.

EVEX.LLIG.F3.MAP5.W1 79 /r
VCVTSH2USI r64, xmm1/m16 {er}
  Op/En: A; 64/32 Bit Mode Support: V/N.E.; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 2)
  Description: Convert the low FP16 element in xmm1/m16 to an unsigned integer and store the result in r64.

NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts the low FP16 element in the source operand to an unsigned integer in the destination
general purpose register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer indefinite value is
returned.

Operation
VCVTSH2USI dest, src
// SET_RM() sets the rounding mode used for this instruction.
IF *SRC is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_unsigned_integer64(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_unsigned_integer32(SRC.fp16[0])

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSH2USI unsigned int _mm_cvt_roundsh_u32 (__m128h a, int sae);
VCVTSH2USI unsigned __int64 _mm_cvt_roundsh_u64 (__m128h a, int rounding);
VCVTSH2USI unsigned int _mm_cvtsh_u32 (__m128h a);
VCVTSH2USI unsigned __int64 _mm_cvtsh_u64 (__m128h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”



VCVTSI2SH—Convert a Signed Doubleword/Quadword Integer to an FP16 Value
EVEX.LLIG.F3.MAP5.W0 2A /r
VCVTSI2SH xmm1, xmm2, r32/m32 {er}
  Op/En: A; 64/32 Bit Mode Support: V/V (Note 1); CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 2)
  Description: Convert the signed doubleword integer in r32/m32 to an FP16 value and store the result in
  xmm1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

EVEX.LLIG.F3.MAP5.W1 2A /r
VCVTSI2SH xmm1, xmm2, r64/m64 {er}
  Op/En: A; 64/32 Bit Mode Support: V/N.E.; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 2)
  Description: Convert the signed quadword integer in r64/m64 to an FP16 value and store the result in
  xmm1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the
second source operand to an FP16 value in the destination operand. The result is stored in the low word of the
destination operand. When the conversion is inexact, the value returned is rounded according to the rounding control
bits in the MXCSR register or the embedded rounding controls.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits 127:16 of the XMM register destination are copied from the
corresponding bits in the first source operand. Bits MAXVL-1:128 of the destination register are zeroed.
If the result of the conversion overflows and MXCSR.OM=0, a SIMD floating-point exception is raised with OE=1 and
PE=1.

Operation
VCVTSI2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

IF 64-mode and OperandSize == 64:
DEST.fp16[0] := Convert_integer64_to_fp16(SRC2.qword)
ELSE:
DEST.fp16[0] := Convert_integer32_to_fp16(SRC2.dword)

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VCVTSI2SH __m128h _mm_cvt_roundi32_sh (__m128h a, int b, int rounding);
VCVTSI2SH __m128h _mm_cvt_roundi64_sh (__m128h a, __int64 b, int rounding);
VCVTSI2SH __m128h _mm_cvti32_sh (__m128h a, int b);
VCVTSI2SH __m128h _mm_cvti64_sh (__m128h a, __int64 b);
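
A minimal sketch, assuming AVX512-FP16 support; the chosen input makes the FP16 rounding visible, since FP16 has an 11-bit significand:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h pass = _mm_set_sh((_Float16)0.0f);
    /* 2049 is not representable in FP16; under round-to-nearest-even it
       rounds to 2048 (the ulp at this magnitude is 2). */
    __m128h r = _mm_cvti32_sh(pass, 2049);
    printf("%f\n", (double)_mm_cvtsh_h(r));   /* expected: 2048.000000 */
    return 0;
}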

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”



VCVTSS2SH—Convert Low FP32 Value to an FP16 Value
EVEX.LLIG.NP.MAP5.W0 1D /r
VCVTSS2SH xmm1{k1}{z}, xmm2, xmm3/m32 {er}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert low FP32 value in xmm3/m32 to an FP16 value and store in the low element of
  xmm1 subject to writemask k1. Bits 127:16 from xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction converts the low FP32 value in the second source operand to an FP16 value in the low element of the
destination operand.
When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.

Operation
VCVTSS2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:
DEST.fp16[0] := Convert_fp32_to_fp16(SRC2.fp32[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTSS2SH __m128h _mm_cvt_roundss_sh (__m128h a, __m128 b, const int rounding);
VCVTSS2SH __m128h _mm_mask_cvt_roundss_sh (__m128h src, __mmask8 k, __m128h a, __m128 b, const int rounding);
VCVTSS2SH __m128h _mm_maskz_cvt_roundss_sh (__mmask8 k, __m128h a, __m128 b, const int rounding);
VCVTSS2SH __m128h _mm_cvtss_sh (__m128h a, __m128 b);
VCVTSS2SH __m128h _mm_mask_cvtss_sh (__m128h src, __mmask8 k, __m128h a, __m128 b);
VCVTSS2SH __m128h _mm_maskz_cvtss_sh (__mmask8 k, __m128h a, __m128 b);
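
A minimal sketch, assuming AVX512-FP16 support; it shows the precision loss inherent in the FP32-to-FP16 narrowing:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 s = _mm_set_ss(1.0f / 3.0f);
    __m128h pass = _mm_set_sh((_Float16)0.0f);
    __m128h r = _mm_cvtss_sh(pass, s);   /* 0.3333... rounded to FP16 */
    /* FP16 keeps roughly 3 significant decimal digits. */
    printf("%.6f\n", (double)_mm_cvtsh_h(r));
    return 0;
}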



SIMD Floating-Point Exceptions
Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VCVTSS2USI—Convert Scalar Single Precision Floating-Point Value to Unsigned Doubleword
Integer
EVEX.LLIG.F3.0F.W0 79 /r
VCVTSS2USI r32, xmm1/m32{er}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1)
  Description: Convert one single precision floating-point value from xmm1/m32 to one unsigned
  doubleword integer in r32.

EVEX.LLIG.F3.0F.W1 79 /r
VCVTSS2USI r64, xmm1/m32{er}
  Op/En: A; 64/32 Bit Mode Support: V/N.E. (Note 2); CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1)
  Description: Convert one single precision floating-point value from xmm1/m32 to one unsigned
  quadword integer in r64.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a single precision floating-point value in the source operand (the second operand) to an unsigned
doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first
operand). The source operand can be an XMM register or a memory location. The destination operand is a general-
purpose register. When the source operand is an XMM register, the single precision floating-point value is contained
in the low doubleword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is
returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTSS2USI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent
VCVTSS2USI unsigned _mm_cvtss_u32( __m128 a);
VCVTSS2USI unsigned _mm_cvt_roundss_u32( __m128 a, int r);
VCVTSS2USI unsigned __int64 _mm_cvtss_u64( __m128 a);
VCVTSS2USI unsigned __int64 _mm_cvt_roundss_u64( __m128 a, int r);
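
A minimal sketch, assuming AVX512F support; the second conversion shows the masked-#I behavior described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned int a = _mm_cvtss_u32(_mm_set_ss(3.5f));   /* nearest-even: 4 */
    /* -1.0 cannot be represented as an unsigned integer; with the invalid
       exception masked, the result is 2^32 - 1 per the description above. */
    unsigned int b = _mm_cvtss_u32(_mm_set_ss(-1.0f));
    printf("%u 0x%08X\n", a, b);   /* expected: 4 0xFFFFFFFF */
    return 0;
}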

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTTPD2QQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Quadword Integers
EVEX.128.66.0F.W1 7A /r
VCVTTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two
  packed quadword integers in xmm1 using truncation with writemask k1.

EVEX.256.66.0F.W1 7A /r
VCVTTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four
  packed quadword integers in ymm1 using truncation with writemask k1.

EVEX.512.66.0F.W1 7A /r
VCVTTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1)
  Description: Convert eight packed double precision floating-point values from zmm2/m512/m64bcst to eight
  packed quadword integers in zmm1 using truncation with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation packed double precision floating-point values in the source operand (second operand) to
packed quadword integers in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2QQ (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPD2QQ (EVEX Encoded Version) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[63:0])
ELSE
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPD2QQ __m512i _mm512_cvttpd_epi64( __m512d a);
VCVTTPD2QQ __m512i _mm512_mask_cvttpd_epi64( __m512i s, __mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_maskz_cvttpd_epi64( __mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_cvtt_roundpd_epi64( __m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_mask_cvtt_roundpd_epi64( __m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_maskz_cvtt_roundpd_epi64( __mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m256i _mm256_mask_cvttpd_epi64( __m256i s, __mmask8 k, __m256d a);
VCVTTPD2QQ __m256i _mm256_maskz_cvttpd_epi64( __mmask8 k, __m256d a);
VCVTTPD2QQ __m128i _mm_mask_cvttpd_epi64( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2QQ __m128i _mm_maskz_cvttpd_epi64( __mmask8 k, __m128d a);
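
A minimal sketch of the truncating conversion and merge-masking, assuming AVX512DQ (or AVX10.1) support (e.g., compile with -mavx512dq):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d v = _mm512_set1_pd(-2.9);
    __m512i t = _mm512_cvttpd_epi64(v);      /* truncation: -2 in every lane */
    /* Merge-masking: only lanes selected by 0x55 are converted; the other
       lanes keep their values from 's'. */
    __m512i s = _mm512_set1_epi64(99);
    __m512i m = _mm512_mask_cvttpd_epi64(s, 0x55, v);
    long long tout[8], mout[8];
    _mm512_storeu_si512(tout, t);
    _mm512_storeu_si512(mout, m);
    printf("%lld %lld %lld\n", tout[0], mout[0], mout[1]);   /* expected: -2 -2 99 */
    return 0;
}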

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTPD2UDQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Unsigned Doubleword Integers
EVEX.128.0F.W1 78 /r
VCVTTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Description: Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two
  unsigned doubleword integers in xmm1 using truncation subject to writemask k1.

EVEX.256.0F.W1 78 /r
VCVTTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Description: Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four
  unsigned doubleword integers in xmm1 using truncation subject to writemask k1.

EVEX.512.0F.W1 78 /r
VCVTTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1)
  Description: Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight
  unsigned doubleword integers in ymm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation packed double precision floating-point values in the source operand (the second operand)
to packed unsigned doubleword integers in the destination operand (the first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register
conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are
zeroed.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2UDQ (EVEX Encoded Versions) When SRC2 Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*

ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTTPD2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256),(8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPD2UDQ __m256i _mm512_cvttpd_epu32( __m512d a);
VCVTTPD2UDQ __m256i _mm512_mask_cvttpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_maskz_cvttpd_epu32( __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_cvtt_roundpd_epu32( __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_mask_cvtt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_maskz_cvtt_roundpd_epu32( __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m128i _mm256_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm256_maskz_cvttpd_epu32( __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UDQ __m128i _mm_maskz_cvttpd_epu32( __mmask8 k, __m128d a);
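
A minimal sketch, assuming AVX512F support; note the narrowing result type (eight FP64 inputs produce a 256-bit integer vector):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d v = _mm512_set1_pd(3.999);
    __m256i r = _mm512_cvttpd_epu32(v);      /* truncation: 3 in every lane */
    unsigned int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%u\n", out[0]);                  /* expected: 3 */
    return 0;
}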

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTPD2UQQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Unsigned Quadword Integers
EVEX.128.66.0F.W1 78 /r
VCVTTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two
  packed unsigned quadword integers in xmm1 using truncation with writemask k1.

EVEX.256.66.0F.W1 78 /r
VCVTTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four
  packed unsigned quadword integers in ymm1 using truncation with writemask k1.

EVEX.512.66.0F.W1 78 /r
VCVTTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1)
  Description: Convert eight packed double precision floating-point values from zmm2/m512/m64bcst to eight
  packed unsigned quadword integers in zmm1 using truncation with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation packed double precision floating-point values in the source operand (second operand) to
packed unsigned quadword integers in the destination operand (first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPD2UQQ _mm<size>[_mask[z]]_cvtt[_round]pd_epu64
VCVTTPD2UQQ __m512i _mm512_cvttpd_epu64( __m512d a);
VCVTTPD2UQQ __m512i _mm512_mask_cvttpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_maskz_cvttpd_epu64( __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_cvtt_roundpd_epu64( __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_mask_cvtt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_maskz_cvtt_roundpd_epu64( __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m256i _mm256_mask_cvttpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTTPD2UQQ __m256i _mm256_maskz_cvttpd_epu64( __mmask8 k, __m256d a);
VCVTTPD2UQQ __m128i _mm_mask_cvttpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UQQ __m128i _mm_maskz_cvttpd_epu64( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTPH2DQ—Convert with Truncation Packed FP16 Values to Signed Doubleword Integers
EVEX.128.F3.MAP5.W0 5B /r
VCVTTPH2DQ xmm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert four packed FP16 values in xmm2/m64/m16bcst to four signed doubleword integers,
  and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.F3.MAP5.W0 5B /r
VCVTTPH2DQ ymm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed doubleword
  integers, and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.F3.MAP5.W0 5B /r
VCVTTPH2DQ zmm1{k1}{z}, ymm2/m256/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen signed doubleword
  integers, and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed doubleword integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2DQ dest, src
VL = 128, 256 or 512
KL := VL / 32

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_integer32_truncate(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2DQ __m512i _mm512_cvtt_roundph_epi32 (__m256h a, int sae);
VCVTTPH2DQ __m512i _mm512_mask_cvtt_roundph_epi32 (__m512i src, __mmask16 k, __m256h a, int sae);
VCVTTPH2DQ __m512i _mm512_maskz_cvtt_roundph_epi32 (__mmask16 k, __m256h a, int sae);
VCVTTPH2DQ __m128i _mm_cvttph_epi32 (__m128h a);
VCVTTPH2DQ __m128i _mm_mask_cvttph_epi32 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2DQ __m128i _mm_maskz_cvttph_epi32 (__mmask8 k, __m128h a);
VCVTTPH2DQ __m256i _mm256_cvttph_epi32 (__m128h a);
VCVTTPH2DQ __m256i _mm256_mask_cvttph_epi32 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2DQ __m256i _mm256_maskz_cvttph_epi32 (__mmask8 k, __m128h a);
VCVTTPH2DQ __m512i _mm512_cvttph_epi32 (__m256h a);
VCVTTPH2DQ __m512i _mm512_mask_cvttph_epi32 (__m512i src, __mmask16 k, __m256h a);
VCVTTPH2DQ __m512i _mm512_maskz_cvttph_epi32 (__mmask16 k, __m256h a);
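
A minimal sketch, assuming AVX512-FP16 support; sixteen FP16 lanes in a __m256h widen to sixteen doubleword integers:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256h v = _mm256_set1_ph((_Float16)-7.75f);
    __m512i r = _mm512_cvttph_epi32(v);      /* truncation: -7 in every lane */
    int out[16];
    _mm512_storeu_si512(out, r);
    printf("%d\n", out[0]);                  /* expected: -7 */
    return 0;
}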

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTTPH2QQ—Convert with Truncation Packed FP16 Values to Signed Quadword Integers
EVEX.128.66.MAP5.W0 7A /r
VCVTTPH2QQ xmm1{k1}{z}, xmm2/m32/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert two packed FP16 values in xmm2/m32/m16bcst to two signed quadword integers,
  and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.66.MAP5.W0 7A /r
VCVTTPH2QQ ymm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert four packed FP16 values in xmm2/m64/m16bcst to four signed quadword integers,
  and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.66.MAP5.W0 7A /r
VCVTTPH2QQ zmm1{k1}{z}, xmm2/m128/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed quadword integers,
  and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Quarter ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed quadword integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the indefinite integer value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2QQ dest, src
VL = 128, 256 or 512
KL := VL / 64

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_integer64_truncate(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2QQ __m512i _mm512_cvtt_roundph_epi64 (__m128h a, int sae);
VCVTTPH2QQ __m512i _mm512_mask_cvtt_roundph_epi64 (__m512i src, __mmask8 k, __m128h a, int sae);
VCVTTPH2QQ __m512i _mm512_maskz_cvtt_roundph_epi64 (__mmask8 k, __m128h a, int sae);
VCVTTPH2QQ __m128i _mm_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m128i _mm_mask_cvttph_epi64 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m128i _mm_maskz_cvttph_epi64 (__mmask8 k, __m128h a);
VCVTTPH2QQ __m256i _mm256_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m256i _mm256_mask_cvttph_epi64 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m256i _mm256_maskz_cvttph_epi64 (__mmask8 k, __m128h a);
VCVTTPH2QQ __m512i _mm512_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m512i _mm512_mask_cvttph_epi64 (__m512i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m512i _mm512_maskz_cvttph_epi64 (__mmask8 k, __m128h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTTPH2UDQ—Convert with Truncation Packed FP16 Values to Unsigned Doubleword
Integers
EVEX.128.NP.MAP5.W0 78 /r
VCVTTPH2UDQ xmm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned doubleword
  integers, and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.NP.MAP5.W0 78 /r
VCVTTPH2UDQ ymm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned doubleword
  integers, and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.NP.MAP5.W0 78 /r
VCVTTPH2UDQ zmm1{k1}{z}, ymm2/m256/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen unsigned doubleword
  integers, and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned doubleword integers in the
destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2UDQ dest, src
VL = 128, 256 or 512
KL := VL / 32

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_unsigned_integer32_truncate(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2UDQ __m512i _mm512_cvtt_roundph_epu32 (__m256h a, int sae);
VCVTTPH2UDQ __m512i _mm512_mask_cvtt_roundph_epu32 (__m512i src, __mmask16 k, __m256h a, int sae);
VCVTTPH2UDQ __m512i _mm512_maskz_cvtt_roundph_epu32 (__mmask16 k, __m256h a, int sae);
VCVTTPH2UDQ __m128i _mm_cvttph_epu32 (__m128h a);
VCVTTPH2UDQ __m128i _mm_mask_cvttph_epu32 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2UDQ __m128i _mm_maskz_cvttph_epu32 (__mmask8 k, __m128h a);
VCVTTPH2UDQ __m256i _mm256_cvttph_epu32 (__m128h a);
VCVTTPH2UDQ __m256i _mm256_mask_cvttph_epu32 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2UDQ __m256i _mm256_maskz_cvttph_epu32 (__mmask8 k, __m128h a);
VCVTTPH2UDQ __m512i _mm512_cvttph_epu32 (__m256h a);
VCVTTPH2UDQ __m512i _mm512_mask_cvttph_epu32 (__m512i src, __mmask16 k, __m256h a);
VCVTTPH2UDQ __m512i _mm512_maskz_cvttph_epu32 (__mmask16 k, __m256h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTTPH2UQQ—Convert with Truncation Packed FP16 Values to Unsigned Quadword Integers
EVEX.128.66.MAP5.W0 78 /r
VCVTTPH2UQQ xmm1{k1}{z}, xmm2/m32/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert two packed FP16 values in xmm2/m32/m16bcst to two unsigned quadword integers,
  and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.66.MAP5.W0 78 /r
VCVTTPH2UQQ ymm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned quadword integers,
  and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.66.MAP5.W0 78 /r
VCVTTPH2UQQ zmm1{k1}{z}, xmm2/m128/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned quadword
  integers, and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Quarter ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned quadword integers in the
destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2UQQ dest, src
VL = 128, 256 or 512
KL := VL / 64

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_unsigned_integer64_truncate(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2UQQ __m512i _mm512_cvtt_roundph_epu64 (__m128h a, int sae);
VCVTTPH2UQQ __m512i _mm512_mask_cvtt_roundph_epu64 (__m512i src, __mmask8 k, __m128h a, int sae);
VCVTTPH2UQQ __m512i _mm512_maskz_cvtt_roundph_epu64 (__mmask8 k, __m128h a, int sae);
VCVTTPH2UQQ __m128i _mm_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m128i _mm_mask_cvttph_epu64 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m128i _mm_maskz_cvttph_epu64 (__mmask8 k, __m128h a);
VCVTTPH2UQQ __m256i _mm256_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m256i _mm256_mask_cvttph_epu64 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m256i _mm256_maskz_cvttph_epu64 (__mmask8 k, __m128h a);
VCVTTPH2UQQ __m512i _mm512_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m512i _mm512_mask_cvttph_epu64 (__m512i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m512i _mm512_maskz_cvttph_epu64 (__mmask8 k, __m128h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTTPH2UW—Convert Packed FP16 Values to Unsigned Word Integers
EVEX.128.NP.MAP5.W0 7C /r
VCVTTPH2UW xmm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned word integers,
  and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.NP.MAP5.W0 7C /r
VCVTTPH2UW ymm1{k1}{z}, ymm2/m256/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen unsigned word
  integers, and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.NP.MAP5.W0 7C /r
VCVTTPH2UW zmm1{k1}{z}, zmm2/m512/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert thirty-two packed FP16 values in zmm2/m512/m16bcst to thirty-two unsigned word
  integers, and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to unsigned word integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2UW dest, src
VL = 128, 256 or 512
KL := VL / 16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_unsigned_integer16_truncate(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2UW __m512i _mm512_cvtt_roundph_epu16 (__m512h a, int sae);
VCVTTPH2UW __m512i _mm512_mask_cvtt_roundph_epu16 (__m512i src, __mmask32 k, __m512h a, int sae);
VCVTTPH2UW __m512i _mm512_maskz_cvtt_roundph_epu16 (__mmask32 k, __m512h a, int sae);
VCVTTPH2UW __m128i _mm_cvttph_epu16 (__m128h a);
VCVTTPH2UW __m128i _mm_mask_cvttph_epu16 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2UW __m128i _mm_maskz_cvttph_epu16 (__mmask8 k, __m128h a);
VCVTTPH2UW __m256i _mm256_cvttph_epu16 (__m256h a);
VCVTTPH2UW __m256i _mm256_mask_cvttph_epu16 (__m256i src, __mmask16 k, __m256h a);
VCVTTPH2UW __m256i _mm256_maskz_cvttph_epu16 (__mmask16 k, __m256h a);
VCVTTPH2UW __m512i _mm512_cvttph_epu16 (__m512h a);
VCVTTPH2UW __m512i _mm512_mask_cvttph_epu16 (__m512i src, __mmask32 k, __m512h a);
VCVTTPH2UW __m512i _mm512_maskz_cvttph_epu16 (__mmask32 k, __m512h a);
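
A minimal sketch, assuming AVX512-FP16 support; the element count and the 32-bit mask width follow from the word-sized results:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512h v = _mm512_set1_ph((_Float16)65.5f);
    __m512i r = _mm512_cvttph_epu16(v);      /* truncation: 65 in every lane */
    unsigned short out[32];
    _mm512_storeu_si512(out, r);
    printf("%u\n", (unsigned)out[0]);        /* expected: 65 */
    return 0;
}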

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTTPH2W—Convert Packed FP16 Values to Signed Word Integers
EVEX.128.66.MAP5.W0 7C /r
VCVTTPH2W xmm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed word integers,
  and store the result in xmm1 using truncation subject to writemask k1.

EVEX.256.66.MAP5.W0 7C /r
VCVTTPH2W ymm1{k1}{z}, ymm2/m256/m16bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Description: Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen signed word integers,
  and store the result in ymm1 using truncation subject to writemask k1.

EVEX.512.66.MAP5.W0 7C /r
VCVTTPH2W zmm1{k1}{z}, zmm2/m512/m16bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1)
  Description: Convert thirty-two packed FP16 values in zmm2/m512/m16bcst to thirty-two signed word
  integers, and store the result in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed FP16 values in the source operand to signed word integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.

Operation
VCVTTPH2W dest, src
VL = 128, 256 or 512
KL := VL / 16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_integer16_truncate(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2W __m512i _mm512_cvtt_roundph_epi16 (__m512h a, int sae);
VCVTTPH2W __m512i _mm512_mask_cvtt_roundph_epi16 (__m512i src, __mmask32 k, __m512h a, int sae);
VCVTTPH2W __m512i _mm512_maskz_cvtt_roundph_epi16 (__mmask32 k, __m512h a, int sae);
VCVTTPH2W __m128i _mm_cvttph_epi16 (__m128h a);
VCVTTPH2W __m128i _mm_mask_cvttph_epi16 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2W __m128i _mm_maskz_cvttph_epi16 (__mmask8 k, __m128h a);
VCVTTPH2W __m256i _mm256_cvttph_epi16 (__m256h a);
VCVTTPH2W __m256i _mm256_mask_cvttph_epi16 (__m256i src, __mmask16 k, __m256h a);
VCVTTPH2W __m256i _mm256_maskz_cvttph_epi16 (__mmask16 k, __m256h a);
VCVTTPH2W __m512i _mm512_cvttph_epi16 (__m512h a);
VCVTTPH2W __m512i _mm512_mask_cvttph_epi16 (__m512i src, __mmask32 k, __m512h a);
VCVTTPH2W __m512i _mm512_maskz_cvttph_epi16 (__mmask32 k, __m512h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VCVTTPS2QQ—Convert With Truncation Packed Single Precision Floating-Point Values to
Packed Signed Quadword Integer Values
EVEX.128.66.0F.W0 7A /r
VCVTTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two
  packed signed quadword values in xmm1 using truncation subject to writemask k1.

EVEX.256.66.0F.W0 7A /r
VCVTTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1)
  Description: Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four
  packed signed quadword values in ymm1 using truncation subject to writemask k1.

EVEX.512.66.0F.W0 7A /r
VCVTTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1)
  Description: Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight
  packed signed quadword values in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation packed single precision floating-point values in the source operand to eight signed
quadword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit
memory location. The destination operand is a vector register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPS2QQ __m512i _mm512_cvttps_epi64( __m256 a);
VCVTTPS2QQ __m512i _mm512_mask_cvttps_epi64( __m512i s, __mmask16 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_maskz_cvttps_epi64( __mmask16 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_cvtt_roundps_epi64( __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_mask_cvtt_roundps_epi64( __m512i s, __mmask16 k, __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_maskz_cvtt_roundps_epi64( __mmask16 k, __m256 a, int sae);
VCVTTPS2QQ __m256i _mm256_mask_cvttps_epi64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m256i _mm256_maskz_cvttps_epi64( __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_mask_cvttps_epi64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_maskz_cvttps_epi64( __mmask8 k, __m128 a);
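
A minimal sketch, assuming AVX512DQ support; eight FP32 inputs in a __m256 widen to eight signed quadword integers:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 1.0e10 is exactly representable as a float and exceeds the int32
       range, which is what the 64-bit destination elements are for. */
    __m256 v = _mm256_set1_ps(1.0e10f);
    __m512i r = _mm512_cvttps_epi64(v);
    long long out[8];
    _mm512_storeu_si512(out, r);
    printf("%lld\n", out[0]);                /* expected: 10000000000 */
    return 0;
}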

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTPS2UDQ—Convert With Truncation Packed Single Precision Floating-Point Values to
Packed Unsigned Doubleword Integer Values
EVEX.128.0F.W0 78 /r
VCVTTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Description: Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four
  packed unsigned doubleword values in xmm1 using truncation subject to writemask k1.

EVEX.256.0F.W0 78 /r
VCVTTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Description: Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight
  packed unsigned doubleword values in ymm1 using truncation subject to writemask k1.

EVEX.512.0F.W0 78 /r
VCVTTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst {sae}
  Op/En: A; 64/32 Bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1)
  Description: Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to
  sixteen packed unsigned doubleword values in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation packed single precision floating-point values in the source operand to unsigned
doubleword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or
a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPS2UDQ __m512i _mm512_cvttps_epu32( __m512 a);
VCVTTPS2UDQ __m512i _mm512_mask_cvttps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_maskz_cvttps_epu32( __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_cvtt_roundps_epu32( __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_mask_cvtt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_maskz_cvtt_roundps_epu32( __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m256i _mm256_mask_cvttps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTTPS2UDQ __m256i _mm256_maskz_cvttps_epu32( __mmask8 k, __m256 a);
VCVTTPS2UDQ __m128i _mm_mask_cvttps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UDQ __m128i _mm_maskz_cvttps_epu32( __mmask8 k, __m128 a);
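
The saturating out-of-range behavior described above can be observed with a short hedged sketch (assumes AVX512F support and the MXCSR default of masked exceptions; the values are arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* With the invalid exception masked (the MXCSR default), out-of-range
       inputs such as negatives or 5e9 return 2^32 - 1 = 4294967295. */
    __m512 src = _mm512_setr_ps(3.7f, -1.0f, 5.0e9f, 0.0f,
                                1.0f, 2.0f, 3.0f, 4.0f,
                                5.0f, 6.0f, 7.0f, 8.0f,
                                9.0f, 10.0f, 11.0f, 12.0f);
    unsigned int out[16];
    _mm512_storeu_si512((void *)out, _mm512_cvttps_epu32(src));
    printf("%u %u %u\n", out[0], out[1], out[2]);   /* 3 4294967295 4294967295 */
    return 0;
}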

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTPS2UQQ—Convert With Truncation Packed Single Precision Floating-Point Values to
Packed Unsigned Quadword Integer Values
EVEX.128.66.0F.W0 78 /r
VCVTTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 using truncation subject to writemask k1.

EVEX.256.66.0F.W0 78 /r
VCVTTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 using truncation subject to writemask k1.

EVEX.512.66.0F.W0 78 /r
VCVTTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst {sae}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512DQ OR AVX10.1 (see note 1)
Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 using truncation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation up to eight packed single precision floating-point values in the source operand to
unsigned quadword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit
memory location, or a 256/128/64-bit vector broadcast from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
VCVTTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTPS2UQQ _mm<size>[_mask[z]]_cvtt[_round]ps_epu64
VCVTTPS2UQQ __m512i _mm512_cvttps_epu64( __m256 a);
VCVTTPS2UQQ __m512i _mm512_mask_cvttps_epu64( __m512i s, __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_maskz_cvttps_epu64( __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_cvtt_roundps_epu64( __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_mask_cvtt_roundps_epu64( __m512i s, __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_maskz_cvtt_roundps_epu64( __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m256i _mm256_mask_cvttps_epu64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m256i _mm256_maskz_cvttps_epu64( __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_mask_cvttps_epu64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_maskz_cvttps_epu64( __mmask8 k, __m128 a);
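
As a sketch of the {sae} form (assuming AVX512DQ; the input value is arbitrary), _MM_FROUND_NO_EXC converts with truncation while suppressing exception reporting:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 src = _mm256_set1_ps(2.999f);
    /* {sae}: suppress-all-exceptions form; the inexact result is not reported. */
    __m512i dst = _mm512_cvtt_roundps_epu64(src, _MM_FROUND_NO_EXC);
    unsigned long long out[8];
    _mm512_storeu_si512((void *)out, dst);
    printf("%llu\n", out[0]);   /* 2: truncation, independent of MXCSR.RC */
    return 0;
}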

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTTSD2USI—Convert With Truncation Scalar Double Precision Floating-Point Value to
Unsigned Integer
EVEX.LLIG.F2.0F.W0 78 /r
VCVTTSD2USI r32, xmm1/m64{sae}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one double precision floating-point value from xmm1/m64 to one unsigned doubleword integer in r32 using truncation.

EVEX.LLIG.F2.0F.W1 78 /r
VCVTTSD2USI r64, xmm1/m64{sae}
Op/En: A | 64/32 bit mode support: V/N.E. (see note 2) | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one double precision floating-point value from xmm1/m64 to one unsigned quadword integer zero-extended into r64 using truncation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation a double precision floating-point value in the source operand (the second operand) to an
unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand
(the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination
operand is a general-purpose register. When the source operand is an XMM register, the double precision floating-
point value is contained in the low quadword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.

Operation
VCVTTSD2USI (EVEX Encoded Version)
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
FI

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSD2USI unsigned int _mm_cvttsd_u32(__m128d);
VCVTTSD2USI unsigned int _mm_cvtt_roundsd_u32(__m128d, int sae);
VCVTTSD2USI unsigned __int64 _mm_cvttsd_u64(__m128d);
VCVTTSD2USI unsigned __int64 _mm_cvtt_roundsd_u64(__m128d, int sae);
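
A scalar sketch (assuming AVX512F; the _u64 form additionally requires 64-bit mode, and the input value is arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_sd(3.99);
    unsigned int lo = _mm_cvttsd_u32(a);           /* truncates toward zero: 3 */
    unsigned long long wide = _mm_cvttsd_u64(a);   /* 3; 64-bit mode only */
    printf("%u %llu\n", lo, wide);
    return 0;
}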

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTTSH2SI—Convert with Truncation Low FP16 Value to a Signed Integer
EVEX.LLIG.F3.MAP5.W0 2C /r
VCVTTSH2SI r32, xmm1/m16 {sae}
Op/En: A | 64/32 bit mode support: V/V (see note 1) | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 2)
Convert FP16 value in the low element of xmm1/m16 to a signed integer and store the result in r32 using truncation.

EVEX.LLIG.F3.MAP5.W1 2C /r
VCVTTSH2SI r64, xmm1/m16 {sae}
Op/En: A | 64/32 bit mode support: V/N.E. | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 2)
Convert FP16 value in the low element of xmm1/m16 to a signed integer and store the result in r64 using truncation.

NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts the low FP16 element in the source operand to a signed integer in the destination general
purpose register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.

Operation
VCVTTSH2SI dest, src
IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_integer64_truncate(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_integer32_truncate(SRC.fp16[0])

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSH2SI int _mm_cvtt_roundsh_i32 (__m128h a, int sae);
VCVTTSH2SI __int64 _mm_cvtt_roundsh_i64 (__m128h a, int sae);
VCVTTSH2SI int _mm_cvttsh_i32 (__m128h a);
VCVTTSH2SI __int64 _mm_cvttsh_i64 (__m128h a);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTTSH2USI—Convert with Truncation Low FP16 Value to an Unsigned Integer
EVEX.LLIG.F3.MAP5.W0 78 /r
VCVTTSH2USI r32, xmm1/m16 {sae}
Op/En: A | 64/32 bit mode support: V/V (see note 1) | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 2)
Convert FP16 value in the low element of xmm1/m16 to an unsigned integer and store the result in r32 using truncation.

EVEX.LLIG.F3.MAP5.W1 78 /r
VCVTTSH2USI r64, xmm1/m16 {sae}
Op/En: A | 64/32 bit mode support: V/N.E. | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 2)
Convert FP16 value in the low element of xmm1/m16 to an unsigned integer and store the result in r64 using truncation.

NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts the low FP16 element in the source operand to an unsigned integer in the destination
general purpose register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.

Operation
VCVTTSH2USI dest, src
IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_unsigned_integer64_truncate(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_unsigned_integer32_truncate(SRC.fp16[0])

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSH2USI unsigned int _mm_cvtt_roundsh_u32 (__m128h a, int sae);
VCVTTSH2USI unsigned __int64 _mm_cvtt_roundsh_u64 (__m128h a, int sae);
VCVTTSH2USI unsigned int _mm_cvttsh_u32 (__m128h a);
VCVTTSH2USI unsigned __int64 _mm_cvttsh_u64 (__m128h a);
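
A hedged sketch, assuming a toolchain with _Float16 and AVX512-FP16 support (for example, a recent GCC or Clang with the matching -m flags); the input value is arbitrary:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h a = _mm_set_sh((_Float16)9.75f);
    unsigned int u = _mm_cvttsh_u32(a);   /* truncates toward zero: 9 */
    printf("%u\n", u);
    return 0;
}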

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTTSS2USI—Convert With Truncation Scalar Single Precision Floating-Point Value to
Unsigned Integer
EVEX.LLIG.F3.0F.W0 78 /r
VCVTTSS2USI r32, xmm1/m32{sae}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one single precision floating-point value from xmm1/m32 to one unsigned doubleword integer in r32 using truncation.

EVEX.LLIG.F3.0F.W1 78 /r
VCVTTSS2USI r64, xmm1/m32{sae}
Op/En: A | 64/32 bit mode support: V/N.E. (see note 2) | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one single precision floating-point value from xmm1/m32 to one unsigned quadword integer in r64 using truncation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Fixed ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts with truncation a single precision floating-point value in the source operand (the second operand) to an
unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand
(the first operand). The source operand can be an XMM register or a memory location. The destination operand is
a general-purpose register. When the source operand is an XMM register, the single precision floating-point value
is contained in the low doubleword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTSS2USI (EVEX Encoded Version)
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent


VCVTTSS2USI unsigned int _mm_cvttss_u32( __m128 a);
VCVTTSS2USI unsigned int _mm_cvtt_roundss_u32( __m128 a, int sae);
VCVTTSS2USI unsigned __int64 _mm_cvttss_u64( __m128 a);
VCVTTSS2USI unsigned __int64 _mm_cvtt_roundss_u64( __m128 a, int sae);
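
To contrast truncation with MXCSR-directed rounding, a sketch assuming AVX512F (the _mm_cvtss_u32 intrinsic maps to the separate, non-truncating VCVTSS2USI instruction; the value is arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ss(2.9f);
    /* The "tt" form always rounds toward zero, independent of MXCSR.RC. */
    printf("%u\n", _mm_cvttss_u32(a));   /* 2 */
    printf("%u\n", _mm_cvtss_u32(a));    /* 3 under default round-to-nearest */
    return 0;
}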

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTUDQ2PD—Convert Packed Unsigned Doubleword Integers to Packed Double Precision
Floating-Point Values
EVEX.128.F3.0F.W0 7A /r
VCVTUDQ2PD xmm1 {k1}{z}, xmm2/m64/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert two packed unsigned doubleword integers from xmm2/m64/m32bcst to two packed double precision floating-point values in xmm1 with writemask k1.

EVEX.256.F3.0F.W0 7A /r
VCVTUDQ2PD ymm1 {k1}{z}, xmm2/m128/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to four packed double precision floating-point values in ymm1 with writemask k1.

EVEX.512.F3.0F.W0 7A /r
VCVTUDQ2PD zmm1 {k1}{z}, ymm2/m256/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to eight packed double precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed unsigned doubleword integers in the source operand (second operand) to packed double precision
floating-point values in the destination operand (first operand).
The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location or a
256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.
Attempt to encode this instruction with EVEX embedded rounding is ignored.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUDQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTUDQ2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUDQ2PD __m512d _mm512_cvtepu32_pd( __m256i a);
VCVTUDQ2PD __m512d _mm512_mask_cvtepu32_pd( __m512d s, __mmask8 k, __m256i a);
VCVTUDQ2PD __m512d _mm512_maskz_cvtepu32_pd( __mmask8 k, __m256i a);
VCVTUDQ2PD __m256d _mm256_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m256d _mm256_mask_cvtepu32_pd( __m256d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m256d _mm256_maskz_cvtepu32_pd( __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m128d _mm_mask_cvtepu32_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_maskz_cvtepu32_pd( __mmask8 k, __m128i a);
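
Every 32-bit unsigned integer is exactly representable in binary64, which is why no SIMD floating-point exceptions are listed below; a minimal sketch assuming AVX512F (values arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned int in[8] = { 0u, 1u, 4294967295u, 7u, 8u, 9u, 10u, 11u };
    double out[8];
    __m256i src = _mm256_loadu_si256((const __m256i *)in);
    _mm512_storeu_pd(out, _mm512_cvtepu32_pd(src));   /* always exact */
    printf("%.1f %.1f %.1f\n", out[0], out[1], out[2]);  /* 0.0 1.0 4294967295.0 */
    return 0;
}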

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instructions, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTUDQ2PH—Convert Packed Unsigned Doubleword Integers to Packed FP16 Values
EVEX.128.F2.MAP5.W0 7A /r
VCVTUDQ2PH xmm1{k1}{z}, xmm2/m128/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.F2.MAP5.W0 7A /r
VCVTUDQ2PH xmm1{k1}{z}, ymm2/m256/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.F2.MAP5.W0 7A /r
VCVTUDQ2PH ymm1{k1}{z}, zmm2/m512/m32bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 1)
Convert sixteen packed unsigned doubleword integers from zmm2/m512/m32bcst to packed FP16 values, and store the result in ymm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed unsigned doubleword integers in the source operand to packed FP16 values in the
destination operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
If the conversion result overflows and MXCSR.OM=0, a SIMD exception is raised with OE=1 and PE=1.

Operation
VCVTUDQ2PH dest, src
VL = 128, 256 or 512
KL := VL / 32

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.dword[0]
ELSE
tsrc := SRC.dword[j]
DEST.fp16[j] := Convert_unsigned_integer32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0

// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUDQ2PH __m256h _mm512_cvt_roundepu32_ph (__m512i a, int rounding);
VCVTUDQ2PH __m256h _mm512_mask_cvt_roundepu32_ph (__m256h src, __mmask16 k, __m512i a, int rounding);
VCVTUDQ2PH __m256h _mm512_maskz_cvt_roundepu32_ph (__mmask16 k, __m512i a, int rounding);
VCVTUDQ2PH __m128h _mm_cvtepu32_ph (__m128i a);
VCVTUDQ2PH __m128h _mm_mask_cvtepu32_ph (__m128h src, __mmask8 k, __m128i a);
VCVTUDQ2PH __m128h _mm_maskz_cvtepu32_ph (__mmask8 k, __m128i a);
VCVTUDQ2PH __m128h _mm256_cvtepu32_ph (__m256i a);
VCVTUDQ2PH __m128h _mm256_mask_cvtepu32_ph (__m128h src, __mmask8 k, __m256i a);
VCVTUDQ2PH __m128h _mm256_maskz_cvtepu32_ph (__mmask8 k, __m256i a);
VCVTUDQ2PH __m256h _mm512_cvtepu32_ph (__m512i a);
VCVTUDQ2PH __m256h _mm512_mask_cvtepu32_ph (__m256h src, __mmask16 k, __m512i a);
VCVTUDQ2PH __m256h _mm512_maskz_cvtepu32_ph (__mmask16 k, __m512i a);

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTUDQ2PS—Convert Packed Unsigned Doubleword Integers to Packed Single Precision
Floating-Point Values
EVEX.128.F2.0F.W0 7A /r
VCVTUDQ2PS xmm1 {k1}{z}, xmm2/m128/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed single precision floating-point values in xmm1 with writemask k1.

EVEX.256.F2.0F.W0 7A /r
VCVTUDQ2PS ymm1 {k1}{z}, ymm2/m256/m32bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to packed single precision floating-point values in ymm1 with writemask k1.

EVEX.512.F2.0F.W0 7A /r
VCVTUDQ2PS zmm1 {k1}{z}, zmm2/m512/m32bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert sixteen packed unsigned doubleword integers from zmm2/m512/m32bcst to sixteen packed single precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed unsigned doubleword integers in the source operand (second operand) to single precision
floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUDQ2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking

THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTUDQ2PS (EVEX Encoded Version) When SRC Operand is a Memory Source


(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUDQ2PS __m512 _mm512_cvtepu32_ps( __m512i a);
VCVTUDQ2PS __m512 _mm512_mask_cvtepu32_ps( __m512 s, __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_maskz_cvtepu32_ps( __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_cvt_roundepu32_ps( __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_mask_cvt_roundepu32_ps( __m512 s, __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_maskz_cvt_roundepu32_ps( __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m256 _mm256_cvtepu32_ps( __m256i a);
VCVTUDQ2PS __m256 _mm256_mask_cvtepu32_ps( __m256 s, __mmask8 k, __m256i a);
VCVTUDQ2PS __m256 _mm256_maskz_cvtepu32_ps( __mmask8 k, __m256i a);
VCVTUDQ2PS __m128 _mm_cvtepu32_ps( __m128i a);
VCVTUDQ2PS __m128 _mm_mask_cvtepu32_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUDQ2PS __m128 _mm_maskz_cvtepu32_ps( __mmask8 k, __m128i a);
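
A sketch (assuming AVX512F) of the precision behavior: 4294967295 is not representable in binary32, so the result depends on the selected rounding mode:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i src = _mm512_set1_epi32(-1);   /* 0xFFFFFFFF = 4294967295 unsigned */
    float out[16];
    /* Default MXCSR round-to-nearest rounds up to 2^32. */
    _mm512_storeu_ps(out, _mm512_cvtepu32_ps(src));
    printf("%.1f\n", out[0]);   /* 4294967296.0 */
    /* Embedded rounding toward zero instead yields the next float below. */
    _mm512_storeu_ps(out, _mm512_cvt_roundepu32_ps(src,
                     _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC));
    printf("%.1f\n", out[0]);   /* 4294967040.0 */
    return 0;
}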

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTUQQ2PD—Convert Packed Unsigned Quadword Integers to Packed Double Precision
Floating-Point Values
EVEX.128.F3.0F.W1 7A /r
VCVTUQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to two packed double precision floating-point values in xmm1 with writemask k1.

EVEX.256.F3.0F.W1 7A /r
VCVTUQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed double precision floating-point values in ymm1 with writemask k1.

EVEX.512.F3.0F.W1 7A /r
VCVTUQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512DQ OR AVX10.1 (see note 1)
Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed double precision floating-point values in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed unsigned quadword integers in the source operand (second operand) to packed double precision
floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUQQ2PD (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking

DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VCVTUQQ2PD (EVEX Encoded Version) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUQQ2PD __m512d _mm512_cvtepu64_pd( __m512i a);
VCVTUQQ2PD __m512d _mm512_mask_cvtepu64_pd( __m512d s, __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_maskz_cvtepu64_pd( __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_cvt_roundepu64_pd( __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_mask_cvt_roundepu64_pd( __m512d s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_maskz_cvt_roundepu64_pd( __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m256d _mm256_cvtepu64_pd( __m256i a);
VCVTUQQ2PD __m256d _mm256_mask_cvtepu64_pd( __m256d s, __mmask8 k, __m256i a);
VCVTUQQ2PD __m256d _mm256_maskz_cvtepu64_pd( __mmask8 k, __m256i a);
VCVTUQQ2PD __m128d _mm_cvtepu64_pd( __m128i a);
VCVTUQQ2PD __m128d _mm_mask_cvtepu64_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUQQ2PD __m128d _mm_maskz_cvtepu64_pd( __mmask8 k, __m128i a);
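
A sketch (assuming AVX512DQ) of the inexact case: unsigned quadwords above 2^53 need not be representable in binary64, so the conversion can round:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 2^53 + 1 is not representable in binary64; conversion is inexact. */
    unsigned long long big = (1ULL << 53) + 1;
    __m512i src = _mm512_set1_epi64((long long)big);
    double out[8];
    _mm512_storeu_pd(out, _mm512_cvtepu64_pd(src));
    printf("%llu -> %.1f\n", big, out[0]);   /* rounds to 9007199254740992.0 */
    return 0;
}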

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTUQQ2PH—Convert Packed Unsigned Quadword Integers to Packed FP16 Values
EVEX.128.F2.MAP5.W1 7A /r
VCVTUQQ2PH xmm1{k1}{z}, xmm2/m128/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.F2.MAP5.W1 7A /r
VCVTUQQ2PH xmm1{k1}{z}, ymm2/m256/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.512.F2.MAP5.W1 7A /r
VCVTUQQ2PH xmm1{k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 1)
Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed unsigned quadword integers in the source operand to packed FP16 values in the
destination operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
If the conversion result overflows and MXCSR.OM=0, a SIMD exception is raised with OE=1 and PE=1.

Operation
VCVTUQQ2PH dest, src
VL = 128, 256 or 512
KL := VL / 64

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.qword[0]
ELSE
tsrc := SRC.qword[j]
DEST.fp16[j] := Convert_unsigned_integer64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0

// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL/4] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUQQ2PH __m128h _mm512_cvt_roundepu64_ph (__m512i a, int rounding);
VCVTUQQ2PH __m128h _mm512_mask_cvt_roundepu64_ph (__m128h src, __mmask8 k, __m512i a, int rounding);
VCVTUQQ2PH __m128h _mm512_maskz_cvt_roundepu64_ph (__mmask8 k, __m512i a, int rounding);
VCVTUQQ2PH __m128h _mm_cvtepu64_ph (__m128i a);
VCVTUQQ2PH __m128h _mm_mask_cvtepu64_ph (__m128h src, __mmask8 k, __m128i a);
VCVTUQQ2PH __m128h _mm_maskz_cvtepu64_ph (__mmask8 k, __m128i a);
VCVTUQQ2PH __m128h _mm256_cvtepu64_ph (__m256i a);
VCVTUQQ2PH __m128h _mm256_mask_cvtepu64_ph (__m128h src, __mmask8 k, __m256i a);
VCVTUQQ2PH __m128h _mm256_maskz_cvtepu64_ph (__mmask8 k, __m256i a);
VCVTUQQ2PH __m128h _mm512_cvtepu64_ph (__m512i a);
VCVTUQQ2PH __m128h _mm512_mask_cvtepu64_ph (__m128h src, __mmask8 k, __m512i a);
VCVTUQQ2PH __m128h _mm512_maskz_cvtepu64_ph (__mmask8 k, __m512i a);

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTUQQ2PS—Convert Packed Unsigned Quadword Integers to Packed Single Precision
Floating-Point Values
EVEX.128.F2.0F.W1 7A /r
VCVTUQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to packed single precision floating-point values in xmm1 with writemask k1.

EVEX.256.F2.0F.W1 7A /r
VCVTUQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed single precision floating-point values in xmm1 with writemask k1.

EVEX.512.F2.0F.W1 7A /r
VCVTUQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512DQ OR AVX10.1 (see note 1)
Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed single precision floating-point values in ymm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts packed unsigned quadword integers in the source operand (second operand) to single precision floating-
point values in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 64-bit memory location.
The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUQQ2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking

DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

VCVTUQQ2PS (EVEX Encoded Version) When SRC Operand is a Memory Source


(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUQQ2PS __m256 _mm512_cvtepu64_ps( __m512i a);
VCVTUQQ2PS __m256 _mm512_mask_cvtepu64_ps( __m256 s, __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_maskz_cvtepu64_ps( __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_cvt_roundepu64_ps( __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_mask_cvt_roundepu64_ps( __m256 s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_maskz_cvt_roundepu64_ps( __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m128 _mm256_cvtepu64_ps( __m256i a);
VCVTUQQ2PS __m128 _mm256_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm256_maskz_cvtepu64_ps( __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm_cvtepu64_ps( __m128i a);
VCVTUQQ2PS __m128 _mm_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUQQ2PS __m128 _mm_maskz_cvtepu64_ps( __mmask8 k, __m128i a);
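
A sketch (assuming AVX512DQ) showing the narrowing shape of this conversion: a full 512-bit source yields only a 256-bit result:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Eight 64-bit sources narrow to eight 32-bit floats. */
    __m512i src = _mm512_set1_epi64(1000000);
    __m256 dst = _mm512_cvtepu64_ps(src);
    float out[8];
    _mm256_storeu_ps(out, dst);
    printf("%.1f\n", out[0]);   /* 1000000.0 */
    return 0;
}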

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VCVTUSI2SD—Convert Unsigned Integer to Scalar Double Precision Floating-Point Value
EVEX.LLIG.F2.0F.W0 7B /r
VCVTUSI2SD xmm1, xmm2, r/m32
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one unsigned doubleword integer from r/m32 to one double precision floating-point value in xmm1.

EVEX.LLIG.F2.0F.W1 7B /r
VCVTUSI2SD xmm1, xmm2, r/m64{er}
Op/En: A | 64/32 bit mode support: V/N.E. (see note 2) | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one unsigned quadword integer from r/m64 to one double precision floating-point value in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the second
source operand to a double precision floating-point value in the destination operand. The result is stored in the low
quadword of the destination operand. When conversion is inexact, the value returned is rounded according to the
rounding control bits in the MXCSR register.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from the
corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.
EVEX.W0 version: attempt to encode this instruction with EVEX embedded rounding is ignored.

Operation
VCVTUSI2SD (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[63:0] := Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SD __m128d _mm_cvtu32_sd( __m128d s, unsigned a);
VCVTUSI2SD __m128d _mm_cvtu64_sd( __m128d s, unsigned __int64 a);
VCVTUSI2SD __m128d _mm_cvt_roundu64_sd( __m128d s, unsigned __int64 a, int r);
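
A sketch (assuming AVX512F) of the merge behavior described above: the upper quadword of the destination is copied from the first source operand (values arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d s = _mm_set_pd(42.0, -1.0);   /* high = 42.0, low = -1.0 */
    __m128d r = _mm_cvtu32_sd(s, 7u);     /* low := 7.0, high copied from s */
    double out[2];
    _mm_storeu_pd(out, r);
    printf("low=%.1f high=%.1f\n", out[0], out[1]);   /* low=7.0 high=42.0 */
    return 0;
}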

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
See Table 2-50, “Type E3NF Class Exception Conditions” if W1; otherwise, see Table 2-61, “Type E10NF Class
Exception Conditions.”

VCVTUSI2SS—Convert Unsigned Integer to Scalar Single Precision Floating-Point Value
EVEX.LLIG.F3.0F.W0 7B /r
VCVTUSI2SS xmm1, xmm2, r/m32{er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one unsigned doubleword integer from r/m32 to one single precision floating-point value in xmm1.

EVEX.LLIG.F3.0F.W1 7B /r
VCVTUSI2SS xmm1, xmm2, r/m64{er}
Op/En: A | 64/32 bit mode support: V/N.E. (see note 2) | CPUID feature flag: AVX512F OR AVX10.1 (see note 1)
Convert one unsigned quadword integer from r/m64 to one single precision floating-point value in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the source
operand (second operand) to a single precision floating-point value in the destination operand (first operand). The
source operand can be a general-purpose register or a memory location. The destination operand is an XMM
register. The result is stored in the low doubleword of the destination operand. When a conversion is inexact, the
value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding
control bits.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits (127:32) of the XMM register destination are copied from the
corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.

Operation
VCVTUSI2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[31:0] := Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SS __m128 _mm_cvtu32_ss( __m128 s, unsigned a);
VCVTUSI2SS __m128 _mm_cvt_roundu32_ss( __m128 s, unsigned a, int r);
VCVTUSI2SS __m128 _mm_cvtu64_ss( __m128 s, unsigned __int64 a);
VCVTUSI2SS __m128 _mm_cvt_roundu64_ss( __m128 s, unsigned __int64 a, int r);
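
A sketch (assuming AVX512F and 64-bit mode) of per-instruction embedded rounding: 2^24 + 1 is not representable in binary32, so directed rounding selects the neighbor:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned long long v = (1ULL << 24) + 1;   /* 16777217, inexact in binary32 */
    __m128 s = _mm_setzero_ps();
    __m128 up   = _mm_cvt_roundu64_ss(s, v, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    __m128 down = _mm_cvt_roundu64_ss(s, v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    printf("%.1f %.1f\n", (double)_mm_cvtss_f32(up), (double)_mm_cvtss_f32(down));
    /* 16777218.0 16777216.0 */
    return 0;
}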

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
See Table 2-50, “Type E3NF Class Exception Conditions.”

VCVTUW2PH—Convert Packed Unsigned Word Integers to FP16 Values
EVEX.128.F2.MAP5.W0 7D /r
VCVTUW2PH xmm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed unsigned word integers from xmm2/m128/m16bcst to FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.F2.MAP5.W0 7D /r
VCVTUW2PH ymm1{k1}{z}, ymm2/m256/m16bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert sixteen packed unsigned word integers from ymm2/m256/m16bcst to FP16 values, and store the result in ymm1 subject to writemask k1.

EVEX.512.F2.MAP5.W0 7D /r
VCVTUW2PH zmm1{k1}{z}, zmm2/m512/m16bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 1)
Convert thirty-two packed unsigned word integers from zmm2/m512/m16bcst to FP16 values, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed unsigned word integers in the source operand to FP16 values in the destination
operand. When conversion is inexact, the value returned is rounded according to the rounding control bits in the
MXCSR register or embedded rounding controls.
The destination elements are updated according to the writemask.
If the conversion result overflows and MXCSR.OM=0, a SIMD exception is raised with OE=1 and PE=1.

Operation
VCVTUW2PH dest, src
VL = 128, 256 or 512
KL := VL / 16

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.word[0]
ELSE
tsrc := SRC.word[j]
DEST.fp16[j] := Convert_unsigned_integer16_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VCVTUW2PH __m512h _mm512_cvt_roundepu16_ph (__m512i a, int rounding);
VCVTUW2PH __m512h _mm512_mask_cvt_roundepu16_ph (__m512h src, __mmask32 k, __m512i a, int rounding);
VCVTUW2PH __m512h _mm512_maskz_cvt_roundepu16_ph (__mmask32 k, __m512i a, int rounding);
VCVTUW2PH __m128h _mm_cvtepu16_ph (__m128i a);
VCVTUW2PH __m128h _mm_mask_cvtepu16_ph (__m128h src, __mmask8 k, __m128i a);
VCVTUW2PH __m128h _mm_maskz_cvtepu16_ph (__mmask8 k, __m128i a);
VCVTUW2PH __m256h _mm256_cvtepu16_ph (__m256i a);
VCVTUW2PH __m256h _mm256_mask_cvtepu16_ph (__m256h src, __mmask16 k, __m256i a);
VCVTUW2PH __m256h _mm256_maskz_cvtepu16_ph (__mmask16 k, __m256i a);
VCVTUW2PH __m512h _mm512_cvtepu16_ph (__m512i a);
VCVTUW2PH __m512h _mm512_mask_cvtepu16_ph (__m512h src, __mmask32 k, __m512i a);
VCVTUW2PH __m512h _mm512_maskz_cvtepu16_ph (__mmask32 k, __m512i a);
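
A hedged sketch of the overflow note above, assuming a toolchain with _Float16 and AVX512-FP16 plus AVX512VL support: 65535 exceeds the largest finite FP16 value (65504) and rounds to infinity under round-to-nearest:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i src = _mm_set1_epi16((short)0xFFFF);   /* 65535 as unsigned */
    __m128h dst = _mm_cvtepu16_ph(src);            /* overflows: OE=1, PE=1 */
    _Float16 out[8];
    _mm_storeu_ph(out, dst);
    printf("%f\n", (double)(float)out[0]);   /* inf (with OM masked) */
    return 0;
}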

SIMD Floating-Point Exceptions


Overflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VCVTW2PH—Convert Packed Signed Word Integers to FP16 Values
EVEX.128.F3.MAP5.W0 7D /r
VCVTW2PH xmm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert eight packed signed word integers from xmm2/m128/m16bcst to FP16 values, and store the result in xmm1 subject to writemask k1.

EVEX.256.F3.MAP5.W0 7D /r
VCVTW2PH ymm1{k1}{z}, ymm2/m256/m16bcst
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Convert sixteen packed signed word integers from ymm2/m256/m16bcst to FP16 values, and store the result in ymm1 subject to writemask k1.

EVEX.512.F3.MAP5.W0 7D /r
VCVTW2PH zmm1{k1}{z}, zmm2/m512/m16bcst {er}
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512-FP16 OR AVX10.1 (see note 1)
Convert thirty-two packed signed word integers from zmm2/m512/m16bcst to FP16 values, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction converts packed signed word integers in the source operand to FP16 values in the destination
operand. When conversion is inexact, the value returned is rounded according to the rounding control bits in the
MXCSR register or embedded rounding controls.
The destination elements are updated according to the writemask.

Operation
VCVTW2PH dest, src
VL = 128, 256 or 512
KL := VL / 16

IF *SRC is a register* and (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.word[0]
ELSE
tsrc := SRC.word[j]
DEST.fp16[j] := Convert_integer16_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTW2PH __m512h _mm512_cvt_roundepi16_ph (__m512i a, int rounding);
VCVTW2PH __m512h _mm512_mask_cvt_roundepi16_ph (__m512h src, __mmask32 k, __m512i a, int rounding);
VCVTW2PH __m512h _mm512_maskz_cvt_roundepi16_ph (__mmask32 k, __m512i a, int rounding);
VCVTW2PH __m128h _mm_cvtepi16_ph (__m128i a);
VCVTW2PH __m128h _mm_mask_cvtepi16_ph (__m128h src, __mmask8 k, __m128i a);
VCVTW2PH __m128h _mm_maskz_cvtepi16_ph (__mmask8 k, __m128i a);
VCVTW2PH __m256h _mm256_cvtepi16_ph (__m256i a);
VCVTW2PH __m256h _mm256_mask_cvtepi16_ph (__m256h src, __mmask16 k, __m256i a);
VCVTW2PH __m256h _mm256_maskz_cvtepi16_ph (__mmask16 k, __m256i a);
VCVTW2PH __m512h _mm512_cvtepi16_ph (__m512i a);
VCVTW2PH __m512h _mm512_mask_cvtepi16_ph (__m512h src, __mmask32 k, __m512i a);
VCVTW2PH __m512h _mm512_maskz_cvtepi16_ph (__mmask32 k, __m512i a);

SIMD Floating-Point Exceptions


Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VDBPSADBW—Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes
EVEX.128.66.0F3A.W0 42 /r ib
VDBPSADBW xmm1 {k1}{z}, xmm2, xmm3/m128, imm8
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 1)
Compute packed SAD word results of unsigned bytes in dword block from xmm2 with unsigned bytes of dword blocks transformed from xmm3/m128 using the shuffle controls in imm8. Results are written to xmm1 under the writemask k1.

EVEX.256.66.0F3A.W0 42 /r ib
VDBPSADBW ymm1 {k1}{z}, ymm2, ymm3/m256, imm8
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: (AVX512VL AND AVX512BW) OR AVX10.1 (see note 1)
Compute packed SAD word results of unsigned bytes in dword block from ymm2 with unsigned bytes of dword blocks transformed from ymm3/m256 using the shuffle controls in imm8. Results are written to ymm1 under the writemask k1.

EVEX.512.66.0F3A.W0 42 /r ib
VDBPSADBW zmm1 {k1}{z}, zmm2, zmm3/m512, imm8
Op/En: A | 64/32 bit mode support: V/V | CPUID feature flag: AVX512BW OR AVX10.1 (see note 1)
Compute packed SAD word results of unsigned bytes in dword block from zmm2 with unsigned bytes of dword blocks transformed from zmm3/m512 using the shuffle controls in imm8. Results are written to zmm1 under the writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword
elements. Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results
in each 64-bit superblock of the destination register.
Within each superblock of packed word results, the SAD results from two 32-bit dword
elements are calculated as follows:
• Each of the lower two word results is calculated from the SAD operation between a
sliding dword element within a qword superblock of an intermediate vector and a
stationary dword element in the corresponding qword superblock of the first source
operand. The intermediate vector, see “Tmp1” in Figure 5-8, is constructed from the
second source operand, with the imm8 byte serving as shuffle control to select dword
elements within a 128-bit lane of the second source operand. The two sliding dword
elements in a qword superblock of Tmp1 are located at byte offsets 0 and 1 within the
superblock, respectively. The stationary dword element in the qword superblock from the
first source operand is located at byte offset 0.
• Each of the next two word results is calculated from the SAD operation between a
sliding dword element within a qword superblock of the intermediate vector Tmp1 and a
second stationary dword element in the corresponding qword superblock of the first
source operand. The two sliding dword elements in a qword superblock of Tmp1 are located
at byte offsets 2 and 3 within the superblock, respectively. The stationary dword
element in the qword superblock from the first source operand is located at byte offset 4.
• The intermediate vector is constructed in 128-bit lanes. Within each 128-bit lane,
each dword element of the intermediate vector is selected by a two-bit field within the
imm8 byte from the corresponding 128 bits of the second source operand. The imm8 byte
serves as dword shuffle control within each 128-bit lane of the intermediate vector and
the second source operand, similarly to PSHUFD.



The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, or
a 512/256/128-bit memory location. The destination operand is conditionally updated based on writemask k1 at
16-bit word granularity.

[Figure: each two-bit field of imm8 selects one of the four dwords (00B: DW0, 01B: DW1,
10B: DW2, 11B: DW3) of the corresponding 128-bit lane of Src2 to form each dword of the
intermediate vector Tmp1. Sliding dwords of Tmp1, taken at byte offsets 0, 1, 2, and 3
of each qword superblock, are SAD-ed (four byte-wise absolute differences, summed)
against the stationary dwords of Src1 at byte offsets 0 and 4, producing the four word
results of each destination qword superblock.]
Figure 5-8. 64-bit Super Block of SAD Operation in VDBPSADBW
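As a concrete illustration of the data flow above, a minimal C sketch using the intrinsic form listed later in this section (assumes AVX512BW and AVX512VL support; the immediate must be a compile-time constant):

#include <immintrin.h>

/* With imm8 = 0x00, every Tmp1 dword is DW0 of the corresponding src2 lane, so all
   four word results per qword superblock are SADs against that one reference dword. */
__m128i dbsad_dw0(__m128i src1, __m128i src2)
{
    return _mm_dbsad_epu8(src1, src2, 0x00);
}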

Operation
VDBPSADBW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
Selection of quadruplets:
FOR I = 0 to VL step 128
TMP1[I+31:I] := select (SRC2[I+127: I], imm8[1:0])
TMP1[I+63: I+32] := select (SRC2[I+127: I], imm8[3:2])
TMP1[I+95: I+64] := select (SRC2[I+127: I], imm8[5:4])
TMP1[I+127: I+96] := select (SRC2[I+127: I], imm8[7:6])
END FOR

SAD of quadruplets:
FOR I = 0 to VL step 64
    TMP_DEST[I+15:I] := ABS(SRC1[I+7:I] - TMP1[I+7:I]) +
        ABS(SRC1[I+15:I+8] - TMP1[I+15:I+8]) +
        ABS(SRC1[I+23:I+16] - TMP1[I+23:I+16]) +
        ABS(SRC1[I+31:I+24] - TMP1[I+31:I+24])
    TMP_DEST[I+31:I+16] := ABS(SRC1[I+7:I] - TMP1[I+15:I+8]) +
        ABS(SRC1[I+15:I+8] - TMP1[I+23:I+16]) +
        ABS(SRC1[I+23:I+16] - TMP1[I+31:I+24]) +
        ABS(SRC1[I+31:I+24] - TMP1[I+39:I+32])
    TMP_DEST[I+47:I+32] := ABS(SRC1[I+39:I+32] - TMP1[I+23:I+16]) +
        ABS(SRC1[I+47:I+40] - TMP1[I+31:I+24]) +
        ABS(SRC1[I+55:I+48] - TMP1[I+39:I+32]) +
        ABS(SRC1[I+63:I+56] - TMP1[I+47:I+40])
    TMP_DEST[I+63:I+48] := ABS(SRC1[I+39:I+32] - TMP1[I+31:I+24]) +
        ABS(SRC1[I+47:I+40] - TMP1[I+39:I+32]) +
        ABS(SRC1[I+55:I+48] - TMP1[I+47:I+40]) +
        ABS(SRC1[I+63:I+56] - TMP1[I+55:I+48])
ENDFOR

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VDBPSADBW __m512i _mm512_dbsad_epu8(__m512i a, __m512i b, int imm8);
VDBPSADBW __m512i _mm512_mask_dbsad_epu8(__m512i s, __mmask32 m, __m512i a, __m512i b, int imm8);
VDBPSADBW __m512i _mm512_maskz_dbsad_epu8(__mmask32 m, __m512i a, __m512i b, int imm8);
VDBPSADBW __m256i _mm256_dbsad_epu8(__m256i a, __m256i b, int imm8);
VDBPSADBW __m256i _mm256_mask_dbsad_epu8(__m256i s, __mmask16 m, __m256i a, __m256i b, int imm8);
VDBPSADBW __m256i _mm256_maskz_dbsad_epu8(__mmask16 m, __m256i a, __m256i b, int imm8);
VDBPSADBW __m128i _mm_dbsad_epu8(__m128i a, __m128i b, int imm8);
VDBPSADBW __m128i _mm_mask_dbsad_epu8(__m128i s, __mmask8 m, __m128i a, __m128i b, int imm8);
VDBPSADBW __m128i _mm_maskz_dbsad_epu8(__mmask8 m, __m128i a, __m128i b, int imm8);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



VDIVPH—Divide Packed FP16 Values
EVEX.128.NP.MAP5.W0 5E /r
VDIVPH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Divide packed FP16 values in xmm2 by packed FP16 values in xmm3/m128/m16bcst, and store
the result in xmm1 subject to writemask k1.

EVEX.256.NP.MAP5.W0 5E /r
VDIVPH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Divide packed FP16 values in ymm2 by packed FP16 values in ymm3/m256/m16bcst, and store
the result in ymm1 subject to writemask k1.

EVEX.512.NP.MAP5.W0 5E /r
VDIVPH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Divide packed FP16 values in zmm2 by packed FP16 values in zmm3/m512/m16bcst, and store
the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction divides packed FP16 values from the first source operand by the corresponding elements in the
second source operand, storing the packed FP16 result in the destination operand. The destination elements are
updated according to the writemask.
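A minimal usage sketch in C (illustrative only; assumes a compiler supporting the AVX512-FP16 and AVX512VL intrinsics listed below):

#include <immintrin.h>

/* Merge-masked divide: lanes with a zero bit in k keep their value from 'src'. */
__m128h div_ph_merge(__m128h src, __mmask8 k, __m128h a, __m128h b)
{
    return _mm_mask_div_ph(src, k, a, b);
}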

Operation
VDIVPH (EVEX Encoded Versions) When SRC2 Operand is a Register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[j]
    ELSE IF *zeroing*:
        DEST.fp16[j] := 0
    // else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VDIVPH (EVEX Encoded Versions) When SRC2 Operand is a Memory Source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        IF EVEX.b = 1:
            DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[0]
        ELSE:
            DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[j]
    ELSE IF *zeroing*:
        DEST.fp16[j] := 0
    // else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VDIVPH __m128h _mm_div_ph (__m128h a, __m128h b);
VDIVPH __m128h _mm_mask_div_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VDIVPH __m128h _mm_maskz_div_ph (__mmask8 k, __m128h a, __m128h b);
VDIVPH __m256h _mm256_div_ph (__m256h a, __m256h b);
VDIVPH __m256h _mm256_mask_div_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VDIVPH __m256h _mm256_maskz_div_ph (__mmask16 k, __m256h a, __m256h b);
VDIVPH __m512h _mm512_div_ph (__m512h a, __m512h b);
VDIVPH __m512h _mm512_mask_div_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VDIVPH __m512h _mm512_maskz_div_ph (__mmask32 k, __m512h a, __m512h b);
VDIVPH __m512h _mm512_div_round_ph (__m512h a, __m512h b, int rounding);
VDIVPH __m512h _mm512_mask_div_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int rounding);
VDIVPH __m512h _mm512_maskz_div_round_ph (__mmask32 k, __m512h a, __m512h b, int rounding);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal, Zero.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VDIVSH—Divide Scalar FP16 Values
EVEX.LLIG.F3.MAP5.W0 5E /r
VDIVSH xmm1{k1}{z}, xmm2, xmm3/m16 {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Divide low FP16 value in xmm2 by low FP16 value in xmm3/m16, and store the result in
xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction divides the low FP16 value from the first source operand by the corresponding value in the second
source operand, storing the FP16 result in the destination operand. Bits 127:16 of the destination operand are
copied from the corresponding bits of the first source operand. Bits MAXVL-1:128 of the destination operand are
zeroed. The low FP16 element of the destination is updated according to the writemask.

Operation
VDIVSH (EVEX Encoded Versions)
IF EVEX.b = 1 and SRC2 is a register:
    SET_RM(EVEX.RC)
ELSE:
    SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:
    DEST.fp16[0] := SRC1.fp16[0] / SRC2.fp16[0]
ELSE IF *zeroing*:
    DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VDIVSH __m128h _mm_div_round_sh (__m128h a, __m128h b, int rounding);
VDIVSH __m128h _mm_mask_div_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int rounding);
VDIVSH __m128h _mm_maskz_div_round_sh (__mmask8 k, __m128h a, __m128h b, int rounding);
VDIVSH __m128h _mm_div_sh (__m128h a, __m128h b);
VDIVSH __m128h _mm_mask_div_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VDIVSH __m128h _mm_maskz_div_sh (__mmask8 k, __m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal, Zero.



Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VDPBF16PS—Dot Product of BF16 Pairs Accumulated Into Packed Single Precision
EVEX.128.F3.0F38.W0 52 /r
VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (Note 1).
Multiply BF16 pairs from xmm2 and xmm3/m128, and accumulate the resulting packed single
precision results in xmm1 with writemask k1.

EVEX.256.F3.0F38.W0 52 /r
VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512_BF16 AND AVX512VL) OR AVX10.1 (Note 1).
Multiply BF16 pairs from ymm2 and ymm3/m256, and accumulate the resulting packed single
precision results in ymm1 with writemask k1.

EVEX.512.F3.0F38.W0 52 /r
VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512_BF16 AND AVX512F) OR AVX10.1 (Note 1).
Multiply BF16 pairs from zmm2 and zmm3/m512, and accumulate the resulting packed single
precision results in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD dot-product of two BF16 pairs and accumulates into a packed single precision
register.
“Round to nearest even” rounding mode is used for each accumulation of the FMA. Output denormals are
always flushed to zero and input denormals are always treated as zero. MXCSR is neither consulted nor updated.

NaN propagation priorities are described in Table 5-4.

Table 5-4. NaN Propagation Priorities


NaN Priority  Description        Comments
1             src1 low is NaN    The lower part has priority over the upper part,
2             src2 low is NaN    i.e., it overrides the upper part.
3             src1 high is NaN   The upper part may be overridden if the lower part
4             src2 high is NaN   has a NaN.
5             srcdest is NaN     The destination is propagated only if no NaN is
                                 encountered in src1 or src2.
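The per-element arithmetic can be modeled in scalar C as follows (a sketch of one FP32 accumulator update, mirroring the Operation section below; unlike the hardware, it does not treat denormal inputs as zero or flush denormal outputs):

#include <stdint.h>
#include <string.h>

/* A bfloat16 bit pattern placed in the upper half of a dword is a valid float. */
static float make_fp32(uint16_t bf16)
{
    uint32_t dword = (uint32_t)bf16 << 16;
    float f;
    memcpy(&f, &dword, sizeof f);
    return f;
}

/* acc += a[1]*b[1] + a[0]*b[0], accumulating the odd (high) pair first. */
float dpbf16_element(float acc, const uint16_t a[2], const uint16_t b[2])
{
    acc += make_fp32(a[1]) * make_fp32(b[1]);
    acc += make_fp32(a[0]) * make_fp32(b[0]);
    return acc;
}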

Operation
Define make_fp32(x):
// The x parameter is bfloat16. Pack it into the upper 16 bits of a dword. The bit pattern is a legal fp32 value. Return that bit pattern.
dword := 0
dword[31:16] := x
RETURN dword

VDPBF16PS srcdest, src1, src2
VL = (128, 256, 512)
KL = VL/32

origdest := srcdest
FOR i := 0 to KL-1:
    IF k1[ i ] or *no writemask*:
        IF src2 is memory and evex.b == 1:
            t := src2.dword[0]
        ELSE:
            t := src2.dword[ i ]
        // FP32 FMA with DAZ in, FTZ out, and RNE rounding. MXCSR is neither consulted nor updated.
        srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+1]) * make_fp32(t.bfloat16[1])
        srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+0]) * make_fp32(t.bfloat16[0])
    ELSE IF *zeroing*:
        srcdest.dword[ i ] := 0
    ELSE: // merge masking, dest element unchanged
        srcdest.dword[ i ] := origdest.dword[ i ]

srcdest[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VDPBF16PS __m128 _mm_dpbf16_ps(__m128, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_mask_dpbf16_ps( __m128, __mmask8, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_maskz_dpbf16_ps(__mmask8, __m128, __m128bh, __m128bh);
VDPBF16PS __m256 _mm256_dpbf16_ps(__m256, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_mask_dpbf16_ps(__m256, __mmask8, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_maskz_dpbf16_ps(__mmask8, __m256, __m256bh, __m256bh);
VDPBF16PS __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_mask_dpbf16_ps(__m512, __mmask16, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_maskz_dpbf16_ps(__mmask16, __m512, __m512bh, __m512bh);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VEXPANDPD—Load Sparse Packed Double Precision Floating-Point Values From Dense Memory
EVEX.128.66.0F38.W1 88 /r
VEXPANDPD xmm1 {k1}{z}, xmm2/m128
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Expand packed double precision floating-point values from xmm2/m128 to xmm1 using
writemask k1.

EVEX.256.66.0F38.W1 88 /r
VEXPANDPD ymm1 {k1}{z}, ymm2/m256
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Expand packed double precision floating-point values from ymm2/m256 to ymm1 using
writemask k1.

EVEX.512.66.0F38.W1 88 /r
VEXPANDPD zmm1 {k1}{z}, zmm2/m512
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Expand packed double precision floating-point values from zmm2/m512 to zmm1 using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Expand (load) up to 8/4/2 contiguous double precision floating-point values of the input
vector in the source operand (the second operand) to sparse elements in the destination
operand (the first operand) selected by the writemask k1.
The destination operand is a ZMM/YMM/XMM register, the source operand can be a ZMM/YMM/XMM register or a
512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask register k1 selects the desti-
nation elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
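A minimal usage sketch in C (illustrative only; assumes a compiler supporting the AVX512F intrinsics listed below):

#include <immintrin.h>

/* The low popcount(k) doubles at 'src' are placed, in order, into the lanes selected
   by k; the remaining lanes are zeroed by the {z} form. */
__m512d expand_into_lanes(const double *src, __mmask8 k)
{
    return _mm512_maskz_expandloadu_pd(k, src);
}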

Operation
VEXPANDPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+63:i] := SRC[k+63:k];
k := k + 64
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VEXPANDPD __m512d _mm512_mask_expand_pd( __m512d s, __mmask8 k, __m512d a);
VEXPANDPD __m512d _mm512_maskz_expand_pd( __mmask8 k, __m512d a);
VEXPANDPD __m512d _mm512_mask_expandloadu_pd( __m512d s, __mmask8 k, void * a);
VEXPANDPD __m512d _mm512_maskz_expandloadu_pd( __mmask8 k, void * a);
VEXPANDPD __m256d _mm256_mask_expand_pd( __m256d s, __mmask8 k, __m256d a);
VEXPANDPD __m256d _mm256_maskz_expand_pd( __mmask8 k, __m256d a);
VEXPANDPD __m256d _mm256_mask_expandloadu_pd( __m256d s, __mmask8 k, void * a);
VEXPANDPD __m256d _mm256_maskz_expandloadu_pd( __mmask8 k, void * a);
VEXPANDPD __m128d _mm_mask_expand_pd( __m128d s, __mmask8 k, __m128d a);
VEXPANDPD __m128d _mm_maskz_expand_pd( __mmask8 k, __m128d a);
VEXPANDPD __m128d _mm_mask_expandloadu_pd( __m128d s, __mmask8 k, void * a);
VEXPANDPD __m128d _mm_maskz_expandloadu_pd( __mmask8 k, void * a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VEXPANDPS—Load Sparse Packed Single Precision Floating-Point Values From Dense Memory
EVEX.128.66.0F38.W0 88 /r
VEXPANDPS xmm1 {k1}{z}, xmm2/m128
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Expand packed single precision floating-point values from xmm2/m128 to xmm1 using
writemask k1.

EVEX.256.66.0F38.W0 88 /r
VEXPANDPS ymm1 {k1}{z}, ymm2/m256
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Expand packed single precision floating-point values from ymm2/m256 to ymm1 using
writemask k1.

EVEX.512.66.0F38.W0 88 /r
VEXPANDPS zmm1 {k1}{z}, zmm2/m512
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Expand packed single precision floating-point values from zmm2/m512 to zmm1 using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Expand (load) up to 16/8/4 contiguous single precision floating-point values of the
input vector in the source operand (the second operand) to sparse elements of the
destination operand (the first operand) selected by the writemask k1.
The destination operand is a ZMM/YMM/XMM register, the source operand can be a ZMM/YMM/XMM register or a
512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask k1 selects the destination
elements (a partial vector or sparse elements if less than 16 elements) to be replaced by the ascending elements
in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed,
depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.

Operation
VEXPANDPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := SRC[k+31:k];
k := k + 32
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VEXPANDPS __m512 _mm512_mask_expand_ps( __m512 s, __mmask16 k, __m512 a);
VEXPANDPS __m512 _mm512_maskz_expand_ps( __mmask16 k, __m512 a);
VEXPANDPS __m512 _mm512_mask_expandloadu_ps( __m512 s, __mmask16 k, void * a);
VEXPANDPS __m512 _mm512_maskz_expandloadu_ps( __mmask16 k, void * a);
VEXPANDPS __m256 _mm256_mask_expand_ps( __m256 s, __mmask8 k, __m256 a);
VEXPANDPS __m256 _mm256_maskz_expand_ps( __mmask8 k, __m256 a);
VEXPANDPS __m256 _mm256_mask_expandloadu_ps( __m256 s, __mmask8 k, void * a);
VEXPANDPS __m256 _mm256_maskz_expandloadu_ps( __mmask8 k, void * a);
VEXPANDPS __m128 _mm_mask_expand_ps( __m128 s, __mmask8 k, __m128 a);
VEXPANDPS __m128 _mm_maskz_expand_ps( __mmask8 k, __m128 a);
VEXPANDPS __m128 _mm_mask_expandloadu_ps( __m128 s, __mmask8 k, void * a);
VEXPANDPS __m128 _mm_maskz_expandloadu_ps( __mmask8 k, void * a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—
Extract Packed Floating-Point Values
VEX.256.66.0F3A.W0 19 /r ib
VEXTRACTF128 xmm1/m128, ymm2, imm8
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX.
Extract 128 bits of packed floating-point values from ymm2 and store results in
xmm1/m128.

EVEX.256.66.0F3A.W0 19 /r ib
VEXTRACTF32X4 xmm1/m128 {k1}{z}, ymm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Extract 128 bits of packed single precision floating-point values from ymm2 and store
results in xmm1/m128 subject to writemask k1.

EVEX.512.66.0F3A.W0 19 /r ib
VEXTRACTF32x4 xmm1/m128 {k1}{z}, zmm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Extract 128 bits of packed single precision floating-point values from zmm2 and store
results in xmm1/m128 subject to writemask k1.

EVEX.256.66.0F3A.W1 19 /r ib
VEXTRACTF64X2 xmm1/m128 {k1}{z}, ymm2, imm8
Op/En: B. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1).
Extract 128 bits of packed double precision floating-point values from ymm2 and store
results in xmm1/m128 subject to writemask k1.

EVEX.512.66.0F3A.W1 19 /r ib
VEXTRACTF64X2 xmm1/m128 {k1}{z}, zmm2, imm8
Op/En: B. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1).
Extract 128 bits of packed double precision floating-point values from zmm2 and store
results in xmm1/m128 subject to writemask k1.

EVEX.512.66.0F3A.W0 1B /r ib
VEXTRACTF32X8 ymm1/m256 {k1}{z}, zmm2, imm8
Op/En: D. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1).
Extract 256 bits of packed single precision floating-point values from zmm2 and store
results in ymm1/m256 subject to writemask k1.

EVEX.512.66.0F3A.W1 1B /r ib
VEXTRACTF64x4 ymm1/m256 {k1}{z}, zmm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Extract 256 bits of packed double precision floating-point values from zmm2 and store
results in ymm1/m256 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
B Tuple2 ModRM:r/m (w) ModRM:reg (r) imm8 N/A
C Tuple4 ModRM:r/m (w) ModRM:reg (r) imm8 N/A
D Tuple8 ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
VEXTRACTF128, VEXTRACTF32x4, and VEXTRACTF64x2 extract 128 bits of packed floating-point
values from the source operand (the second operand) and store them to the low 128 bits
of the destination operand (the first operand). The 128-bit data extraction occurs at a
128-bit granular offset specified by imm8[0] (for a 256-bit source) or imm8[1:0] (for a
512-bit source) as the multiply factor. The destination may be either a vector register
or a 128-bit memory location.
VEXTRACTF32x4: The low 128 bits of the destination operand are updated at 32-bit
granularity according to the writemask.
VEXTRACTF32x8 and VEXTRACTF64x4 extract 256 bits of packed floating-point values from
the source operand (the second operand) and store them to the low 256 bits of the
destination operand (the first operand). The 256-bit data extraction occurs at a 256-bit
granular offset specified by imm8[0] as the multiply factor. The destination may be
either a vector register or a 256-bit memory location.
VEXTRACTF64x4: The low 256 bits of the destination operand are updated at 64-bit
granularity according to the writemask.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.
The high 7 bits (6 bits in EVEX.512) of the immediate are ignored.
An attempt to execute VEXTRACTF128 encoded with VEX.L = 0 will cause an #UD exception.
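A minimal usage sketch in C (illustrative only; assumes a compiler supporting the AVX512F intrinsics listed below; the index must be a compile-time constant):

#include <immintrin.h>

/* nidx = 2 selects bits 383:256, the third 128-bit chunk of the 512-bit source. */
__m128 extract_third_chunk(__m512 v)
{
    return _mm512_extractf32x4_ps(v, 2);
}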

Operation
VEXTRACTF32x4 (EVEX Encoded Versions) When Destination is a Register
VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0

VEXTRACTF32x4 (EVEX Encoded Versions) When Destination is Memory


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.



FI;

FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTF64x2 (EVEX Encoded Versions) When Destination is a Register


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;

FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0

VEXTRACTF64x2 (EVEX Encoded Versions) When Destination is Memory


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]



11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;

FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTF32x8 (EVEX.U1.512 Encoded Version) When Destination is a Register


VL = 512
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0

VEXTRACTF32x8 (EVEX.U1.512 Encoded Version) When Destination is Memory


CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTF64x4 (EVEX.512 Encoded Version) When Destination is a Register


VL = 512
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.



FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0

VEXTRACTF64x4 (EVEX.512 Encoded Version) When Destination is Memory


CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE ; merging-masking
*DEST[i+63:i] remains unchanged*
FI;
ENDFOR

VEXTRACTF128 (Memory Destination Form)


CASE (imm8[0]) OF
0: DEST[127:0] := SRC1[127:0]
1: DEST[127:0] := SRC1[255:128]
ESAC.

VEXTRACTF128 (Register Destination Form)


CASE (imm8[0]) OF
0: DEST[127:0] := SRC1[127:0]
1: DEST[127:0] := SRC1[255:128]
ESAC.
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VEXTRACTF32x4 __m128 _mm512_extractf32x4_ps(__m512 a, const int nidx);
VEXTRACTF32x4 __m128 _mm512_mask_extractf32x4_ps(__m128 s, __mmask8 k, __m512 a, const int nidx);
VEXTRACTF32x4 __m128 _mm512_maskz_extractf32x4_ps( __mmask8 k, __m512 a, const int nidx);
VEXTRACTF32x4 __m128 _mm256_extractf32x4_ps(__m256 a, const int nidx);
VEXTRACTF32x4 __m128 _mm256_mask_extractf32x4_ps(__m128 s, __mmask8 k, __m256 a, const int nidx);
VEXTRACTF32x4 __m128 _mm256_maskz_extractf32x4_ps( __mmask8 k, __m256 a, const int nidx);
VEXTRACTF32x8 __m256 _mm512_extractf32x8_ps(__m512 a, const int nidx);
VEXTRACTF32x8 __m256 _mm512_mask_extractf32x8_ps(__m256 s, __mmask8 k, __m512 a, const int nidx);
VEXTRACTF32x8 __m256 _mm512_maskz_extractf32x8_ps( __mmask8 k, __m512 a, const int nidx);
VEXTRACTF64x2 __m128d _mm512_extractf64x2_pd(__m512d a, const int nidx);
VEXTRACTF64x2 __m128d _mm512_mask_extractf64x2_pd(__m128d s, __mmask8 k, __m512d a, const int nidx);



VEXTRACTF64x2 __m128d _mm512_maskz_extractf64x2_pd( __mmask8 k, __m512d a, const int nidx);
VEXTRACTF64x2 __m128d _mm256_extractf64x2_pd(__m256d a, const int nidx);
VEXTRACTF64x2 __m128d _mm256_mask_extractf64x2_pd(__m128d s, __mmask8 k, __m256d a, const int nidx);
VEXTRACTF64x2 __m128d _mm256_maskz_extractf64x2_pd( __mmask8 k, __m256d a, const int nidx);
VEXTRACTF64x4 __m256d _mm512_extractf64x4_pd( __m512d a, const int nidx);
VEXTRACTF64x4 __m256d _mm512_mask_extractf64x4_pd(__m256d s, __mmask8 k, __m512d a, const int nidx);
VEXTRACTF64x4 __m256d _mm512_maskz_extractf64x4_pd( __mmask8 k, __m512d a, const int nidx);
VEXTRACTF128 __m128 _mm256_extractf128_ps (__m256 a, int offset);
VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset);
VEXTRACTF128 __m128i _mm256_extractf128_si256(__m256i a, int offset);

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-56, “Type E6NF Class Exception Conditions.”
Additionally:
#UD IF VEX.L = 0.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



VEXTRACTI128/VEXTRACTI32x4/VEXTRACTI64x2/VEXTRACTI32x8/VEXTRACTI64x4—Extract
Packed Integer Values
VEX.256.66.0F3A.W0 39 /r ib
VEXTRACTI128 xmm1/m128, ymm2, imm8
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX2.
Extract 128 bits of integer data from ymm2 and store results in xmm1/m128.

EVEX.256.66.0F3A.W0 39 /r ib
VEXTRACTI32X4 xmm1/m128 {k1}{z}, ymm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1).
Extract 128 bits of double-word integer values from ymm2 and store results in xmm1/m128
subject to writemask k1.

EVEX.512.66.0F3A.W0 39 /r ib
VEXTRACTI32x4 xmm1/m128 {k1}{z}, zmm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Extract 128 bits of double-word integer values from zmm2 and store results in xmm1/m128
subject to writemask k1.

EVEX.256.66.0F3A.W1 39 /r ib
VEXTRACTI64X2 xmm1/m128 {k1}{z}, ymm2, imm8
Op/En: B. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1).
Extract 128 bits of quad-word integer values from ymm2 and store results in xmm1/m128
subject to writemask k1.

EVEX.512.66.0F3A.W1 39 /r ib
VEXTRACTI64X2 xmm1/m128 {k1}{z}, zmm2, imm8
Op/En: B. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1).
Extract 128 bits of quad-word integer values from zmm2 and store results in xmm1/m128
subject to writemask k1.

EVEX.512.66.0F3A.W0 3B /r ib
VEXTRACTI32X8 ymm1/m256 {k1}{z}, zmm2, imm8
Op/En: D. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1).
Extract 256 bits of double-word integer values from zmm2 and store results in ymm1/m256
subject to writemask k1.

EVEX.512.66.0F3A.W1 3B /r ib
VEXTRACTI64x4 ymm1/m256 {k1}{z}, zmm2, imm8
Op/En: C. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1).
Extract 256 bits of quad-word integer values from zmm2 and store results in ymm1/m256
subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:r/m (w) ModRM:reg (r) imm8 N/A
B Tuple2 ModRM:r/m (w) ModRM:reg (r) imm8 N/A
C Tuple4 ModRM:r/m (w) ModRM:reg (r) imm8 N/A
D Tuple8 ModRM:r/m (w) ModRM:reg (r) imm8 N/A

Description
VEXTRACTI128, VEXTRACTI32x4, and VEXTRACTI64x2 extract 128 bits of packed integer values
from the source operand (the second operand) and store them to the low 128 bits of the
destination operand (the first operand). The 128-bit data extraction occurs at a 128-bit
granular offset specified by imm8[0] (for a 256-bit source) or imm8[1:0] (for a 512-bit
source) as the multiply factor. The destination may be either a vector register or a
128-bit memory location.
VEXTRACTI32x4: The low 128 bits of the destination operand are updated at 32-bit
granularity according to the writemask.
VEXTRACTI64x2: The low 128 bits of the destination operand are updated at 64-bit
granularity according to the writemask.
VEXTRACTI32x8 and VEXTRACTI64x4 extract 256 bits of packed integer values from the
source operand (the second operand) and store them to the low 256 bits of the
destination operand (the first operand). The 256-bit data extraction occurs at a 256-bit
granular offset specified by imm8[0] as the multiply factor. The destination may be
either a vector register or a 256-bit memory location.
VEXTRACTI32x8: The low 256 bits of the destination operand are updated at 32-bit
granularity according to the writemask.
VEXTRACTI64x4: The low 256 bits of the destination operand are updated at 64-bit
granularity according to the writemask.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.
The high 7 bits (6 bits in EVEX.512) of the immediate are ignored.
An attempt to execute VEXTRACTI128 encoded with VEX.L = 0 will cause an #UD exception.

Operation
VEXTRACTI32x4 (EVEX encoded versions) when destination is a register
VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0

VEXTRACTI32x4 (EVEX encoded versions) when destination is memory


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]



01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;

FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTI64x2 (EVEX encoded versions) when destination is a register


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;

FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0

VEXTRACTI64x2 (EVEX encoded versions) when destination is memory


VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512



CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;

FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTI32x8 (EVEX.U1.512 encoded version) when destination is a register


VL = 512
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0

VEXTRACTI32x8 (EVEX.U1.512 encoded version) when destination is memory


CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR



VEXTRACTI64x4 (EVEX.512 encoded version) when destination is a register
VL = 512
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.

FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0

VEXTRACTI64x4 (EVEX.512 encoded version) when destination is memory


CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC1[255:0]
1: TMP_DEST[255:0] := SRC1[511:256]
ESAC.
FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VEXTRACTI128 (memory destination form)


CASE (imm8[0]) OF
0: DEST[127:0] := SRC1[127:0]
1: DEST[127:0] := SRC1[255:128]
ESAC.

VEXTRACTI128 (register destination form)


CASE (imm8[0]) OF
0: DEST[127:0] := SRC1[127:0]
1: DEST[127:0] := SRC1[255:128]
ESAC.
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VEXTRACTI32x4 __m128i _mm512_extracti32x4_epi32(__m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm512_mask_extracti32x4_epi32(__m128i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm512_maskz_extracti32x4_epi32( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_extracti32x4_epi32(__m256i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_mask_extracti32x4_epi32(__m128i s, __mmask8 k, __m256i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_maskz_extracti32x4_epi32( __mmask8 k, __m256i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_extracti32x8_epi32(__m512i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_mask_extracti32x8_epi32(__m256i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_maskz_extracti32x8_epi32( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_extracti64x2_epi64(__m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_mask_extracti64x2_epi64(__m128i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_maskz_extracti64x2_epi64( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_extracti64x2_epi64(__m256i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_mask_extracti64x2_epi64(__m128i s, __mmask8 k, __m256i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_maskz_extracti64x2_epi64( __mmask8 k, __m256i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_extracti64x4_epi64(__m512i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_mask_extracti64x4_epi64(__m256i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_maskz_extracti64x4_epi64( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI128 __m128i _mm256_extracti128_si256(__m256i a, int offset);

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-56, “Type E6NF Class Exception Conditions.”
Additionally:
#UD IF VEX.L = 0.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.



VFCMADDCPH/VFMADDCPH—Complex Multiply and Accumulate FP16 Values
EVEX.128.F2.MAP6.W0 56 /r
VFCMADDCPH xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from xmm2 and the complex conjugate of
xmm3/m128/m32bcst, add to xmm1 and store the result in xmm1 subject to writemask k1.

EVEX.256.F2.MAP6.W0 56 /r
VFCMADDCPH ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from ymm2 and the complex conjugate of
ymm3/m256/m32bcst, add to ymm1 and store the result in ymm1 subject to writemask k1.

EVEX.512.F2.MAP6.W0 56 /r
VFCMADDCPH zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from zmm2 and the complex conjugate of
zmm3/m512/m32bcst, add to zmm1 and store the result in zmm1 subject to writemask k1.

EVEX.128.F3.MAP6.W0 56 /r
VFMADDCPH xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from xmm2 and xmm3/m128/m32bcst, add to xmm1 and
store the result in xmm1 subject to writemask k1.

EVEX.256.F3.MAP6.W0 56 /r
VFMADDCPH ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from ymm2 and ymm3/m256/m32bcst, add to ymm1 and
store the result in ymm1 subject to writemask k1.

EVEX.512.F3.MAP6.W0 56 /r
VFMADDCPH zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from zmm2 and zmm3/m512/m32bcst, add to zmm1 and
store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a complex multiply and accumulate operation. There are normal and complex conjugate
forms of the operation.
The broadcasting and masking for this operation are done on 32-bit quantities representing a pair of FP16 values.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.
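The per-element arithmetic can be modeled in scalar C as follows (a sketch of one (real, imaginary) pair, mirroring the Operation section below; 'cplx' is a hypothetical illustration type, and float stands in for FP16):

typedef struct { float re, im; } cplx;

/* VFMADDCPH element: acc + a*b (plain complex multiply-accumulate). */
cplx fmaddc(cplx acc, cplx a, cplx b)
{
    cplx r = { acc.re + a.re * b.re - a.im * b.im,
               acc.im + a.im * b.re + a.re * b.im };
    return r;
}

/* VFCMADDCPH element: acc + a*conj(b) (conjugate form). */
cplx fcmaddc(cplx acc, cplx a, cplx b)
{
    cplx r = { acc.re + a.re * b.re + a.im * b.im,
               acc.im + a.im * b.re - a.re * b.im };
    return r;
}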



Operation
VFCMADDCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256, 512
KL := VL / 32

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        IF broadcasting and src2 is memory:
            tsrc2.fp16[2*i+0] := src2.fp16[0]
            tsrc2.fp16[2*i+1] := src2.fp16[1]
        ELSE:
            tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
            tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        tmp[2*i+0] := dest.fp16[2*i+0] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
        tmp[2*i+1] := dest.fp16[2*i+1] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        // conjugate version subtracts the odd final term
        dest.fp16[2*i+0] := tmp[2*i+0] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
        dest.fp16[2*i+1] := tmp[2*i+1] - src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
    ELSE IF *zeroing*:
        dest.fp16[2*i+0] := 0
        dest.fp16[2*i+1] := 0

DEST[MAXVL-1:VL] := 0

VFMADDCPH dest{k1}, src1, src2 (AVX512)


VL = 128, 256, 512
KL := VL / 32

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        IF broadcasting and src2 is memory:
            tsrc2.fp16[2*i+0] := src2.fp16[0]
            tsrc2.fp16[2*i+1] := src2.fp16[1]
        ELSE:
            tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
            tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        tmp[2*i+0] := dest.fp16[2*i+0] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
        tmp[2*i+1] := dest.fp16[2*i+1] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        // non-conjugate version subtracts the even final term
        dest.fp16[2*i+0] := tmp[2*i+0] - src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
        dest.fp16[2*i+1] := tmp[2*i+1] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
    ELSE IF *zeroing*:
        dest.fp16[2*i+0] := 0
        dest.fp16[2*i+1] := 0

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFCMADDCPH __m128h _mm_fcmadd_pch (__m128h a, __m128h b, __m128h c);
VFCMADDCPH __m128h _mm_mask_fcmadd_pch (__m128h a, __mmask8 k, __m128h b, __m128h c);
VFCMADDCPH __m128h _mm_mask3_fcmadd_pch (__m128h a, __m128h b, __m128h c, __mmask8 k);
VFCMADDCPH __m128h _mm_maskz_fcmadd_pch (__mmask8 k, __m128h a, __m128h b, __m128h c);
VFCMADDCPH __m256h _mm256_fcmadd_pch (__m256h a, __m256h b, __m256h c);
VFCMADDCPH __m256h _mm256_mask_fcmadd_pch (__m256h a, __mmask8 k, __m256h b, __m256h c);
VFCMADDCPH __m256h _mm256_mask3_fcmadd_pch (__m256h a, __m256h b, __m256h c, __mmask8 k);
VFCMADDCPH __m256h _mm256_maskz_fcmadd_pch (__mmask8 k, __m256h a, __m256h b, __m256h c);
VFCMADDCPH __m512h _mm512_fcmadd_pch (__m512h a, __m512h b, __m512h c);
VFCMADDCPH __m512h _mm512_mask_fcmadd_pch (__m512h a, __mmask16 k, __m512h b, __m512h c);
VFCMADDCPH __m512h _mm512_mask3_fcmadd_pch (__m512h a, __m512h b, __m512h c, __mmask16 k);
VFCMADDCPH __m512h _mm512_maskz_fcmadd_pch (__mmask16 k, __m512h a, __m512h b, __m512h c);
VFCMADDCPH __m512h _mm512_fcmadd_round_pch (__m512h a, __m512h b, __m512h c, const int rounding);
VFCMADDCPH __m512h _mm512_mask_fcmadd_round_pch (__m512h a, __mmask16 k, __m512h b, __m512h c, const int rounding);
VFCMADDCPH __m512h _mm512_mask3_fcmadd_round_pch (__m512h a, __m512h b, __m512h c, __mmask16 k, const int rounding);
VFCMADDCPH __m512h _mm512_maskz_fcmadd_round_pch (__mmask16 k, __m512h a, __m512h b, __m512h c, const int rounding);

VFMADDCPH __m128h _mm_fmadd_pch (__m128h a, __m128h b, __m128h c);


VFMADDCPH __m128h _mm_mask_fmadd_pch (__m128h a, __mmask8 k, __m128h b, __m128h c);
VFMADDCPH __m128h _mm_mask3_fmadd_pch (__m128h a, __m128h b, __m128h c, __mmask8 k);
VFMADDCPH __m128h _mm_maskz_fmadd_pch (__mmask8 k, __m128h a, __m128h b, __m128h c);
VFMADDCPH __m256h _mm256_fmadd_pch (__m256h a, __m256h b, __m256h c);
VFMADDCPH __m256h _mm256_mask_fmadd_pch (__m256h a, __mmask8 k, __m256h b, __m256h c);
VFMADDCPH __m256h _mm256_mask3_fmadd_pch (__m256h a, __m256h b, __m256h c, __mmask8 k);
VFMADDCPH __m256h _mm256_maskz_fmadd_pch (__mmask8 k, __m256h a, __m256h b, __m256h c);
VFMADDCPH __m512h _mm512_fmadd_pch (__m512h a, __m512h b, __m512h c);
VFMADDCPH __m512h _mm512_mask_fmadd_pch (__m512h a, __mmask16 k, __m512h b, __m512h c);
VFMADDCPH __m512h _mm512_mask3_fmadd_pch (__m512h a, __m512h b, __m512h c, __mmask16 k);
VFMADDCPH __m512h _mm512_maskz_fmadd_pch (__mmask16 k, __m512h a, __m512h b, __m512h c);
VFMADDCPH __m512h _mm512_fmadd_round_pch (__m512h a, __m512h b, __m512h c, const int rounding);
VFMADDCPH __m512h _mm512_mask_fmadd_round_pch (__m512h a, __mmask16 k, __m512h b, __m512h c, const int rounding);
VFMADDCPH __m512h _mm512_mask3_fmadd_round_pch (__m512h a, __m512h b, __m512h c, __mmask16 k, const int rounding);
VFMADDCPH __m512h _mm512_maskz_fmadd_round_pch (__mmask16 k, __m512h a, __m512h b, __m512h c, const int rounding);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).



VFCMADDCSH/VFMADDCSH—Complex Multiply and Accumulate Scalar FP16 Values
EVEX.LLIG.F2.MAP6.W0 57 /r
VFCMADDCSH xmm1{k1}{z}, xmm2, xmm3/m32 {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from xmm2 and the complex conjugate of xmm3/m32,
add to xmm1 and store the result in xmm1 subject to writemask k1. Bits 127:32 of xmm2
are copied to xmm1[127:32].

EVEX.LLIG.F3.MAP6.W0 57 /r
VFMADDCSH xmm1{k1}{z}, xmm2, xmm3/m32 {er}
Op/En: A. 64/32-bit Mode: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1).
Complex multiply a pair of FP16 values from xmm2 and xmm3/m32, add to xmm1 and store
the result in xmm1 subject to writemask k1. Bits 127:32 of xmm2 are copied to
xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a complex multiply and accumulate operation. There are normal and complex conjugate
forms of the operation.
The masking for this operation is done on 32-bit quantities representing a pair of FP16 values.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.

Operation
VFCMADDCSH dest{k1}, src1, src2 (AVX512)
IF k1[0] or *no writemask*:
    tmp[0] := dest.fp16[0] + src1.fp16[0] * src2.fp16[0]
    tmp[1] := dest.fp16[1] + src1.fp16[1] * src2.fp16[0]
    // conjugate version subtracts the odd final term
    dest.fp16[0] := tmp[0] + src1.fp16[1] * src2.fp16[1]
    dest.fp16[1] := tmp[1] - src1.fp16[0] * src2.fp16[1]
ELSE IF *zeroing*:
    dest.fp16[0] := 0
    dest.fp16[1] := 0

DEST[127:32] := src1[127:32] // copy upper part of src1
DEST[MAXVL-1:128] := 0



VFMADDCSH dest{k1}, src1, src2 (AVX512)
IF k1[0] or *no writemask*:
tmp[0] := dest.fp16[0] + src1.fp16[0] * src2.fp16[0]
tmp[1] := dest.fp16[1] + src1.fp16[1] * src2.fp16[0]

// non-conjugate version subtracts last even term


dest.fp16[0] := tmp[0] - src1.fp16[1] * src2.fp16[1]
dest.fp16[1] := tmp[1] + src1.fp16[0] * src2.fp16[1]
ELSE IF *zeroing*:
dest.fp16[0] := 0
dest.fp16[1] := 0

DEST[127:32] := src1[127:32] // copy upper part of src1


DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFCMADDCSH __m128h _mm_fcmadd_round_sch (__m128h a, __m128h b, __m128h c, const int rounding);
VFCMADDCSH __m128h _mm_mask_fcmadd_round_sch (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
VFCMADDCSH __m128h _mm_mask3_fcmadd_round_sch (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
VFCMADDCSH __m128h _mm_maskz_fcmadd_round_sch (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
VFCMADDCSH __m128h _mm_fcmadd_sch (__m128h a, __m128h b, __m128h c);
VFCMADDCSH __m128h _mm_mask_fcmadd_sch (__m128h a, __mmask8 k, __m128h b, __m128h c);
VFCMADDCSH __m128h _mm_mask3_fcmadd_sch (__m128h a, __m128h b, __m128h c, __mmask8 k);
VFCMADDCSH __m128h _mm_maskz_fcmadd_sch (__mmask8 k, __m128h a, __m128h b, __m128h c);

VFMADDCSH __m128h _mm_fmadd_round_sch (__m128h a, __m128h b, __m128h c, const int rounding);


VFMADDCSH __m128h _mm_mask_fmadd_round_sch (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
VFMADDCSH __m128h _mm_mask3_fmadd_round_sch (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
VFMADDCSH __m128h _mm_maskz_fmadd_round_sch (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
VFMADDCSH __m128h _mm_fmadd_sch (__m128h a, __m128h b, __m128h c);
VFMADDCSH __m128h _mm_mask_fmadd_sch (__m128h a, __mmask8 k, __m128h b, __m128h c);
VFMADDCSH __m128h _mm_mask3_fmadd_sch (__m128h a, __m128h b, __m128h c, __mmask8 k);
VFMADDCSH __m128h _mm_maskz_fmadd_sch (__mmask8 k, __m128h a, __m128h b, __m128h c);
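As a usage sketch (assuming a compiler and CPU with AVX512-FP16 support, and taking the first two intrinsic arguments as the multiplicands and the third as the accumulator):

#include <immintrin.h>

/* One step of a complex dot product: low FP16 pair of acc += x * conj(y). */
static __m128h cdot_step(__m128h acc, __m128h x, __m128h y)
{
    return _mm_fcmadd_sch(x, y, acc);
}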

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).



VFCMULCPH/VFMULCPH—Complex Multiply FP16 Values
EVEX.128.F2.MAP6.W0 D6 /r
VFCMULCPH xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from xmm2 and the complex conjugate of
xmm3/m128/m32bcst, and store the result in xmm1 subject to writemask k1.

EVEX.256.F2.MAP6.W0 D6 /r
VFCMULCPH ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from ymm2 and the complex conjugate of
ymm3/m256/m32bcst, and store the result in ymm1 subject to writemask k1.

EVEX.512.F2.MAP6.W0 D6 /r
VFCMULCPH zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from zmm2 and the complex conjugate of
zmm3/m512/m32bcst, and store the result in zmm1 subject to writemask k1.

EVEX.128.F3.MAP6.W0 D6 /r
VFMULCPH xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from xmm2 and xmm3/m128/m32bcst, and store the
result in xmm1 subject to writemask k1.

EVEX.256.F3.MAP6.W0 D6 /r
VFMULCPH ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from ymm2 and ymm3/m256/m32bcst, and store the
result in ymm1 subject to writemask k1.

EVEX.512.F3.MAP6.W0 D6 /r
VFMULCPH zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from zmm2 and zmm3/m512/m32bcst, and store the
result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a complex multiply operation. There are normal and complex conjugate forms of the
operation. The broadcasting and masking for this operation are done on 32-bit quantities representing a pair of
FP16 values.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.

Operation
VFCMULCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256 or 512
KL := VL/32

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]



ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp.fp16[2*i+0] := src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp.fp16[2*i+1] := src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// conjugate version subtracts odd final term
dest.fp16[2*i+0] := tmp.fp16[2*i+0] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp.fp16[2*i+1] - src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0

DEST[MAXVL-1:VL] := 0

VFMULCPH dest{k1}, src1, src2 (AVX512)


VL = 128, 256 or 512
KL := VL/32

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]
ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp.fp16[2*i+0] := src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp.fp16[2*i+1] := src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// non-conjugate version subtracts last even term
dest.fp16[2*i+0] := tmp.fp16[2*i+0] - src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp.fp16[2*i+1] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0

DEST[MAXVL-1:VL] := 0
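In plain C the per-pair math of both packed forms reduces to the sketch below (float in place of FP16; n = VL/32 pairs; masking, broadcasting, and per-FMA rounding omitted):

static void cmul_pairs_model(float *dst, const float *a, const float *b,
                             int n, int conjugate_b)
{
    for (int i = 0; i < n; i++) {
        float ar = a[2*i], ai = a[2*i + 1];   /* src1 pair */
        float br = b[2*i], bi = b[2*i + 1];   /* src2 pair */
        float tr = ar * br;                   /* tmp, real lane */
        float ti = ai * br;                   /* tmp, imaginary lane */
        if (conjugate_b) {                    /* VFCMULCPH: a * conj(b) */
            dst[2*i]     = tr + ai * bi;
            dst[2*i + 1] = ti - ar * bi;
        } else {                              /* VFMULCPH: a * b */
            dst[2*i]     = tr - ai * bi;
            dst[2*i + 1] = ti + ar * bi;
        }
    }
}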



Intel C/C++ Compiler Intrinsic Equivalent
VFCMULCPH __m128h _mm_cmul_pch (__m128h a, __m128h b);
VFCMULCPH __m128h _mm_mask_cmul_pch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFCMULCPH __m128h _mm_maskz_cmul_pch (__mmask8 k, __m128h a, __m128h b);
VFCMULCPH __m256h _mm256_cmul_pch (__m256h a, __m256h b);
VFCMULCPH __m256h _mm256_mask_cmul_pch (__m256h src, __mmask8 k, __m256h a, __m256h b);
VFCMULCPH __m256h _mm256_maskz_cmul_pch (__mmask8 k, __m256h a, __m256h b);
VFCMULCPH __m512h _mm512_cmul_pch (__m512h a, __m512h b);
VFCMULCPH __m512h _mm512_mask_cmul_pch (__m512h src, __mmask16 k, __m512h a, __m512h b);
VFCMULCPH __m512h _mm512_maskz_cmul_pch (__mmask16 k, __m512h a, __m512h b);
VFCMULCPH __m512h _mm512_cmul_round_pch (__m512h a, __m512h b, const int rounding);
VFCMULCPH __m512h _mm512_mask_cmul_round_pch (__m512h src, __mmask16 k, __m512h a, __m512h b, const int rounding);
VFCMULCPH __m512h _mm512_maskz_cmul_round_pch (__mmask16 k, __m512h a, __m512h b, const int rounding);
VFCMULCPH __m128h _mm_fcmul_pch (__m128h a, __m128h b);
VFCMULCPH __m128h _mm_mask_fcmul_pch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFCMULCPH __m128h _mm_maskz_fcmul_pch (__mmask8 k, __m128h a, __m128h b);
VFCMULCPH __m256h _mm256_fcmul_pch (__m256h a, __m256h b);
VFCMULCPH __m256h _mm256_mask_fcmul_pch (__m256h src, __mmask8 k, __m256h a, __m256h b);
VFCMULCPH __m256h _mm256_maskz_fcmul_pch (__mmask8 k, __m256h a, __m256h b);
VFCMULCPH __m512h _mm512_fcmul_pch (__m512h a, __m512h b);
VFCMULCPH __m512h _mm512_mask_fcmul_pch (__m512h src, __mmask16 k, __m512h a, __m512h b);
VFCMULCPH __m512h _mm512_maskz_fcmul_pch (__mmask16 k, __m512h a, __m512h b);
VFCMULCPH __m512h _mm512_fcmul_round_pch (__m512h a, __m512h b, const int rounding);
VFCMULCPH __m512h _mm512_mask_fcmul_round_pch (__m512h src, __mmask16 k, __m512h a, __m512h b, const int rounding);
VFCMULCPH __m512h _mm512_maskz_fcmul_round_pch (__mmask16 k, __m512h a, __m512h b, const int rounding);

VFMULCPH __m128h _mm_fmul_pch (__m128h a, __m128h b);


VFMULCPH __m128h _mm_mask_fmul_pch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFMULCPH __m128h _mm_maskz_fmul_pch (__mmask8 k, __m128h a, __m128h b);
VFMULCPH __m256h _mm256_fmul_pch (__m256h a, __m256h b);
VFMULCPH __m256h _mm256_mask_fmul_pch (__m256h src, __mmask8 k, __m256h a, __m256h b);
VFMULCPH __m256h _mm256_maskz_fmul_pch (__mmask8 k, __m256h a, __m256h b);
VFMULCPH __m512h _mm512_fmul_pch (__m512h a, __m512h b);
VFMULCPH __m512h _mm512_mask_fmul_pch (__m512h src, __mmask16 k, __m512h a, __m512h b);
VFMULCPH __m512h _mm512_maskz_fmul_pch (__mmask16 k, __m512h a, __m512h b);
VFMULCPH __m512h _mm512_fmul_round_pch (__m512h a, __m512h b, const int rounding);
VFMULCPH __m512h _mm512_mask_fmul_round_pch (__m512h src, __mmask16 k, __m512h a, __m512h b, const int rounding);
VFMULCPH __m512h _mm512_maskz_fmul_round_pch (__mmask16 k, __m512h a, __m512h b, const int rounding);
VFMULCPH __m128h _mm_mask_mul_pch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFMULCPH __m128h _mm_maskz_mul_pch (__mmask8 k, __m128h a, __m128h b);
VFMULCPH __m128h _mm_mul_pch (__m128h a, __m128h b);
VFMULCPH __m256h _mm256_mask_mul_pch (__m256h src, __mmask8 k, __m256h a, __m256h b);
VFMULCPH __m256h _mm256_maskz_mul_pch (__mmask8 k, __m256h a, __m256h b);
VFMULCPH __m256h _mm256_mul_pch (__m256h a, __m256h b);
VFMULCPH __m512h _mm512_mask_mul_pch (__m512h src, __mmask16 k, __m512h a, __m512h b);
VFMULCPH __m512h _mm512_maskz_mul_pch (__mmask16 k, __m512h a, __m512h b);
VFMULCPH __m512h _mm512_mul_pch (__m512h a, __m512h b);
VFMULCPH __m512h _mm512_mask_mul_round_pch (__m512h src, __mmask16 k, __m512h a, __m512h b, const int rounding);
VFMULCPH __m512h _mm512_maskz_mul_round_pch (__mmask16 k, __m512h a, __m512h b, const int rounding);
VFMULCPH __m512h _mm512_mul_round_pch (__m512h a, __m512h b, const int rounding);
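A usage sketch for the packed non-conjugate form (assumes compiler and CPU support for AVX512-FP16; buffers hold interleaved real/imaginary FP16 values, and pairs is taken as a multiple of 16 for brevity):

#include <immintrin.h>
#include <stddef.h>

static void cmul_buffers(_Float16 *dst, const _Float16 *a,
                         const _Float16 *b, size_t pairs)
{
    for (size_t i = 0; i < pairs; i += 16) {          /* 16 pairs per zmm */
        __m512h va = _mm512_loadu_ph(a + 2 * i);
        __m512h vb = _mm512_loadu_ph(b + 2 * i);
        _mm512_storeu_ph(dst + 2 * i, _mm512_fmul_pch(va, vb));
    }
}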

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.



Other Exceptions
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).



VFCMULCSH/VFMULCSH—Complex Multiply Scalar FP16 Values
EVEX.LLIG.F2.MAP6.W0 D7 /r
VFCMULCSH xmm1{k1}{z}, xmm2, xmm3/m32 {er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from xmm2 and the complex conjugate of xmm3/m32, and
store the result in xmm1 subject to writemask k1. Bits 127:32 of xmm2 are copied to
xmm1[127:32].

EVEX.LLIG.F3.MAP6.W0 D7 /r
VFMULCSH xmm1{k1}{z}, xmm2, xmm3/m32 {er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Complex multiply a pair of FP16 values from xmm2 and xmm3/m32, and store the result in xmm1
subject to writemask k1. Bits 127:32 of xmm2 are copied to xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a complex multiply operation. There are normal and complex conjugate forms of the
operation. The masking for this operation is done on 32-bit quantities representing a pair of FP16 values.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.

Operation
VFCMULCSH dest{k1}, src1, src2 (AVX512)

IF k1[0] or *no writemask*:


tmp.fp16[0] := src1.fp16[0] * src2.fp16[0]
tmp.fp16[1] := src1.fp16[1] * src2.fp16[0]

// conjugate version subtracts odd final term


dest.fp16[0] := tmp.fp16[0] + src1.fp16[1] * src2.fp16[1]
dest.fp16[1] := tmp.fp16[1] - src1.fp16[0] * src2.fp16[1]
ELSE IF *zeroing*:
dest.fp16[0] := 0
dest.fp16[1] := 0

DEST[127:32] := src1[127:32] // copy upper part of src1


DEST[MAXVL-1:128] := 0



VFMULCSH dest{k1}, src1, src2 (AVX512)

IF k1[0] or *no writemask*:


// non-conjugate version subtracts last even term
tmp.fp16[0] := src1.fp16[0] * src2.fp16[0]
tmp.fp16[1] := src1.fp16[1] * src2.fp16[0]
dest.fp16[0] := tmp.fp16[0] - src1.fp16[1] * src2.fp16[1]
dest.fp16[1] := tmp.fp16[1] + src1.fp16[0] * src2.fp16[1]
ELSE IF *zeroing*:
dest.fp16[0] := 0
dest.fp16[1] := 0

DEST[127:32] := src1[127:32] // copy upper part of src1


DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFCMULCSH __m128h _mm_cmul_round_sch (__m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_mask_cmul_round_sch (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_maskz_cmul_round_sch (__mmask8 k, __m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_cmul_sch (__m128h a, __m128h b);
VFCMULCSH __m128h _mm_mask_cmul_sch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFCMULCSH __m128h _mm_maskz_cmul_sch (__mmask8 k, __m128h a, __m128h b);
VFCMULCSH __m128h _mm_fcmul_round_sch (__m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_mask_fcmul_round_sch (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_maskz_fcmul_round_sch (__mmask8 k, __m128h a, __m128h b, const int rounding);
VFCMULCSH __m128h _mm_fcmul_sch (__m128h a, __m128h b);
VFCMULCSH __m128h _mm_mask_fcmul_sch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFCMULCSH __m128h _mm_maskz_fcmul_sch (__mmask8 k, __m128h a, __m128h b);

VFMULCSH __m128h _mm_fmul_round_sch (__m128h a, __m128h b, const int rounding);


VFMULCSH __m128h _mm_mask_fmul_round_sch (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VFMULCSH __m128h _mm_maskz_fmul_round_sch (__mmask8 k, __m128h a, __m128h b, const int rounding);
VFMULCSH __m128h _mm_fmul_sch (__m128h a, __m128h b);
VFMULCSH __m128h _mm_mask_fmul_sch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFMULCSH __m128h _mm_maskz_fmul_sch (__mmask8 k, __m128h a, __m128h b);
VFMULCSH __m128h _mm_mask_mul_round_sch (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VFMULCSH __m128h _mm_maskz_mul_round_sch (__mmask8 k, __m128h a, __m128h b, const int rounding);
VFMULCSH __m128h _mm_mul_round_sch (__m128h a, __m128h b, const int rounding);
VFMULCSH __m128h _mm_mask_mul_sch (__m128h src, __mmask8 k, __m128h a, __m128h b);
VFMULCSH __m128h _mm_maskz_mul_sch (__mmask8 k, __m128h a, __m128h b);
VFMULCSH __m128h _mm_mul_sch (__m128h a, __m128h b);
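One property of the conjugate form worth noting: multiplying a value by its own conjugate leaves the squared magnitude re*re + im*im in the real lane and (up to rounding) zero in the imaginary lane. A sketch, assuming AVX512-FP16 compiler and hardware support:

#include <immintrin.h>

/* Low 32 bits of the result hold |z|^2 in the real FP16 lane. */
static __m128h cmag2(__m128h z)
{
    return _mm_fcmul_sch(z, z);   /* z * conj(z) */
}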

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).



VFIXUPIMMPD—Fix Up Special Packed Float64 Values
EVEX.128.66.0F3A.W1 54 /r ib
VFIXUPIMMPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Fix up special numbers in float64 vector xmm1, float64 vector xmm2, and int64 vector
xmm3/m128/m64bcst and store the result in xmm1, under writemask.

EVEX.256.66.0F3A.W1 54 /r ib
VFIXUPIMMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Fix up special numbers in float64 vector ymm1, float64 vector ymm2, and int64 vector
ymm3/m256/m64bcst and store the result in ymm1, under writemask.

EVEX.512.66.0F3A.W1 54 /r ib
VFIXUPIMMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Fix up elements of float64 vector in zmm2 using int64 vector table in zmm3/m512/m64bcst,
combine with preserved elements from zmm1, and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Perform fix-up of quad-word elements encoded in double precision floating-point format in the first source operand
(the second operand) using a 32-bit, two-level look-up table specified in the corresponding quadword element of
the second source operand (the third operand) with exception reporting specifier imm8. The elements that are
fixed-up are selected by mask bits of 1 specified in the opmask k1. Mask bits of 0 in the opmask k1 or table
response action of 0000b preserves the corresponding element of the first operand. The fixed-up elements from
the first source operand and the preserved element in the first operand are combined as the final results in the
destination operand (the first operand).
The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
The two-level look-up table performs a fix-up of each double precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that maps the input encoding in the first source operand to one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPD can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup or fault reporting.
Imm8 is used to set the required fault reporting; it supports #ZE and #IE reporting (see details below).
MXCSR mask bits are ignored (treated as if all mask bits are set to masked response). If any of the imm8
bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.



This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into zmm1. Elements in the destination with the corresponding bit clear in k1 retain their
previous values or are set to 0.
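For the rcp example above, the fix-up step might look like the following sketch (AVX512F assumed; nr_rcp stands in for the caller's Newton-Raphson result). In the operation below, ZERO_VALUE_TOKEN has index j = 2, so its 4-bit response occupies table bits 11:8; response 0110b selects -INF or +INF by the input's sign, every other nibble is 0000b (preserve the computed element), and imm8 = 0 requests no fault reporting. The first intrinsic argument supplies the preserved destination elements and the second is the input that gets classified:

#include <immintrin.h>

/* Patch a reciprocal approximation so rcp(+/-0) returns +/-INF. */
static __m512d rcp_fixup(__m512d x, __m512d nr_rcp)
{
    const __m512i table = _mm512_set1_epi64(0x600);   /* nibble 2 = 0110b */
    return _mm512_fixupimm_pd(nr_rcp, x, table, 0);
}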

Operation
enum TOKEN_TYPE
{
QNAN_TOKEN := 0,
SNAN_TOKEN := 1,
ZERO_VALUE_TOKEN := 2,
POS_ONE_VALUE_TOKEN := 3,
NEG_INF_TOKEN := 4,
POS_INF_TOKEN := 5,
NEG_VALUE_TOKEN := 6,
POS_VALUE_TOKEN := 7
}

FIXUPIMM_DP (dest[63:0], src1[63:0],tbl3[63:0], imm8 [7:0]){


tsrc[63:0] := ((src1[62:52] = 0) AND (MXCSR.DAZ =1)) ? 0.0 : src1[63:0]
CASE(tsrc[63:0] of TOKEN_TYPE) {
QNAN_TOKEN: j := 0;
SNAN_TOKEN: j := 1;
ZERO_VALUE_TOKEN: j := 2;
POS_ONE_VALUE_TOKEN: j := 3;
NEG_INF_TOKEN: j := 4;
POS_INF_TOKEN: j := 5;
NEG_VALUE_TOKEN: j := 6;
POS_VALUE_TOKEN: j := 7;
} ; end source special CASE(tsrc…)

; The required response from src3 table is extracted


token_response[3:0] = tbl3[3+4*j:4*j];

CASE(token_response[3:0]) {
0000: dest[63:0] := dest[63:0]; ; preserve content of DEST
0001: dest[63:0] := tsrc[63:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[63:0] := QNaN(tsrc[63:0]);
0011: dest[63:0] := QNAN_Indefinite;
0100: dest[63:0] := -INF;
0101: dest[63:0] := +INF;
0110: dest[63:0] := tsrc.sign? -INF : +INF;
0111: dest[63:0] := -0;
1000: dest[63:0] := +0;
1001: dest[63:0] := -1;
1010: dest[63:0] := +1;
1011: dest[63:0] := ½;
1100: dest[63:0] := 90.0;
1101: dest[63:0] := PI/2;
1110: dest[63:0] := MAX_FLOAT;
1111: dest[63:0] := -MAX_FLOAT;
} ; end of token_response CASE

; The required fault reporting from imm8 is extracted


; TOKENs are mutually exclusive and TOKENs priority defines the order.



; Multiple faults related to a single token can occur simultaneously.
IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] then set #ZE;
IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[2] then set #ZE;
IF (tsrc[63:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[3] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] then set #IE;
; end fault reporting
return dest[63:0];
} ; end of FIXUPIMM_DP()

VFIXUPIMMPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[63:0], imm8 [7:0])
ELSE
DEST[i+63:i] := FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[i+63:i], imm8 [7:0])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Immediate Control Description (imm8 fault-reporting enables):

imm8[0]: ZERO input sets #ZE
imm8[1]: ZERO input sets #IE
imm8[2]: ONE input sets #ZE
imm8[3]: ONE input sets #IE
imm8[4]: SNaN input sets #IE
imm8[5]: -INF input sets #IE
imm8[6]: negative value (-VE) input sets #IE
imm8[7]: +INF input sets #IE

Figure 5-9. VFIXUPIMMPD Immediate Control Description



Intel C/C++ Compiler Intrinsic Equivalent
VFIXUPIMMPD __m512d _mm512_fixupimm_pd( __m512d a, __m512d b, __m512i c, int imm8);
VFIXUPIMMPD __m512d _mm512_mask_fixupimm_pd(__m512d a, __mmask8 k, __m512d b, __m512i c, int imm8);
VFIXUPIMMPD __m512d _mm512_maskz_fixupimm_pd( __mmask8 k, __m512d a, __m512d b, __m512i c, int imm8);
VFIXUPIMMPD __m512d _mm512_fixupimm_round_pd( __m512d a, __m512d b, __m512i c, int imm8, int sae);
VFIXUPIMMPD __m512d _mm512_mask_fixupimm_round_pd(__m512d a, __mmask8 k, __m512d b, __m512i c, int imm8, int sae);
VFIXUPIMMPD __m512d _mm512_maskz_fixupimm_round_pd( __mmask8 k, __m512d a, __m512d b, __m512i c, int imm8, int sae);
VFIXUPIMMPD __m256d _mm256_fixupimm_pd( __m256d a, __m256d b, __m256i c, int imm8);
VFIXUPIMMPD __m256d _mm256_mask_fixupimm_pd(__m256d a, __mmask8 k, __m256d b, __m256i c, int imm8);
VFIXUPIMMPD __m256d _mm256_maskz_fixupimm_pd( __mmask8 k, __m256d a, __m256d b, __m256i c, int imm8);
VFIXUPIMMPD __m128d _mm_fixupimm_pd( __m128d a, __m128d b, __m128i c, int imm8);
VFIXUPIMMPD __m128d _mm_mask_fixupimm_pd(__m128d a, __mmask8 k, __m128d b, __m128i c, int imm8);
VFIXUPIMMPD __m128d _mm_maskz_fixupimm_pd( __mmask8 k, __m128d a, __m128d b, __m128i c, int imm8);

SIMD Floating-Point Exceptions

Zero, Invalid.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”



VFIXUPIMMPS—Fix Up Special Packed Float32 Values
EVEX.128.66.0F3A.W0 54 /r ib
VFIXUPIMMPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Fix up special numbers in float32 vector xmm1, float32 vector xmm2, and int32 vector
xmm3/m128/m32bcst and store the result in xmm1, under writemask.

EVEX.256.66.0F3A.W0 54 /r ib
VFIXUPIMMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Fix up special numbers in float32 vector ymm1, float32 vector ymm2, and int32 vector
ymm3/m256/m32bcst and store the result in ymm1, under writemask.

EVEX.512.66.0F3A.W0 54 /r ib
VFIXUPIMMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Fix up elements of float32 vector in zmm2 using int32 vector table in zmm3/m512/m32bcst,
combine with preserved elements from zmm1, and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Perform fix-up of doubleword elements encoded in single precision floating-point format in the first source operand
(the second operand) using a 32-bit, two-level look-up table specified in the corresponding doubleword element of
the second source operand (the third operand) with exception reporting specifier imm8. The elements that are
fixed-up are selected by mask bits of 1 specified in the opmask k1. Mask bits of 0 in the opmask k1 or table
response action of 0000b preserves the corresponding element of the first operand. The fixed-up elements from
the first source operand and the preserved element in the first operand are combined as the final results in the
destination operand (the first operand).
The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
The two-level look-up table performs a fix-up of each single precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that maps the input encoding in the first source operand to one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPS can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup or fault reporting.
Imm8 is used to set the required fault reporting; it supports #ZE and #IE reporting (see details below).
MXCSR.DAZ is used and refers to the first source operand only (i.e., the destination operand is not considered as
zero when MXCSR.DAZ is set).
MXCSR mask bits are ignored (treated as if all mask bits are set to masked response). If any of the imm8
bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
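A sketch with two table entries (AVX512F assumed; approx stands in for a reciprocal square-root estimate sequence): mapping ZERO_VALUE_TOKEN (j = 2, bits 11:8) to 0101b (+INF) and NEG_VALUE_TOKEN (j = 6, bits 27:24) to 0011b (QNaN indefinite) makes rsqrt(0) produce +INF and rsqrt of a negative number produce a NaN, while every other input keeps the computed approximation:

#include <immintrin.h>

static __m512 rsqrt_fixup(__m512 x, __m512 approx)
{
    const __m512i table = _mm512_set1_epi32((0x3 << 24) | (0x5 << 8));
    return _mm512_fixupimm_ps(approx, x, table, 0);
}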



Operation
enum TOKEN_TYPE
{
QNAN_TOKEN := 0,
SNAN_TOKEN := 1,
ZERO_VALUE_TOKEN := 2,
POS_ONE_VALUE_TOKEN := 3,
NEG_INF_TOKEN := 4,
POS_INF_TOKEN := 5,
NEG_VALUE_TOKEN := 6,
POS_VALUE_TOKEN := 7
}

FIXUPIMM_SP ( dest[31:0], src1[31:0],tbl3[31:0], imm8 [7:0]){


tsrc[31:0] := ((src1[30:23] = 0) AND (MXCSR.DAZ =1)) ? 0.0 : src1[31:0]
CASE(tsrc[31:0] of TOKEN_TYPE) {
QNAN_TOKEN: j := 0;
SNAN_TOKEN: j := 1;
ZERO_VALUE_TOKEN: j := 2;
POS_ONE_VALUE_TOKEN: j := 3;
NEG_INF_TOKEN: j := 4;
POS_INF_TOKEN: j := 5;
NEG_VALUE_TOKEN: j := 6;
POS_VALUE_TOKEN: j := 7;
} ; end source special CASE(tsrc…)

; The required response from src3 table is extracted


token_response[3:0] = tbl3[3+4*j:4*j];

CASE(token_response[3:0]) {
0000: dest[31:0] := dest[31:0]; ; preserve content of DEST
0001: dest[31:0] := tsrc[31:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[31:0] := QNaN(tsrc[31:0]);
0011: dest[31:0] := QNAN_Indefinite;
0100: dest[31:0] := -INF;
0101: dest[31:0] := +INF;
0110: dest[31:0] := tsrc.sign? -INF : +INF;
0111: dest[31:0] := -0;
1000: dest[31:0] := +0;
1001: dest[31:0] := -1;
1010: dest[31:0] := +1;
1011: dest[31:0] := ½;
1100: dest[31:0] := 90.0;
1101: dest[31:0] := PI/2;
1110: dest[31:0] := MAX_FLOAT;
1111: dest[31:0] := -MAX_FLOAT;
} ; end of token_response CASE

; The required fault reporting from imm8 is extracted


; TOKENs are mutually exclusive and TOKENs priority defines the order.
; Multiple faults related to a single token can occur simultaneously.
IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] then set #ZE;
IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[2] then set #ZE;



IF (tsrc[31:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[3] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] then set #IE;
; end fault reporting
return dest[31:0];
} ; end of FIXUPIMM_SP()

VFIXUPIMMPS (EVEX)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[31:0], imm8 [7:0])
ELSE
DEST[i+31:i] := FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[i+31:i], imm8 [7:0])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Immediate Control Description (imm8 fault-reporting enables):

imm8[0]: ZERO input sets #ZE
imm8[1]: ZERO input sets #IE
imm8[2]: ONE input sets #ZE
imm8[3]: ONE input sets #IE
imm8[4]: SNaN input sets #IE
imm8[5]: -INF input sets #IE
imm8[6]: negative value (-VE) input sets #IE
imm8[7]: +INF input sets #IE

Figure 5-10. VFIXUPIMMPS Immediate Control Description



Intel C/C++ Compiler Intrinsic Equivalent
VFIXUPIMMPS __m512 _mm512_fixupimm_ps( __m512 a, __m512 b, __m512i c, int imm8);
VFIXUPIMMPS __m512 _mm512_mask_fixupimm_ps(__m512 a, __mmask16 k, __m512 b, __m512i c, int imm8);
VFIXUPIMMPS __m512 _mm512_maskz_fixupimm_ps( __mmask16 k, __m512 a, __m512 b, __m512i c, int imm8);
VFIXUPIMMPS __m512 _mm512_fixupimm_round_ps( __m512 a, __m512 b, __m512i c, int imm8, int sae);
VFIXUPIMMPS __m512 _mm512_mask_fixupimm_round_ps(__m512 a, __mmask16 k, __m512 b, __m512i c, int imm8, int sae);
VFIXUPIMMPS __m512 _mm512_maskz_fixupimm_round_ps( __mmask16 k, __m512 a, __m512 b, __m512i c, int imm8, int sae);
VFIXUPIMMPS __m256 _mm256_fixupimm_ps( __m256 a, __m256 b, __m256i c, int imm8);
VFIXUPIMMPS __m256 _mm256_mask_fixupimm_ps(__m256 a, __mmask8 k, __m256 b, __m256i c, int imm8);
VFIXUPIMMPS __m256 _mm256_maskz_fixupimm_ps( __mmask8 k, __m256 a, __m256 b, __m256i c, int imm8);
VFIXUPIMMPS __m128 _mm_fixupimm_ps( __m128 a, __m128 b, __m128i c, int imm8);
VFIXUPIMMPS __m128 _mm_mask_fixupimm_ps(__m128 a, __mmask8 k, __m128 b, __m128i c, int imm8);
VFIXUPIMMPS __m128 _mm_maskz_fixupimm_ps( __mmask8 k, __m128 a, __m128 b, __m128i c, int imm8);

SIMD Floating-Point Exceptions

Zero, Invalid.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”



VFIXUPIMMSD—Fix Up Special Scalar Float64 Value
EVEX.LLIG.66.0F3A.W1 55 /r ib
VFIXUPIMMSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Fix up a float64 number in the low quadword element of xmm2 using scalar int32 table in
xmm3/m64 and store the result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Perform a fix-up of the low quadword element encoded in double precision floating-point format in the first source
operand (the second operand) using a 32-bit, two-level look-up table specified in the low quadword element of the
second source operand (the third operand) with exception reporting specifier imm8. The element that is fixed-up
is selected by mask bit of 1 specified in the opmask k1. Mask bit of 0 in the opmask k1 or table response action of
0000b preserves the corresponding element of the first operand. The fixed-up element from the first source
operand or the preserved element in the first operand becomes the low quadword element of the destination
operand (the first operand). Bits 127:64 of the destination operand are copied from the corresponding bits of the
first source operand. The destination and first source operands are XMM registers. The second source operand can
be an XMM register or a 64-bit memory location.
The two-level look-up table performs a fix-up of each double precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that maps the input encoding in the first source operand to one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an
incorrect result. To deal with this, VFIXUPIMMSD can be used after the N-R reciprocal sequence to set the result
to the correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup or fault reporting.
Imm8 is used to set the required fault reporting; it supports #ZE and #IE reporting (see details below).
MXCSR.DAZ is used and refers to the first source operand only (i.e., the destination operand is not considered as
zero when MXCSR.DAZ is set).
MXCSR mask bits are ignored (treated as if all mask bits are set to masked response). If any of the imm8
bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
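The response-table read in the operation below is an ordinary nibble extraction; in C terms:

/* Response nibble for token index j (0..7), mirroring
   token_response[3:0] := tbl3[3+4*j:4*j]. */
static unsigned token_response(unsigned tbl, unsigned j)
{
    return (tbl >> (4 * j)) & 0xF;
}
/* e.g., token_response(0x600, 2) == 0x6: a zero input is replaced by
   -INF or +INF according to its sign. */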



Operation
enum TOKEN_TYPE
{
QNAN_TOKEN := 0,
SNAN_TOKEN := 1,
ZERO_VALUE_TOKEN := 2,
POS_ONE_VALUE_TOKEN := 3,
NEG_INF_TOKEN := 4,
POS_INF_TOKEN := 5,
NEG_VALUE_TOKEN := 6,
POS_VALUE_TOKEN := 7
}

FIXUPIMM_DP (dest[63:0], src1[63:0],tbl3[63:0], imm8 [7:0]){


tsrc[63:0] := ((src1[62:52] = 0) AND (MXCSR.DAZ =1)) ? 0.0 : src1[63:0]
CASE(tsrc[63:0] of TOKEN_TYPE) {
QNAN_TOKEN: j := 0;
SNAN_TOKEN: j := 1;
ZERO_VALUE_TOKEN: j := 2;
POS_ONE_VALUE_TOKEN: j := 3;
NEG_INF_TOKEN: j := 4;
POS_INF_TOKEN: j := 5;
NEG_VALUE_TOKEN: j := 6;
POS_VALUE_TOKEN: j := 7;
} ; end source special CASE(tsrc…)

; The required response from src3 table is extracted


token_response[3:0] = tbl3[3+4*j:4*j];

CASE(token_response[3:0]) {
0000: dest[63:0] := dest[63:0] ; preserve content of DEST
0001: dest[63:0] := tsrc[63:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[63:0] := QNaN(tsrc[63:0]);
0011: dest[63:0] := QNAN_Indefinite;
0100: dest[63:0] := -INF;
0101: dest[63:0] := +INF;
0110: dest[63:0] := tsrc.sign? -INF : +INF;
0111: dest[63:0] := -0;
1000: dest[63:0] := +0;
1001: dest[63:0] := -1;
1010: dest[63:0] := +1;
1011: dest[63:0] := ½;
1100: dest[63:0] := 90.0;
1101: dest[63:0] := PI/2;
1110: dest[63:0] := MAX_FLOAT;
1111: dest[63:0] := -MAX_FLOAT;
} ; end of token_response CASE

; The required fault reporting from imm8 is extracted


; TOKENs are mutually exclusive and TOKENs priority defines the order.
; Multiple faults related to a single token can occur simultaneously.
IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] then set #ZE;
IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[2] then set #ZE;



IF (tsrc[63:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[3] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] then set #IE;
IF (tsrc[63:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] then set #IE;
; end fault reporting
return dest[63:0];
} ; end of FIXUPIMM_DP()

VFIXUPIMMSD (EVEX encoded version)


IF k1[0] OR *no writemask*
THEN DEST[63:0] := FIXUPIMM_DP(DEST[63:0], SRC1[63:0], SRC2[63:0], imm8 [7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE DEST[63:0] := 0 ; zeroing-masking
FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Immediate Control Description (imm8 fault-reporting enables):

imm8[0]: ZERO input sets #ZE
imm8[1]: ZERO input sets #IE
imm8[2]: ONE input sets #ZE
imm8[3]: ONE input sets #IE
imm8[4]: SNaN input sets #IE
imm8[5]: -INF input sets #IE
imm8[6]: negative value (-VE) input sets #IE
imm8[7]: +INF input sets #IE

Figure 5-11. VFIXUPIMMSD Immediate Control Description

Intel C/C++ Compiler Intrinsic Equivalent


VFIXUPIMMSD __m128d _mm_fixupimm_sd( __m128d a, __m128d b, __m128i c, int imm8);
VFIXUPIMMSD __m128d _mm_mask_fixupimm_sd(__m128d a, __mmask8 k, __m128d b, __m128i c, int imm8);
VFIXUPIMMSD __m128d _mm_maskz_fixupimm_sd( __mmask8 k, __m128d a, __m128d b, __m128i c, int imm8);
VFIXUPIMMSD __m128d _mm_fixupimm_round_sd( __m128d a, __m128d b, __m128i c, int imm8, int sae);
VFIXUPIMMSD __m128d _mm_mask_fixupimm_round_sd(__m128d a, __mmask8 k, __m128d b, __m128i c, int imm8, int sae);
VFIXUPIMMSD __m128d _mm_maskz_fixupimm_round_sd( __mmask8 k, __m128d a, __m128d b, __m128i c, int imm8, int sae);

SIMD Floating-Point Exceptions

Zero, Invalid.



Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VFIXUPIMMSS—Fix Up Special Scalar Float32 Value
EVEX.LLIG.66.0F3A.W0 55 /r ib
VFIXUPIMMSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Fix up a float32 number in the low doubleword element in xmm2 using scalar int32 table in
xmm3/m32 and store the result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Perform a fix-up of the low doubleword element encoded in single precision floating-point format in the first source
operand (the second operand) using a 32-bit, two-level look-up table specified in the low doubleword element of
the second source operand (the third operand) with exception reporting specifier imm8. The element that is fixed-
up is selected by mask bit of 1 specified in the opmask k1. Mask bit of 0 in the opmask k1 or table response action
of 0000b preserves the corresponding element of the first operand. The fixed-up element from the first source
operand or the preserved element in the first operand becomes the low doubleword element of the destination
operand (the first operand). Bits 127:32 of the destination operand are copied from the corresponding bits of the
first source operand. The destination and first source operands are XMM registers. The second source operand can
be an XMM register or a 32-bit memory location.
The two-level look-up table performs a fix-up of each single precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that maps the input encoding in the first source operand to one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an
incorrect result. To deal with this, VFIXUPIMMSS can be used after the N-R reciprocal sequence to set the result
to the correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup or fault reporting.
Imm8 is used to set the required fault reporting; it supports #ZE and #IE reporting (see details below).
MXCSR.DAZ is used and refers to the first source operand only (i.e., the destination operand is not considered as
zero when MXCSR.DAZ is set).
MXCSR mask bits are ignored (treated as if all mask bits are set to masked response). If any of the imm8
bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
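A sketch of the imm8 fault-reporting side (AVX512F assumed): setting imm8[0] makes a zero input raise #ZE, so MXCSR.ZE (bit 2) can be polled after the fix-up, while the table nibble for ZERO_VALUE_TOKEN (bits 11:8) substitutes +INF for the result:

#include <immintrin.h>

static __m128 fixup_with_ze(__m128 approx, __m128 x)
{
    const __m128i table = _mm_set1_epi32(0x500);        /* ZERO -> +INF */
    __m128 r = _mm_fixupimm_ss(approx, x, table, 0x01); /* imm8[0]: #ZE */
    if (_mm_getcsr() & 0x0004) {
        /* a zero input was encountered; MXCSR.ZE is sticky */
    }
    return r;
}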



Operation
enum TOKEN_TYPE
{
QNAN_TOKEN := 0,
SNAN_TOKEN := 1,
ZERO_VALUE_TOKEN := 2,
POS_ONE_VALUE_TOKEN := 3,
NEG_INF_TOKEN := 4,
POS_INF_TOKEN := 5,
NEG_VALUE_TOKEN := 6,
POS_VALUE_TOKEN := 7
}

FIXUPIMM_SP (dest[31:0], src1[31:0],tbl3[31:0], imm8 [7:0]){


tsrc[31:0] := ((src1[30:23] = 0) AND (MXCSR.DAZ =1)) ? 0.0 : src1[31:0]
CASE(tsrc[31:0] of TOKEN_TYPE) {
QNAN_TOKEN: j := 0;
SNAN_TOKEN: j := 1;
ZERO_VALUE_TOKEN: j := 2;
POS_ONE_VALUE_TOKEN: j := 3;
NEG_INF_TOKEN: j := 4;
POS_INF_TOKEN: j := 5;
NEG_VALUE_TOKEN: j := 6;
POS_VALUE_TOKEN: j := 7;
} ; end source special CASE(tsrc…)

; The required response from src3 table is extracted


token_response[3:0] = tbl3[3+4*j:4*j];

CASE(token_response[3:0]) {
0000: dest[31:0] := dest[31:0]; ; preserve content of DEST
0001: dest[31:0] := tsrc[31:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[31:0] := QNaN(tsrc[31:0]);
0011: dest[31:0] := QNAN_Indefinite;
0100: dest[31:0] := -INF;
0101: dest[31:0] := +INF;
0110: dest[31:0] := tsrc.sign? -INF : +INF;
0111: dest[31:0] := -0;
1000: dest[31:0] := +0;
1001: dest[31:0] := -1;
1010: dest[31:0] := +1;
1011: dest[31:0] := ½;
1100: dest[31:0] := 90.0;
1101: dest[31:0] := PI/2;
1110: dest[31:0] := MAX_FLOAT;
1111: dest[31:0] := -MAX_FLOAT;
} ; end of token_response CASE

; The required fault reporting from imm8 is extracted


; TOKENs are mutually exclusive and TOKENs priority defines the order.
; Multiple faults related to a single token can occur simultaneously.
IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] then set #ZE;
IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[2] then set #ZE;



IF (tsrc[31:0] of TOKEN_TYPE: POS_ONE_VALUE_TOKEN) AND imm8[3] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] then set #IE;
IF (tsrc[31:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] then set #IE;
; end fault reporting
return dest[31:0];
} ; end of FIXUPIMM_SP()

VFIXUPIMMSS (EVEX encoded version)


IF k1[0] OR *no writemask*
THEN DEST[31:0] := FIXUPIMM_SP(DEST[31:0], SRC1[31:0], SRC2[31:0], imm8 [7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE DEST[31:0] := 0 ; zeroing-masking
FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Immediate Control Description (imm8 fault-reporting enables):

imm8[0]: ZERO input sets #ZE
imm8[1]: ZERO input sets #IE
imm8[2]: ONE input sets #ZE
imm8[3]: ONE input sets #IE
imm8[4]: SNaN input sets #IE
imm8[5]: -INF input sets #IE
imm8[6]: negative value (-VE) input sets #IE
imm8[7]: +INF input sets #IE

Figure 5-12. VFIXUPIMMSS Immediate Control Description

Intel C/C++ Compiler Intrinsic Equivalent


VFIXUPIMMSS __m128 _mm_fixupimm_ss( __m128 a, __m128 b, __m128i c, int imm8);
VFIXUPIMMSS __m128 _mm_mask_fixupimm_ss(__m128 a, __mmask8 k, __m128 b, __m128i c, int imm8);
VFIXUPIMMSS __m128 _mm_maskz_fixupimm_ss( __mmask8 k, __m128 a, __m128 b, __m128i c, int imm8);
VFIXUPIMMSS __m128 _mm_fixupimm_round_ss( __m128 a, __m128 b, __m128i c, int imm8, int sae);
VFIXUPIMMSS __m128 _mm_mask_fixupimm_round_ss(__m128 a, __mmask8 k, __m128 b, __m128i c, int imm8, int sae);
VFIXUPIMMSS __m128 _mm_maskz_fixupimm_round_ss( __mmask8 k, __m128 a, __m128 b, __m128i c, int imm8, int sae);



SIMD Floating-Point Exceptions

Zero, Invalid.

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VFMADD132PD/VFMADD213PD/VFMADD231PD—Fused Multiply-Add of Packed Double
Precision Floating-Point Values
VEX.128.66.0F38.W1 98 /r
VFMADD132PD xmm1, xmm2, xmm3/m128
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from xmm1 and xmm3/mem, add to xmm2
and put result in xmm1.

VEX.128.66.0F38.W1 A8 /r
VFMADD213PD xmm1, xmm2, xmm3/m128
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from xmm1 and xmm2, add to xmm3/mem
and put result in xmm1.

VEX.128.66.0F38.W1 B8 /r
VFMADD231PD xmm1, xmm2, xmm3/m128
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from xmm2 and xmm3/mem, add to xmm1
and put result in xmm1.

VEX.256.66.0F38.W1 98 /r
VFMADD132PD ymm1, ymm2, ymm3/m256
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from ymm1 and ymm3/mem, add to ymm2
and put result in ymm1.

VEX.256.66.0F38.W1 A8 /r
VFMADD213PD ymm1, ymm2, ymm3/m256
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from ymm1 and ymm2, add to ymm3/mem
and put result in ymm1.

VEX.256.66.0F38.W1 B8 /r
VFMADD231PD ymm1, ymm2, ymm3/m256
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: FMA.
Multiply packed double precision floating-point values from ymm2 and ymm3/mem, add to ymm1
and put result in ymm1.

EVEX.128.66.0F38.W1 98 /r
VFMADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst,
add to xmm2 and put result in xmm1.

EVEX.128.66.0F38.W1 A8 /r
VFMADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from xmm1 and xmm2, add to
xmm3/m128/m64bcst and put result in xmm1.

EVEX.128.66.0F38.W1 B8 /r
VFMADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst,
add to xmm1 and put result in xmm1.

EVEX.256.66.0F38.W1 98 /r
VFMADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst,
add to ymm2 and put result in ymm1.

EVEX.256.66.0F38.W1 A8 /r
VFMADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from ymm1 and ymm2, add to
ymm3/m256/m64bcst and put result in ymm1.

EVEX.256.66.0F38.W1 B8 /r
VFMADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst,
add to ymm1 and put result in ymm1.

EVEX.512.66.0F38.W1 98 /r
VFMADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst,
add to zmm2 and put result in zmm1.

EVEX.512.66.0F38.W1 A8 /r
VFMADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from zmm1 and zmm2, add to
zmm3/m512/m64bcst and put result in zmm1.

EVEX.512.66.0F38.W1 B8 /r
VFMADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst,
add to zmm1 and put result in zmm1.



NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a set of SIMD multiply-add computations on packed double precision floating-point values using three
source operands and writes the multiply-add results in the destination operand. The destination operand is also the
first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD
register or a memory location.
VFMADD132PD: Multiplies the two, four or eight packed double precision floating-point values from the first source
operand to the two, four or eight packed double precision floating-point values in the third source operand, adds
the infinite precision intermediate result to the two, four or eight packed double precision floating-point values in
the second source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
VFMADD213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand,
adds the infinite precision intermediate result to the two, four or eight packed double precision floating-point
values in the third source operand, performs rounding and stores the resulting two, four or eight packed double
precision floating-point values to the destination operand (first source operand).
VFMADD231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand, adds the
infinite precision intermediate result to the two, four or eight packed double precision floating-point values in the
first source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) is a ZMM register and encoded in
reg_field. The second source operand is a ZMM register and encoded in EVEX.vvvv. The third source operand is a
ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The
destination operand is conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
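In practice the three forms exist so that a compiler can overwrite whichever register is dead. A sketch (FMA-capable compiler assumed, e.g., built with -mfma): in the kernel below the accumulator is both an input and the result, which maps naturally onto the 231 form (dest := src2*src3 + dest); the compiler chooses the encoding from the single fmadd intrinsic:

#include <immintrin.h>
#include <stddef.h>

/* Four-wide fused multiply-accumulate: acc := a*b + acc per element. */
static __m256d fma_dot4(const double *a, const double *b, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                              _mm256_loadu_pd(b + i), acc);
    return acc;   /* caller reduces the four lanes */
}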



Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMADD132PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] + SRC2[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADD213PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] + SRC3[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADD231PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] + DEST[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI



VFMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VFMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMADDxxxPD __m512d _mm512_fmadd_pd(__m512d a, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_fmadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_mask_fmadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_maskz_fmadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_mask3_fmadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFMADDxxxPD __m512d _mm512_mask_fmadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_maskz_fmadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_mask3_fmadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFMADDxxxPD __m256d _mm256_mask_fmadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFMADDxxxPD __m256d _mm256_maskz_fmadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFMADDxxxPD __m256d _mm256_mask3_fmadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFMADDxxxPD __m128d _mm_mask_fmadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMADDxxxPD __m128d _mm_maskz_fmadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMADDxxxPD __m128d _mm_mask3_fmadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMADDxxxPD __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c);
VFMADDxxxPD __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c);
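
The packed forms are typically reached through these intrinsics rather than by selecting a 132/213/231 variant directly; the compiler picks the form that best fits register allocation. A minimal, non-normative usage sketch follows, assuming a compiler with FMA support (e.g., -mfma on GCC or Clang; link with -lm for the scalar fma() used as a reference):

/* Sketch: packed double precision FMA via _mm256_fmadd_pd, checked
   against scalar fma() from <math.h>, which also rounds only once. */
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {5.0, 6.0, 7.0, 8.0};
    double z[4] = {0.0625, 0.125, 0.25, 0.5};
    __m256d a = _mm256_loadu_pd(x);
    __m256d b = _mm256_loadu_pd(y);
    __m256d c = _mm256_loadu_pd(z);
    /* One fused multiply-add per element: r[i] = x[i]*y[i] + z[i],
       rounded once according to MXCSR.RC. */
    __m256d r = _mm256_fmadd_pd(a, b, c);
    double out[4];
    _mm256_storeu_pd(out, r);
    for (int i = 0; i < 4; i++)
        printf("%g (scalar fma: %g)\n", out[i], fma(x[i], y[i], z[i]));
    return 0;
}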

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single
Precision Floating-Point Values
(Each entry lists: Opcode/Instruction; Op/En; 64/32 bit mode support; CPUID feature flag; description.)

VEX.128.66.0F38.W0 98 /r VFMADD132PS xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1.
VEX.128.66.0F38.W0 A8 /r VFMADD213PS xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from xmm1 and xmm2, add to xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W0 B8 /r VFMADD231PS xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from xmm2 and xmm3/mem, add to xmm1 and put result in xmm1.
VEX.256.66.0F38.W0 98 /r VFMADD132PS ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from ymm1 and ymm3/mem, add to ymm2 and put result in ymm1.
VEX.256.66.0F38.W0 A8 /r VFMADD213PS ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from ymm1 and ymm2, add to ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W0 B8 /r VFMADD231PS ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed single precision floating-point values from ymm2 and ymm3/mem, add to ymm1 and put result in ymm1.
EVEX.128.66.0F38.W0 98 /r VFMADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, add to xmm2 and put result in xmm1.
EVEX.128.66.0F38.W0 A8 /r VFMADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from xmm1 and xmm2, add to xmm3/m128/m32bcst and put result in xmm1.
EVEX.128.66.0F38.W0 B8 /r VFMADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, add to xmm1 and put result in xmm1.
EVEX.256.66.0F38.W0 98 /r VFMADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, add to ymm2 and put result in ymm1.
EVEX.256.66.0F38.W0 A8 /r VFMADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from ymm1 and ymm2, add to ymm3/m256/m32bcst and put result in ymm1.
EVEX.256.66.0F38.W0 B8 /r VFMADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, add to ymm1 and put result in ymm1.
EVEX.512.66.0F38.W0 98 /r VFMADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, add to zmm2 and put result in zmm1.
EVEX.512.66.0F38.W0 A8 /r VFMADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from zmm1 and zmm2, add to zmm3/m512/m32bcst and put result in zmm1.
EVEX.512.66.0F38.W0 B8 /r VFMADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, add to zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at
run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and
as such will determine the set of instructions available to the programmer listed in the above opcode table.
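
A sketch of that run-time check using the GCC/Clang <cpuid.h> helpers follows. The bit layout assumed here (AVX10 enumerated by CPUID.(EAX=07H,ECX=01H):EDX[19]; Leaf 24H EBX[7:0] giving the converged vector ISA version and EBX[16], EBX[17], EBX[18] flagging 128-, 256-, and 512-bit vector support) should be verified against the Intel AVX10 architecture specification before use:

/* Non-normative detection sketch. The bit positions below are assumptions
   taken from the AVX10 specification, not from this page. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) ||
        !(edx & (1u << 19))) {          /* assumed AVX10 feature bit */
        puts("AVX10 not enumerated");
        return 0;
    }
    __get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx);
    printf("AVX10 version %u; 128-bit:%u 256-bit:%u 512-bit:%u\n",
           ebx & 0xFF, (ebx >> 16) & 1, (ebx >> 17) & 1, (ebx >> 18) & 1);
    return 0;
}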

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a set of SIMD multiply-add computations on packed single precision floating-point values using three
source operands and writes the multiply-add results in the destination operand. The destination operand is also the
first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD
register or a memory location.
VFMADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
VFMADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
VFMADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers, encoded in reg_field and EVEX.vvvv respectively. The third source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The
destination operand is conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
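
Before the formal pseudocode, a non-normative scalar model of the three operand orderings may help; each packed lane behaves like one call below, with fmaf() from <math.h> supplying the single rounding of a fused multiply-add (link with -lm):

/* Illustrative scalar model of the 132/213/231 operand orderings (one lane). */
#include <math.h>
#include <stdio.h>

static float fmadd132(float dest, float src2, float src3)
{ return fmaf(dest, src3, src2); }  /* dest = dest*src3 + src2 */

static float fmadd213(float dest, float src2, float src3)
{ return fmaf(src2, dest, src3); }  /* dest = src2*dest + src3 */

static float fmadd231(float dest, float src2, float src3)
{ return fmaf(src2, src3, dest); }  /* dest = src2*src3 + dest */

int main(void)
{
    /* Same three inputs, three different roles. */
    printf("%g %g %g\n", fmadd132(2.0f, 3.0f, 4.0f),   /* 2*4+3 = 11 */
                         fmadd213(2.0f, 3.0f, 4.0f),   /* 3*2+4 = 10 */
                         fmadd231(2.0f, 3.0f, 4.0f));  /* 3*4+2 = 14 */
    return 0;
}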

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMADD132PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 4
ELSEIF (VEX.256)
MAXNUM := 8
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADD213PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 4
ELSEIF (VEX.256)
MAXNUM := 8
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] + SRC3[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADD231PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 4
ELSEIF (VEX.256)
MAXNUM := 8
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMADDxxxPS __m512 _mm512_fmadd_ps(__m512 a, __m512 b, __m512 c);
VFMADDxxxPS __m512 _mm512_fmadd_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFMADDxxxPS __m512 _mm512_mask_fmadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFMADDxxxPS __m512 _mm512_maskz_fmadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFMADDxxxPS __m512 _mm512_mask3_fmadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFMADDxxxPS __m512 _mm512_mask_fmadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFMADDxxxPS __m512 _mm512_maskz_fmadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFMADDxxxPS __m512 _mm512_mask3_fmadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFMADDxxxPS __m256 _mm256_mask_fmadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFMADDxxxPS __m256 _mm256_maskz_fmadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFMADDxxxPS __m256 _mm256_mask3_fmadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFMADDxxxPS __m128 _mm_mask_fmadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMADDxxxPS __m128 _mm_maskz_fmadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMADDxxxPS __m128 _mm_mask3_fmadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMADDxxxPS __m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c);
VFMADDxxxPS __m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c);
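
A non-normative sketch of EVEX merging-masking through the intrinsics above; lanes whose k1 bit is 0 keep the prior destination value, matching the *merging-masking* branch in the Operation section. Assumes AVX512F and AVX512VL (or AVX10) hardware and, e.g., -mavx512vl:

/* Merging-masking sketch: only lanes selected by k are recomputed. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);
    __mmask8 k = 0x0F;                           /* update low four lanes */
    /* Lane i: k[i] ? a[i]*b[i] + c[i] : a[i] (merge from first source). */
    __m256 r = _mm256_mask_fmadd_ps(a, k, b, c);
    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("lane %d: %g\n", i, out[i]);      /* 7 7 7 7 2 2 2 2 */
    return 0;
}

The corresponding _maskz_ intrinsics instead force masked-off lanes to zero, matching the {z} zeroing-masking branch.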

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMADD132SD/VFMADD213SD/VFMADD231SD—Fused Multiply-Add of Scalar Double
Precision Floating-Point Values
(Each entry lists: Opcode/Instruction; Op/En; 64/32 bit mode support; CPUID feature flag; description.)

VEX.LIG.66.0F38.W1 99 /r VFMADD132SD xmm1, xmm2, xmm3/m64
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, add to xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W1 A9 /r VFMADD213SD xmm1, xmm2, xmm3/m64
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar double precision floating-point value from xmm1 and xmm2, add to xmm3/m64 and put result in xmm1.
VEX.LIG.66.0F38.W1 B9 /r VFMADD231SD xmm1, xmm2, xmm3/m64
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, add to xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 99 /r VFMADD132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, add to xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 A9 /r VFMADD213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar double precision floating-point value from xmm1 and xmm2, add to xmm3/m64 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 B9 /r VFMADD231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, add to xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor
at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD multiply-add computation on the low double precision floating-point values using three source
operands and writes the multiply-add result in the destination operand. The destination operand is also the first
source operand. The first and second operands are XMM registers. The third source operand can be an XMM register
or a 64-bit memory location.
VFMADD132SD: Multiplies the low double precision floating-point value from the first source operand to the low
double precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low double precision floating-point value in the second source operand, performs rounding and stores the
resulting double precision floating-point value to the destination operand (first source operand).
VFMADD213SD: Multiplies the low double precision floating-point value from the second source operand to the low
double precision floating-point value in the first source operand, adds the infinite precision intermediate result to
the low double precision floating-point value in the third source operand, performs rounding and stores the
resulting double precision floating-point value to the destination operand (first source operand).
VFMADD231SD: Multiplies the low double precision floating-point value from the second source operand to the low double
precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low
double precision floating-point value in the first source operand, performs rounding and stores the resulting double
precision floating-point value to the destination operand (first source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMADD132SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMADD213SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMADD231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMADD132SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMADD213SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMADD231SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMADDxxxSD __m128d _mm_fmadd_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_mask_fmadd_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMADDxxxSD __m128d _mm_maskz_fmadd_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMADDxxxSD __m128d _mm_mask3_fmadd_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMADDxxxSD __m128d _mm_mask_fmadd_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_maskz_fmadd_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_mask3_fmadd_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFMADDxxxSD __m128d _mm_fmadd_sd (__m128d a, __m128d b, __m128d c);
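
A non-normative sketch of the scalar-form behavior: only the low element is computed, and bits 127:64 of the destination are carried from the first source, as described above. Assumes FMA support (e.g., -mfma):

/* Scalar-form sketch: low double computed, upper double passed through. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(99.0, 2.0);   /* high = 99.0, low = 2.0 */
    __m128d b = _mm_set_pd(-1.0, 3.0);
    __m128d c = _mm_set_pd(-1.0, 4.0);
    /* Low element: 2*3 + 4 = 10; high element: copied from a (99). */
    __m128d r = _mm_fmadd_sd(a, b, c);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("low=%g high=%g\n", out[0], out[1]);
    return 0;
}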

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VFMADD132SS/VFMADD213SS/VFMADD231SS—Fused Multiply-Add of Scalar Single Precision
Floating-Point Values
(Each entry lists: Opcode/Instruction; Op/En; 64/32 bit mode support; CPUID feature flag; description.)

VEX.LIG.66.0F38.W0 99 /r VFMADD132SS xmm1, xmm2, xmm3/m32
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, add to xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W0 A9 /r VFMADD213SS xmm1, xmm2, xmm3/m32
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar single precision floating-point value from xmm1 and xmm2, add to xmm3/m32 and put result in xmm1.
VEX.LIG.66.0F38.W0 B9 /r VFMADD231SS xmm1, xmm2, xmm3/m32
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, add to xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 99 /r VFMADD132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, add to xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 A9 /r VFMADD213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar single precision floating-point value from xmm1 and xmm2, add to xmm3/m32 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 B9 /r VFMADD231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, add to xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor
at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD multiply-add computation on the low single precision floating-point values using three source
operands and writes the multiply-add result in the destination operand. The destination operand is also the first
source operand. The first and second operands are XMM registers. The third source operand can be an XMM register
or a 32-bit memory location.
VFMADD132SS: Multiplies the low single precision floating-point value from the first source operand to the low
single precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the second source operand, performs rounding and stores the
resulting single precision floating-point value to the destination operand (first source operand).
VFMADD213SS: Multiplies the low single precision floating-point value from the second source operand to the low
single precision floating-point value in the first source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the third source operand, performs rounding and stores the resulting
single precision floating-point value to the destination operand (first source operand).
VFMADD231SS: Multiplies the low single precision floating-point value from the second source operand to the low
single precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the first source operand, performs rounding and stores the resulting
single precision floating-point value to the destination operand (first source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMADD132SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(DEST[31:0]*SRC3[31:0] + SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMADD213SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(SRC2[31:0]*DEST[31:0] + SRC3[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMADD231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(SRC2[31:0]*SRC3[31:0] + DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMADD132SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] + SRC2[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMADD213SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] + SRC3[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMADD231SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] + DEST[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMADDxxxSS __m128 _mm_fmadd_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_mask_fmadd_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMADDxxxSS __m128 _mm_maskz_fmadd_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMADDxxxSS __m128 _mm_mask3_fmadd_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMADDxxxSS __m128 _mm_mask_fmadd_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_maskz_fmadd_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_mask3_fmadd_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFMADDxxxSS __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c);
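
A non-normative sketch of the {er} embedded-rounding behavior through _mm_fmadd_round_ss; because 0.1f*0.1f is inexact in binary32, the two directed roundings differ by one unit in the last place. Assumes AVX512F (or AVX10) hardware and, e.g., -mavx512f:

/* Embedded rounding: the mode comes from the instruction, not MXCSR, and
   _MM_FROUND_NO_EXC suppresses exception reporting. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(0.1f), b = _mm_set_ss(0.1f), c = _mm_set_ss(0.0f);
    __m128 dn = _mm_fmadd_round_ss(a, b, c,
                    _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    __m128 up = _mm_fmadd_round_ss(a, b, c,
                    _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    /* The two results bracket the exact product and differ by one ulp. */
    printf("%.9g %.9g\n", _mm_cvtss_f32(dn), _mm_cvtss_f32(up));
    return 0;
}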

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating
Add/Subtract of Packed Double Precision Floating-Point Values
(Each entry lists: Opcode/Instruction; Op/En; 64/32 bit mode support; CPUID feature flag; description.)

VEX.128.66.0F38.W1 96 /r VFMADDSUB132PD xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from xmm1 and xmm3/mem, add/subtract elements in xmm2 and put result in xmm1.
VEX.128.66.0F38.W1 A6 /r VFMADDSUB213PD xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W1 B6 /r VFMADDSUB231PD xmm1, xmm2, xmm3/m128
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from xmm2 and xmm3/mem, add/subtract elements in xmm1 and put result in xmm1.
VEX.256.66.0F38.W1 96 /r VFMADDSUB132PD ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from ymm1 and ymm3/mem, add/subtract elements in ymm2 and put result in ymm1.
VEX.256.66.0F38.W1 A6 /r VFMADDSUB213PD ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W1 B6 /r VFMADDSUB231PD ymm1, ymm2, ymm3/m256
  Op/En: A. 64/32 bit mode: V/V. CPUID: FMA.
  Multiply packed double precision floating-point values from ymm2 and ymm3/mem, add/subtract elements in ymm1 and put result in ymm1.
EVEX.128.66.0F38.W1 A6 /r VFMADDSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W1 B6 /r VFMADDSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, add/subtract elements in xmm1 and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W1 96 /r VFMADDSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, add/subtract elements in xmm2 and put result in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W1 A6 /r VFMADDSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W1 B6 /r VFMADDSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, add/subtract elements in ymm1 and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W1 96 /r VFMADDSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B. 64/32 bit mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, add/subtract elements in ymm2 and put result in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W1 A6 /r VFMADDSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from zmm1 and zmm2, add/subtract elements in zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W1 B6 /r VFMADDSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, add/subtract elements in zmm1 and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W1 96 /r VFMADDSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B. 64/32 bit mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, add/subtract elements in zmm2 and put result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor
at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
VFMADDSUB132PD: Multiplies the two, four, or eight packed double precision floating-point values from the first
source operand to the two, four, or eight packed double precision floating-point values in the third source operand. From
the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts the
even double precision floating-point values in the second source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMADDSUB213PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the first source operand. From
the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts the
even double precision floating-point values in the third source operand, performs rounding and stores the resulting
two, four, or eight packed double precision floating-point values to the destination operand (first source operand).
VFMADDSUB231PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the third source operand. From
the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts the
even double precision floating-point values in the first source operand, performs rounding and stores the resulting
two, four, or eight packed double precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location,
or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally
updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
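
A non-normative scalar model of the even/odd rule for the 132 form (element 0 is the lowest): even-indexed elements subtract the second source, odd-indexed elements add it. Since negating an IEEE operand is exact, a*b - c can be written as fma(a, b, -c); link with -lm:

/* Illustrative scalar model of VFMADDSUB132PD across n elements. */
#include <math.h>
#include <stdio.h>

static void fmaddsub132_pd(double *dest, const double *src2,
                           const double *src3, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = (i % 2 == 0) ? fma(dest[i], src3[i], -src2[i])  /* even: sub */
                               : fma(dest[i], src3[i],  src2[i]); /* odd: add */
}

int main(void)
{
    double d[4]  = {2.0, 2.0, 2.0, 2.0};
    double s2[4] = {1.0, 1.0, 1.0, 1.0};
    double s3[4] = {3.0, 3.0, 3.0, 3.0};
    fmaddsub132_pd(d, s2, s3, 4);
    printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);  /* 5 7 5 7 */
    return 0;
}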

Operation
In the operations below, “*”, “+”, and “-” symbols represent multiplication, addition, and subtraction with infinite precision inputs and outputs (no
rounding).

VFMADDSUB132PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
DEST[127:64] := RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
DEST[127:64] := RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
DEST[191:128] := RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] - SRC2[191:128])
DEST[255:192] := RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] + SRC2[255:192])
FI

VFMADDSUB213PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
DEST[191:128] := RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] - SRC3[191:128])
DEST[255:192] := RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] + SRC3[255:192])
FI

VFMADDSUB231PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
DEST[191:128] := RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] - DEST[191:128])
DEST[255:192] := RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] + DEST[255:192])
FI

VFMADDSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMADDSUBxxxPD __m512d _mm512_fmaddsub_pd(__m512d a, __m512d b, __m512d c);
VFMADDSUBxxxPD __m512d _mm512_fmaddsub_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFMADDSUBxxxPD __m512d _mm512_mask_fmaddsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFMADDSUBxxxPD __m512d _mm512_maskz_fmaddsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFMADDSUBxxxPD __m512d _mm512_mask3_fmaddsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFMADDSUBxxxPD __m512d _mm512_mask_fmaddsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFMADDSUBxxxPD __m512d _mm512_maskz_fmaddsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFMADDSUBxxxPD __m512d _mm512_mask3_fmaddsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFMADDSUBxxxPD __m256d _mm256_mask_fmaddsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFMADDSUBxxxPD __m256d _mm256_maskz_fmaddsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFMADDSUBxxxPD __m256d _mm256_mask3_fmaddsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFMADDSUBxxxPD __m128d _mm_mask_fmaddsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMADDSUBxxxPD __m128d _mm_maskz_fmaddsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMADDSUBxxxPD __m128d _mm_mask3_fmaddsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMADDSUBxxxPD __m128d _mm_fmaddsub_pd (__m128d a, __m128d b, __m128d c);
VFMADDSUBxxxPD __m256d _mm256_fmaddsub_pd (__m256d a, __m256d b, __m256d c);
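
A minimal, non-normative usage sketch of the alternating behavior through _mm_fmaddsub_pd; assumes FMA support (e.g., -mfma):

/* Even element subtracts, odd element adds. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(2.0, 2.0);
    __m128d b = _mm_set_pd(3.0, 3.0);
    __m128d c = _mm_set_pd(1.0, 1.0);
    /* Element 0 (even): 2*3 - 1 = 5; element 1 (odd): 2*3 + 1 = 7. */
    __m128d r = _mm_fmaddsub_pd(a, b, c);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("element0=%g element1=%g\n", out[0], out[1]);
    return 0;
}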

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMADDSUB132PH/VFMADDSUB213PH/VFMADDSUB231PH—Fused Multiply-Alternating
Add/Subtract of Packed FP16 Values
(Each entry lists: Opcode/Instruction; Op/En; 64/32 bit mode support; CPUID feature flag; description.)

EVEX.128.66.MAP6.W0 96 /r VFMADDSUB132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, add/subtract elements in xmm2, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 96 /r VFMADDSUB132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, add/subtract elements in ymm2, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 96 /r VFMADDSUB132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A. 64/32 bit mode: V/V. CPUID: AVX512-FP16 OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, add/subtract elements in zmm2, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 A6 /r VFMADDSUB213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from xmm1 and xmm2, add/subtract elements in xmm3/m128/m16bcst, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 A6 /r VFMADDSUB213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from ymm1 and ymm2, add/subtract elements in ymm3/m256/m16bcst, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 A6 /r VFMADDSUB213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A. 64/32 bit mode: V/V. CPUID: AVX512-FP16 OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from zmm1 and zmm2, add/subtract elements in zmm3/m512/m16bcst, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 B6 /r VFMADDSUB231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, add/subtract elements in xmm1, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 B6 /r VFMADDSUB231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A. 64/32 bit mode: V/V. CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, add/subtract elements in ymm1, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 B6 /r VFMADDSUB231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A. 64/32 bit mode: V/V. CPUID: AVX512-FP16 OR AVX10.1 (see Note 1).
  Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, add/subtract elements in zmm1, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor
at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a packed multiply-add (odd elements) or multiply-subtract (even elements) computation
on FP16 values using three source operands and writes the results in the destination operand. The destination
operand is also the first source operand. The notations “132”, “213”, and “231” indicate the use of the operands in A
* B ± C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-9.
The destination elements are updated according to the writemask.

Table 5-9. VFMADDSUB[132,213,231]PH Notation for Odd and Even Elements

Notation    Odd Elements               Even Elements
132         dest = dest*src3+src2      dest = dest*src3-src2
231         dest = src2*src3+dest      dest = src2*src3-dest
213         dest = src2*dest+src3      dest = src2*dest-src3
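
A non-normative reference model of Table 5-9 follows, with float standing in for FP16 (an assumption: portable C has no FP16 type, so the results are not bit-exact FP16):

/* Illustrative per-element model; form selects the 132/213/231 ordering
   and j is the element index (even elements subtract). */
#include <math.h>

static float fmaddsub_ph_lane(int form, int j,
                              float dest, float src2, float src3)
{
    float s = (j % 2 == 0) ? -1.0f : 1.0f;  /* exact sign flip */
    switch (form) {
    case 132: return fmaf(dest, src3, s * src2);
    case 213: return fmaf(src2, dest, s * src3);
    case 231: return fmaf(src2, src3, s * dest);
    default:  return dest;
    }
}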

Operation
VFMADDSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * SRC3.fp16[j] - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * SRC3.fp16[j] + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMADDSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMADDSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMADDSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3)
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMADDSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * SRC3.fp16[j] - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * SRC3.fp16[j] + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMADDSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMADDSUB132PH, VFMADDSUB213PH, and VFMADDSUB231PH:
__m128h _mm_fmaddsub_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmaddsub_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmaddsub_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmaddsub_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fmaddsub_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fmaddsub_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fmaddsub_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fmaddsub_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fmaddsub_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fmaddsub_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fmaddsub_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fmaddsub_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fmaddsub_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fmaddsub_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fmaddsub_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fmaddsub_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);
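
A minimal usage sketch of the unmasked 512-bit intrinsic follows; it assumes a compiler and target with AVX512-FP16 support (for example, GCC/Clang with -mavx512fp16, where the _Float16 type and the _mm512_loadu_ph/_mm512_storeu_ph helpers are available):

#include <immintrin.h>

/* r[j] = a[j]*b[j] + c[j] for odd j, a[j]*b[j] - c[j] for even j. */
void fmaddsub_ph_demo(const _Float16 *a, const _Float16 *b,
                      const _Float16 *c, _Float16 *r)
{
    __m512h va = _mm512_loadu_ph(a);    /* 32 FP16 elements each */
    __m512h vb = _mm512_loadu_ph(b);
    __m512h vc = _mm512_loadu_ph(c);
    _mm512_storeu_ph(r, _mm512_fmaddsub_ph(va, vb, vc));
}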

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating
Add/Subtract of Packed Single Precision Floating-Point Values
Opcode/Instruction
  Op/En | 64/32 bit Mode Support | CPUID Feature Flag
  Description

VEX.128.66.0F38.W0 96 /r VFMADDSUB132PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm1 and xmm3/mem, add/subtract elements in xmm2 and put result in xmm1.

VEX.128.66.0F38.W0 A6 /r VFMADDSUB213PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/mem and put result in xmm1.

VEX.128.66.0F38.W0 B6 /r VFMADDSUB231PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm2 and xmm3/mem, add/subtract elements in xmm1 and put result in xmm1.

VEX.256.66.0F38.W0 96 /r VFMADDSUB132PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm1 and ymm3/mem, add/subtract elements in ymm2 and put result in ymm1.

VEX.256.66.0F38.W0 A6 /r VFMADDSUB213PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/mem and put result in ymm1.

VEX.256.66.0F38.W0 B6 /r VFMADDSUB231PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm2 and ymm3/mem, add/subtract elements in ymm1 and put result in ymm1.

EVEX.128.66.0F38.W0 A6 /r VFMADDSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/m128/m32bcst and put result in xmm1 subject to writemask k1.

EVEX.128.66.0F38.W0 B6 /r VFMADDSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, add/subtract elements in xmm1 and put result in xmm1 subject to writemask k1.

EVEX.128.66.0F38.W0 96 /r VFMADDSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, add/subtract elements in xmm2 and put result in xmm1 subject to writemask k1.

EVEX.256.66.0F38.W0 A6 /r VFMADDSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/m256/m32bcst and put result in ymm1 subject to writemask k1.

EVEX.256.66.0F38.W0 B6 /r VFMADDSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, add/subtract elements in ymm1 and put result in ymm1 subject to writemask k1.

EVEX.256.66.0F38.W0 96 /r VFMADDSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, add/subtract elements in ymm2 and put result in ymm1 subject to writemask k1.

EVEX.512.66.0F38.W0 A6 /r VFMADDSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm1 and zmm2, add/subtract elements in zmm3/m512/m32bcst and put result in zmm1 subject to writemask k1.

EVEX.512.66.0F38.W0 B6 /r VFMADDSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, add/subtract elements in zmm1 and put result in zmm1 subject to writemask k1.

EVEX.512.66.0F38.W0 96 /r VFMADDSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, add/subtract elements in zmm2 and put result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
VFMADDSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the corresponding packed single precision floating-point values in the third source operand.
From the infinite precision intermediate result, adds the odd single precision floating-point elements and subtracts
the even single precision floating-point values in the second source operand, performs rounding and stores the
resulting packed single precision floating-point values to the destination operand (first source operand).
VFMADDSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, adds the odd single precision floating-point elements and
subtracts the even single precision floating-point values in the third source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
VFMADDSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, adds the odd single precision floating-point elements and
subtracts the even single precision floating-point values in the first source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of a complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
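
Before the formal pseudocode below, this compact C model (illustrative only; masking and fused rounding omitted) shows how an EVEX embedded broadcast (EVEX.b = 1 with a 32-bit memory source) feeds the alternating add/subtract: every element reuses the single broadcast element src3[0].

#include <stddef.h>

/* Illustrative model of the memory-source form of the 132 pattern. */
static void fmaddsub132_bcst_ref(float *dest, const float *src2,
                                 const float *src3, size_t kl, int evex_b)
{
    for (size_t j = 0; j < kl; j++) {
        float t3 = evex_b ? src3[0] : src3[j];   /* broadcast vs. per-element */
        dest[j] = (j % 2 == 0) ? dest[j] * t3 - src2[j]
                               : dest[j] * t3 + src2[j];
    }
}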

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMADDSUB132PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n])
DEST[n+63:n+32] := RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] + SRC2[n+63:n+32])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADDSUB213PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n])
DEST[n+63:n+32] := RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] + SRC3[n+63:n+32])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADDSUB231PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n])
DEST[n+63:n+32] :=RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] + DEST[n+63:n+32])
}

IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN

DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
FI

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*

THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMADDSUBxxxPS __m512 _mm512_fmaddsub_ps(__m512 a, __m512 b, __m512 c);
VFMADDSUBxxxPS __m512 _mm512_fmaddsub_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFMADDSUBxxxPS __m512 _mm512_mask_fmaddsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFMADDSUBxxxPS __m512 _mm512_maskz_fmaddsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFMADDSUBxxxPS __m512 _mm512_mask3_fmaddsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFMADDSUBxxxPS __m512 _mm512_mask_fmaddsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFMADDSUBxxxPS __m512 _mm512_maskz_fmaddsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFMADDSUBxxxPS __m512 _mm512_mask3_fmaddsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFMADDSUBxxxPS __m256 _mm256_mask_fmaddsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFMADDSUBxxxPS __m256 _mm256_maskz_fmaddsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFMADDSUBxxxPS __m256 _mm256_mask3_fmaddsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFMADDSUBxxxPS __m128 _mm_mask_fmaddsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMADDSUBxxxPS __m128 _mm_maskz_fmaddsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);

VFMADDSUBxxxPS __m128 _mm_mask3_fmaddsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMADDSUBxxxPS __m128 _mm_fmaddsub_ps (__m128 a, __m128 b, __m128 c);
VFMADDSUBxxxPS __m256 _mm256_fmaddsub_ps (__m256 a, __m256 b, __m256 c);
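
A short usage sketch of the unmasked AVX form (a hypothetical demo, not from this manual; assumes an FMA-capable target such as GCC/Clang with -mfma):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);
    float r[8];
    _mm256_storeu_ps(r, _mm256_fmaddsub_ps(a, b, c));
    /* Even elements: 2*3 - 1 = 5; odd elements: 2*3 + 1 = 7. */
    printf("%g %g\n", r[0], r[1]);   /* prints: 5 7 */
    return 0;
}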

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed Double
Precision Floating-Point Values
Opcode/Instruction
  Op/En | 64/32 bit Mode Support | CPUID Feature Flag
  Description

VEX.128.66.0F38.W1 9A /r VFMSUB132PD xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed double precision floating-point values from xmm1 and xmm3/mem, subtract xmm2 and put result in xmm1.

VEX.128.66.0F38.W1 AA /r VFMSUB213PD xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed double precision floating-point values from xmm1 and xmm2, subtract xmm3/mem and put result in xmm1.

VEX.128.66.0F38.W1 BA /r VFMSUB231PD xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed double precision floating-point values from xmm2 and xmm3/mem, subtract xmm1 and put result in xmm1.

VEX.256.66.0F38.W1 9A /r VFMSUB132PD ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed double precision floating-point values from ymm1 and ymm3/mem, subtract ymm2 and put result in ymm1.

VEX.256.66.0F38.W1 AA /r VFMSUB213PD ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed double precision floating-point values from ymm1 and ymm2, subtract ymm3/mem and put result in ymm1.

VEX.256.66.0F38.W1 BA /r VFMSUB231PD ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed double precision floating-point values from ymm2 and ymm3/mem, subtract ymm1 and put result in ymm1.

EVEX.128.66.0F38.W1 9A /r VFMSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, subtract xmm2 and put result in xmm1 subject to writemask k1.

EVEX.128.66.0F38.W1 AA /r VFMSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from xmm1 and xmm2, subtract xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1.

EVEX.128.66.0F38.W1 BA /r VFMSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, subtract xmm1 and put result in xmm1 subject to writemask k1.

EVEX.256.66.0F38.W1 9A /r VFMSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, subtract ymm2 and put result in ymm1 subject to writemask k1.

EVEX.256.66.0F38.W1 AA /r VFMSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from ymm1 and ymm2, subtract ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1.

EVEX.256.66.0F38.W1 BA /r VFMSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, subtract ymm1 and put result in ymm1 subject to writemask k1.

EVEX.512.66.0F38.W1 9A /r VFMSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, subtract zmm2 and put result in zmm1 subject to writemask k1.

EVEX.512.66.0F38.W1 AA /r VFMSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed double precision floating-point values from zmm1 and zmm2, subtract zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1.

EVEX.512.66.0F38.W1 BA /r VFMSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, subtract zmm1 and put result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a set of SIMD multiply-subtract computation on packed double precision floating-point values using three
source operands and writes the multiply-subtract results in the destination operand. The destination operand is
also the first source operand. The second operand must be a SIMD register. The third source operand can be a
SIMD register or a memory location.
VFMSUB132PD: Multiplies the two, four or eight packed double precision floating-point values from the first source
operand to the two, four or eight packed double precision floating-point values in the third source operand. From
the infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-point
values in the second source operand, performs rounding and stores the resulting two, four or eight packed double
precision floating-point values to the destination operand (first source operand).
VFMSUB213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand.
From the infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-
point values in the third source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFMSUB231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand. From the
infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-point values
in the first source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
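
The per-element computation can also be expressed with the C99 fma() function, which, like these instructions, performs the multiply and subtract with a single rounding. A minimal reference model of the 231 form follows (an illustrative sketch; masking, broadcast, and rounding-mode overrides omitted):

#include <math.h>
#include <stddef.h>

/* dest[i] = src2[i]*src3[i] - dest[i], fused into one rounding. */
static void fmsub231pd_ref(double *dest, const double *src2,
                           const double *src3, size_t kl)
{
    for (size_t i = 0; i < kl; i++)
        dest[i] = fma(src2[i], src3[i], -dest[i]);
}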

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFMSUB132PD DEST, SRC2, SRC3 (VEX Encoded Versions)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] - SRC2[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUB213PD DEST, SRC2, SRC3 (VEX Encoded Versions)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] - SRC3[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUB231PD DEST, SRC2, SRC3 (VEX Encoded Versions)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] - DEST[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)

DEST[MAXVL-1:256] := 0
FI

VFMSUB132PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB132PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB213PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB213PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB231PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB231PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMSUBxxxPD __m512d _mm512_fmsub_pd(__m512d a, __m512d b, __m512d c);
VFMSUBxxxPD __m512d _mm512_fmsub_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFMSUBxxxPD __m512d _mm512_mask_fmsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFMSUBxxxPD __m512d _mm512_maskz_fmsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFMSUBxxxPD __m512d _mm512_mask3_fmsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFMSUBxxxPD __m512d _mm512_mask_fmsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFMSUBxxxPD __m512d _mm512_maskz_fmsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFMSUBxxxPD __m512d _mm512_mask3_fmsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFMSUBxxxPD __m256d _mm256_mask_fmsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFMSUBxxxPD __m256d _mm256_maskz_fmsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFMSUBxxxPD __m256d _mm256_mask3_fmsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFMSUBxxxPD __m128d _mm_mask_fmsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMSUBxxxPD __m128d _mm_maskz_fmsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMSUBxxxPD __m128d _mm_mask3_fmsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMSUBxxxPD __m128d _mm_fmsub_pd (__m128d a, __m128d b, __m128d c);
VFMSUBxxxPD __m256d _mm256_fmsub_pd (__m256d a, __m256d b, __m256d c);
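
A short usage sketch of the unmasked 128-bit intrinsic (a hypothetical demo; assumes an FMA-capable target, e.g., -mfma):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(4.0, 2.0);   /* arguments are high, low */
    __m128d b = _mm_set_pd(5.0, 3.0);
    __m128d c = _mm_set_pd(6.0, 1.0);
    double r[2];
    _mm_storeu_pd(r, _mm_fmsub_pd(a, b, c));   /* r[i] = a[i]*b[i] - c[i] */
    printf("%g %g\n", r[0], r[1]);             /* prints: 5 14 */
    return 0;
}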

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single
Precision Floating-Point Values
Opcode/Instruction
  Op/En | 64/32 bit Mode Support | CPUID Feature Flag
  Description

VEX.128.66.0F38.W0 9A /r VFMSUB132PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm1 and xmm3/mem, subtract xmm2 and put result in xmm1.

VEX.128.66.0F38.W0 AA /r VFMSUB213PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm1 and xmm2, subtract xmm3/mem and put result in xmm1.

VEX.128.66.0F38.W0 BA /r VFMSUB231PS xmm1, xmm2, xmm3/m128
  A | V/V | FMA
  Multiply packed single precision floating-point values from xmm2 and xmm3/mem, subtract xmm1 and put result in xmm1.

VEX.256.66.0F38.W0 9A /r VFMSUB132PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm1 and ymm3/mem, subtract ymm2 and put result in ymm1.

VEX.256.66.0F38.W0 AA /r VFMSUB213PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm1 and ymm2, subtract ymm3/mem and put result in ymm1.

VEX.256.66.0F38.W0 BA /r VFMSUB231PS ymm1, ymm2, ymm3/m256
  A | V/V | FMA
  Multiply packed single precision floating-point values from ymm2 and ymm3/mem, subtract ymm1 and put result in ymm1.

EVEX.128.66.0F38.W0 9A /r VFMSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, subtract xmm2 and put result in xmm1.

EVEX.128.66.0F38.W0 AA /r VFMSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm1 and xmm2, subtract xmm3/m128/m32bcst and put result in xmm1.

EVEX.128.66.0F38.W0 BA /r VFMSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, subtract xmm1 and put result in xmm1.

EVEX.256.66.0F38.W0 9A /r VFMSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, subtract ymm2 and put result in ymm1.

EVEX.256.66.0F38.W0 AA /r VFMSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm1 and ymm2, subtract ymm3/m256/m32bcst and put result in ymm1.

EVEX.256.66.0F38.W0 BA /r VFMSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹
  Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, subtract ymm1 and put result in ymm1.

EVEX.512.66.0F38.W0 9A /r VFMSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, subtract zmm2 and put result in zmm1.

EVEX.512.66.0F38.W0 AA /r VFMSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm1 and zmm2, subtract zmm3/m512/m32bcst and put result in zmm1.

EVEX.512.66.0F38.W0 BA /r VFMSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, subtract zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a set of SIMD multiply-subtract computation on packed single precision floating-point values using three
source operands and writes the multiply-subtract results in the destination operand. The destination operand is
also the first source operand. The second operand must be a SIMD register. The third source operand can be a
SIMD register or a memory location.
VFMSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision
floating-point values in the second source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFMSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision
floating-point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
VFMSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source to the four, eight or sixteen packed single precision floating-point values in the third source operand. From
the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision floating-point
values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed single
precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
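
The three operand orders exist so that a compiler can overwrite whichever input is dead while keeping the other two live. As a scalar analogy (illustrative only, using C99 fmaf()), all three forms compute the same fused a*b - c and differ only in which input the destination register supplies:

#include <math.h>

static inline float fmsub132(float d, float s2, float s3)
{ return fmaf(d, s3, -s2); }   /* dest = dest*src3 - src2 */

static inline float fmsub213(float d, float s2, float s3)
{ return fmaf(s2, d, -s3); }   /* dest = src2*dest - src3 */

static inline float fmsub231(float d, float s2, float s3)
{ return fmaf(s2, s3, -d); }   /* dest = src2*src3 - dest */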

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFMSUB132PS DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUB213PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUB231PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMSUBxxxPS __m512 _mm512_fmsub_ps(__m512 a, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_fmsub_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_mask_fmsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_maskz_fmsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_mask3_fmsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFMSUBxxxPS __m512 _mm512_mask_fmsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_maskz_fmsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_mask3_fmsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFMSUBxxxPS __m256 _mm256_mask_fmsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFMSUBxxxPS __m256 _mm256_maskz_fmsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFMSUBxxxPS __m256 _mm256_mask3_fmsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFMSUBxxxPS __m128 _mm_mask_fmsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMSUBxxxPS __m128 _mm_maskz_fmsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMSUBxxxPS __m128 _mm_mask3_fmsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMSUBxxxPS __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c);
VFMSUBxxxPS __m256 _mm256_fmsub_ps (__m256 a, __m256 b, __m256 c);
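
A masking sketch using one of the intrinsics above (assumptions: an AVX512F-capable target; the merging form passes the old value of the first operand through lanes whose mask bit is 0):

#include <immintrin.h>

__m512 masked_fmsub(__m512 a, __m512 b, __m512 c)
{
    __mmask16 k = 0x00FF;                      /* update the low 8 lanes only */
    return _mm512_mask_fmsub_ps(a, k, b, c);   /* a*b - c where k[i] = 1, else a */
}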

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFMSUB132SD/VFMSUB213SD/VFMSUB231SD—Fused Multiply-Subtract of Scalar Double
Precision Floating-Point Values
Opcode/Instruction
  Op/En | 64/32 bit Mode Support | CPUID Feature Flag
  Description

VEX.LIG.66.0F38.W1 9B /r VFMSUB132SD xmm1, xmm2, xmm3/m64
  A | V/V | FMA
  Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, subtract xmm2 and put result in xmm1.

VEX.LIG.66.0F38.W1 AB /r VFMSUB213SD xmm1, xmm2, xmm3/m64
  A | V/V | FMA
  Multiply scalar double precision floating-point value from xmm1 and xmm2, subtract xmm3/m64 and put result in xmm1.

VEX.LIG.66.0F38.W1 BB /r VFMSUB231SD xmm1, xmm2, xmm3/m64
  A | V/V | FMA
  Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, subtract xmm1 and put result in xmm1.

EVEX.LLIG.66.0F38.W1 9B /r VFMSUB132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, subtract xmm2 and put result in xmm1.

EVEX.LLIG.66.0F38.W1 AB /r VFMSUB213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar double precision floating-point value from xmm1 and xmm2, subtract xmm3/m64 and put result in xmm1.

EVEX.LLIG.66.0F38.W1 BB /r VFMSUB231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, subtract xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD multiply-subtract computation on the low packed double precision floating-point values using
three source operands and writes the multiply-subtract result in the destination operand. The destination operand
is also the first source operand. The second operand must be a XMM register. The third source operand can be a
XMM register or a 64-bit memory location.
VFMSUB132SD: Multiplies the low packed double precision floating-point value from the first source operand to the
low packed double precision floating-point value in the third source operand. From the infinite precision interme-
diate result, subtracts the low packed double precision floating-point values in the second source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFMSUB213SD: Multiplies the low packed double precision floating-point value from the second source operand to
the low packed double precision floating-point value in the first source operand. From the infinite precision inter-
mediate result, subtracts the low packed double precision floating-point value in the third source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFMSUB231SD: Multiplies the low packed double precision floating-point value from the second source to the low
packed double precision floating-point value in the third source operand. From the infinite precision intermediate
result, subtracts the low packed double precision floating-point value in the first source operand, performs

rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of a complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
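
The low-element arithmetic is exactly what C99 fma() provides: fma(a, b, -c) computes a*b - c with one rounding, whereas the plain expression a*b - c rounds twice and can differ in the last bit. A minimal scalar reference (an illustrative sketch; operand ordering and masking ignored):

#include <math.h>

double fmsub_sd_ref(double a, double b, double c)
{
    return fma(a, b, -c);   /* fused multiply-subtract, single rounding */
}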

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFMSUB132SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMSUB213SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMSUB231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMSUB132SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMSUB213SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFMSUB231SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMSUBxxxSD __m128d _mm_fmsub_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_mask_fmsub_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMSUBxxxSD __m128d _mm_maskz_fmsub_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMSUBxxxSD __m128d _mm_mask3_fmsub_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMSUBxxxSD __m128d _mm_mask_fmsub_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_maskz_fmsub_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_mask3_fmsub_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFMSUBxxxSD __m128d _mm_fmsub_sd (__m128d a, __m128d b, __m128d c);
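
A short usage sketch of the unmasked scalar intrinsic (a hypothetical demo; assumes an FMA-capable target, e.g., -mfma):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_sd(2.0);
    __m128d b = _mm_set_sd(3.0);
    __m128d c = _mm_set_sd(1.0);
    /* Low lane: 2*3 - 1 = 5; the upper lane is carried from a. */
    printf("%g\n", _mm_cvtsd_f64(_mm_fmsub_sd(a, b, c)));   /* prints: 5 */
    return 0;
}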

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VFMSUB132SS/VFMSUB213SS/VFMSUB231SS—Fused Multiply-Subtract of Scalar Single
Precision Floating-Point Values
Opcode/Instruction
  Op/En | 64/32 bit Mode Support | CPUID Feature Flag
  Description

VEX.LIG.66.0F38.W0 9B /r VFMSUB132SS xmm1, xmm2, xmm3/m32
  A | V/V | FMA
  Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, subtract xmm2 and put result in xmm1.

VEX.LIG.66.0F38.W0 AB /r VFMSUB213SS xmm1, xmm2, xmm3/m32
  A | V/V | FMA
  Multiply scalar single precision floating-point value from xmm1 and xmm2, subtract xmm3/m32 and put result in xmm1.

VEX.LIG.66.0F38.W0 BB /r VFMSUB231SS xmm1, xmm2, xmm3/m32
  A | V/V | FMA
  Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, subtract xmm1 and put result in xmm1.

EVEX.LLIG.66.0F38.W0 9B /r VFMSUB132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, subtract xmm2 and put result in xmm1.

EVEX.LLIG.66.0F38.W0 AB /r VFMSUB213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar single precision floating-point value from xmm1 and xmm2, subtract xmm3/m32 and put result in xmm1.

EVEX.LLIG.66.0F38.W0 BB /r VFMSUB231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
  B | V/V | AVX512F OR AVX10.1¹
  Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, subtract xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Tuple1 Scalar ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a SIMD multiply-subtract computation on the low packed single precision floating-point values using
three source operands and writes the multiply-subtract result in the destination operand. The destination operand
is also the first source operand. The second source operand must be an XMM register. The third source operand can be an
XMM register or a 32-bit memory location.
VFMSUB132SS: Multiplies the low packed single precision floating-point value from the first source operand to the
low packed single precision floating-point value in the third source operand. From the infinite precision interme-
diate result, subtracts the low packed single precision floating-point value in the second source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFMSUB213SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the first source operand. From the infinite precision interme-
diate result, subtracts the low packed single precision floating-point value in the third source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFMSUB231SS: Multiplies the low packed single precision floating-point value from the second source to the low
packed single precision floating-point value in the third source operand. From the infinite precision intermediate
result, subtracts the low packed single precision floating-point value in the first source operand, performs rounding
and stores the resulting packed single precision floating-point value to the destination operand (first source
operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFMSUB132SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(DEST[31:0]*SRC3[31:0] - SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMSUB213SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(SRC2[31:0]*DEST[31:0] - SRC3[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0



VFMSUB231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(SRC2[31:0]*SRC3[31:0] - DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMSUB132SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] - SRC2[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMSUB213SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] - SRC3[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFMSUB231SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] - DEST[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMSUBxxxSS __m128 _mm_fmsub_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_mask_fmsub_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMSUBxxxSS __m128 _mm_maskz_fmsub_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMSUBxxxSS __m128 _mm_mask3_fmsub_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMSUBxxxSS __m128 _mm_mask_fmsub_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_maskz_fmsub_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_mask3_fmsub_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFMSUBxxxSS __m128 _mm_fmsub_ss (__m128 a, __m128 b, __m128 c);
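As an illustrative sketch only (not part of the specification), the following uses the unmasked FMA-form intrinsic above; it assumes <immintrin.h> and FMA code generation (for example, gcc -mfma). The three upper single precision elements of the result come from the first operand, matching the "bits 127:32 unchanged" behavior described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(8.0f, 7.0f, 6.0f, 2.0f);  /* low element = 2.0f */
    __m128 b = _mm_set_ss(3.0f);
    __m128 c = _mm_set_ss(1.0f);
    __m128 r = _mm_fmsub_ss(a, b, c);   /* low: 2.0f*3.0f - 1.0f = 5.0f */
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]); /* 5 6 7 8 */
    return 0;
}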

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating
Subtract/Add of Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description

VEX.128.66.0F38.W1 97 /r VFMSUBADD132PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm1 and xmm3/mem, subtract/add elements in xmm2 and put result in xmm1.
VEX.128.66.0F38.W1 A7 /r VFMSUBADD213PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W1 B7 /r VFMSUBADD231PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm2 and xmm3/mem, subtract/add elements in xmm1 and put result in xmm1.
VEX.256.66.0F38.W1 97 /r VFMSUBADD132PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm1 and ymm3/mem, subtract/add elements in ymm2 and put result in ymm1.
VEX.256.66.0F38.W1 A7 /r VFMSUBADD213PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W1 B7 /r VFMSUBADD231PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm2 and ymm3/mem, subtract/add elements in ymm1 and put result in ymm1.
EVEX.128.66.0F38.W1 97 /r VFMSUBADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, subtract/add elements in xmm2 and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W1 A7 /r VFMSUBADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W1 B7 /r VFMSUBADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, subtract/add elements in xmm1 and put result in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W1 97 /r VFMSUBADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, subtract/add elements in ymm2 and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W1 A7 /r VFMSUBADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W1 B7 /r VFMSUBADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, subtract/add elements in ymm1 and put result in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W1 97 /r VFMSUBADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, subtract/add elements in zmm2 and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W1 A7 /r VFMSUBADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm1 and zmm2, subtract/add elements in zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W1 B7 /r VFMSUBADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, subtract/add elements in zmm1 and put result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
VFMSUBADD132PD: Multiplies the two, four, or eight packed double precision floating-point values from the first
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the second source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMSUBADD213PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the first source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the third source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMSUBADD231PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the first source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory
location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is
conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in
reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMSUBADD132PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
DEST[127:64] := RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
DEST[127:64] := RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
DEST[191:128] := RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] + SRC2[191:128])
DEST[255:192] := RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] - SRC2[255:192])
FI

VFMSUBADD213PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
DEST[191:128] := RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] + SRC3[191:128])
DEST[255:192] := RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] - SRC3[255:192])
FI

VFMSUBADD231PD DEST, SRC2, SRC3


IF (VEX.128) THEN
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[63:0] := RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
DEST[127:64] := RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
DEST[191:128] := RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] + DEST[191:128])
DEST[255:192] := RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] - DEST[255:192])
FI

VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
FI

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i])

ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMSUBADDxxxPD __m512d _mm512_fmsubadd_pd(__m512d a, __m512d b, __m512d c);
VFMSUBADDxxxPD __m512d _mm512_fmsubadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFMSUBADDxxxPD __m512d _mm512_mask_fmsubadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFMSUBADDxxxPD __m512d _mm512_maskz_fmsubadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFMSUBADDxxxPD __m512d _mm512_mask3_fmsubadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFMSUBADDxxxPD __m512d _mm512_mask_fmsubadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFMSUBADDxxxPD __m512d _mm512_maskz_fmsubadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFMSUBADDxxxPD __m512d _mm512_mask3_fmsubadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFMSUBADDxxxPD __m256d _mm256_mask_fmsubadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFMSUBADDxxxPD __m256d _mm256_maskz_fmsubadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFMSUBADDxxxPD __m256d _mm256_mask3_fmsubadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFMSUBADDxxxPD __m128d _mm_mask_fmsubadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMSUBADDxxxPD __m128d _mm_maskz_fmsubadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMSUBADDxxxPD __m128d _mm_mask3_fmsubadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMSUBADDxxxPD __m128d _mm_fmsubadd_pd (__m128d a, __m128d b, __m128d c);
VFMSUBADDxxxPD __m256d _mm256_fmsubadd_pd (__m256d a, __m256d b, __m256d c);
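A minimal sketch of the alternating behavior, for illustration only, using the unmasked FMA-form intrinsic above (assumes <immintrin.h> and FMA code generation, for example gcc -mfma): element 0 (even) receives a*b+c and element 1 (odd) receives a*b-c, matching the Operation section:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(2.0, 2.0);   /* both elements 2.0 */
    __m128d b = _mm_set_pd(3.0, 3.0);
    __m128d c = _mm_set_pd(1.0, 1.0);
    __m128d r = _mm_fmsubadd_pd(a, b, c);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("even: %f, odd: %f\n", out[0], out[1]); /* even: 7.0 (add), odd: 5.0 (subtract) */
    return 0;
}

With the 512-bit masked and {er} forms, _mm512_mask_fmsubadd_round_pd from the list above additionally selects the static rounding mode through its int r argument.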

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VFMSUBADD132PH/VFMSUBADD213PH/VFMSUBADD231PH—Fused Multiply-Alternating
Subtract/Add of Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.MAP6.W0 97 /r VFMSUBADD132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, subtract/add elements in xmm2, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 97 /r VFMSUBADD132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, subtract/add elements in ymm2, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 97 /r VFMSUBADD132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A; 64/32: V/V; CPUID: AVX512-FP16 OR AVX10.1 (Note 1)
  Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, subtract/add elements in zmm2, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 A7 /r VFMSUBADD213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from xmm1 and xmm2, subtract/add elements in xmm3/m128/m16bcst, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 A7 /r VFMSUBADD213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from ymm1 and ymm2, subtract/add elements in ymm3/m256/m16bcst, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 A7 /r VFMSUBADD213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A; 64/32: V/V; CPUID: AVX512-FP16 OR AVX10.1 (Note 1)
  Multiply packed FP16 values from zmm1 and zmm2, subtract/add elements in zmm3/m512/m16bcst, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 B7 /r VFMSUBADD231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, subtract/add elements in xmm1, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 B7 /r VFMSUBADD231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
  Op/En: A; 64/32: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1)
  Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, subtract/add elements in ymm1, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 B7 /r VFMSUBADD231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
  Op/En: A; 64/32: V/V; CPUID: AVX512-FP16 OR AVX10.1 (Note 1)
  Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, subtract/add elements in zmm1, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A



Description
This instruction performs a packed multiply-add (even elements) or multiply-subtract (odd elements) computation
on FP16 values using three source operands and writes the results in the destination operand. The destination
operand is also the first source operand. The notation "132", "213", and "231" indicates the use of the operands in A
* B ± C, where each digit corresponds to the operand number, with the destination being operand 1; see Table
5-10.
The destination elements are updated according to the writemask.

Table 5-10. VFMSUBADD[132,213,231]PH Notation for Odd and Even Elements

Notation | Odd Elements          | Even Elements
132      | dest = dest*src3-src2 | dest = dest*src3+src2
231      | dest = src2*src3-dest | dest = src2*src3+dest
213      | dest = src2*dest-src3 | dest = src2*dest+src3
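As a reading aid for the table, the scalar model below is illustrative only: it mirrors the three operand orderings for a single element, with operands promoted to float and fmaf from <math.h> standing in for the single-rounding fused operation. The function name and parameters are hypothetical, not part of any specification:

/* Illustrative scalar model of Table 5-10.
   form: 132, 213, or 231; odd: nonzero for odd element positions. */
#include <math.h>
#include <stdio.h>

static float fmsubadd_ph_model(int form, int odd, float dest, float src2, float src3) {
    float a, b, addend;
    switch (form) {
    case 132: a = dest; b = src3; addend = src2; break;  /* dest*src3 ± src2 */
    case 213: a = src2; b = dest; addend = src3; break;  /* src2*dest ± src3 */
    default:  a = src2; b = src3; addend = dest; break;  /* 231: src2*src3 ± dest */
    }
    return fmaf(a, b, odd ? -addend : addend);           /* odd: subtract; even: add */
}

int main(void) {
    /* 132 form with dest=2, src2=1, src3=3: even 2*3+1=7, odd 2*3-1=5 */
    printf("even: %f, odd: %f\n",
           fmsubadd_ph_model(132, 0, 2.0f, 1.0f, 3.0f),
           fmsubadd_ph_model(132, 1, 2.0f, 1.0f, 3.0f));
    return 0;
}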

Operation
VFMSUBADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMSUBADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMSUBADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMSUBADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VFMSUBADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VFMSUBADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFMSUBADD132PH, VFMSUBADD213PH, and VFMSUBADD231PH:
__m128h _mm_fmsubadd_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmsubadd_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmsubadd_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmsubadd_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fmsubadd_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fmsubadd_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fmsubadd_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fmsubadd_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fmsubadd_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fmsubadd_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fmsubadd_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fmsubadd_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fmsubadd_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fmsubadd_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fmsubadd_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fmsubadd_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);
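For illustration only, a minimal sketch with the 128-bit unmasked intrinsic above; it assumes a compiler with AVX512-FP16 support and the _Float16 type (for example, gcc -mavx512fp16 -mavx512vl), and hardware with the matching CPUID features:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h a = _mm_set1_ph((_Float16)2.0f);
    __m128h b = _mm_set1_ph((_Float16)3.0f);
    __m128h c = _mm_set1_ph((_Float16)1.0f);
    __m128h r = _mm_fmsubadd_ph(a, b, c);   /* even elements: 7.0; odd: 5.0 */
    _Float16 out[8];
    _mm_storeu_ph(out, r);
    for (int i = 0; i < 8; i++)
        printf("element %d: %f\n", i, (float)out[i]);
    return 0;
}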

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating
Subtract/Add of Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description

VEX.128.66.0F38.W0 97 /r VFMSUBADD132PS xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from xmm1 and xmm3/mem, subtract/add elements in xmm2 and put result in xmm1.
VEX.128.66.0F38.W0 A7 /r VFMSUBADD213PS xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W0 B7 /r VFMSUBADD231PS xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from xmm2 and xmm3/mem, subtract/add elements in xmm1 and put result in xmm1.
VEX.256.66.0F38.W0 97 /r VFMSUBADD132PS ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from ymm1 and ymm3/mem, subtract/add elements in ymm2 and put result in ymm1.
VEX.256.66.0F38.W0 A7 /r VFMSUBADD213PS ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W0 B7 /r VFMSUBADD231PS ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed single precision floating-point values from ymm2 and ymm3/mem, subtract/add elements in ymm1 and put result in ymm1.
EVEX.128.66.0F38.W0 97 /r VFMSUBADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, subtract/add elements in xmm2 and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 A7 /r VFMSUBADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/m128/m32bcst and put result in xmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 B7 /r VFMSUBADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, subtract/add elements in xmm1 and put result in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 97 /r VFMSUBADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, subtract/add elements in ymm2 and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W0 A7 /r VFMSUBADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/m256/m32bcst and put result in ymm1 subject to writemask k1.
EVEX.256.66.0F38.W0 B7 /r VFMSUBADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, subtract/add elements in ymm1 and put result in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 97 /r VFMSUBADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, subtract/add elements in zmm2 and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W0 A7 /r VFMSUBADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from zmm1 and zmm2, subtract/add elements in zmm3/m512/m32bcst and put result in zmm1 subject to writemask k1.
EVEX.512.66.0F38.W0 B7 /r VFMSUBADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, subtract/add elements in zmm1 and put result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
VFMSUBADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the corresponding packed single precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd single precision floating-point elements and adds
the even single precision floating-point values in the second source operand, performs rounding and stores the
resulting packed single precision floating-point values to the destination operand (first source operand).
VFMSUBADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, subtracts the odd single precision floating-point elements
and adds the even single precision floating-point values in the third source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
VFMSUBADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, subtracts the odd single precision floating-point elements
and adds the even single precision floating-point values in the first source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in
reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.



Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFMSUBADD132PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM -1{
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n])
DEST[n+63:n+32] := RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] -SRC2[n+63:n+32])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUBADD213PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM -1{
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] +SRC3[n+31:n])
DEST[n+63:n+32] := RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] -SRC3[n+63:n+32])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUBADD231PS DEST, SRC2, SRC3


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM -1{
n := 64*i;
DEST[n+31:n] := RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n])
DEST[n+63:n+32] := RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] -DEST[n+63:n+32])
}



IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
FI

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFMSUBADDxxxPS __m512 _mm512_fmsubadd_ps(__m512 a, __m512 b, __m512 c);
VFMSUBADDxxxPS __m512 _mm512_fmsubadd_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFMSUBADDxxxPS __m512 _mm512_mask_fmsubadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFMSUBADDxxxPS __m512 _mm512_maskz_fmsubadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFMSUBADDxxxPS __m512 _mm512_mask3_fmsubadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFMSUBADDxxxPS __m512 _mm512_mask_fmsubadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFMSUBADDxxxPS __m512 _mm512_maskz_fmsubadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFMSUBADDxxxPS __m512 _mm512_mask3_fmsubadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFMSUBADDxxxPS __m256 _mm256_mask_fmsubadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFMSUBADDxxxPS __m256 _mm256_maskz_fmsubadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFMSUBADDxxxPS __m256 _mm256_mask3_fmsubadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFMSUBADDxxxPS __m128 _mm_mask_fmsubadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMSUBADDxxxPS __m128 _mm_maskz_fmsubadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMSUBADDxxxPS __m128 _mm_mask3_fmsubadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMSUBADDxxxPS __m128 _mm_fmsubadd_ps (__m128 a, __m128 b, __m128 c);
VFMSUBADDxxxPS __m256 _mm256_fmsubadd_ps (__m256 a, __m256 b, __m256 c);
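For illustration only, a minimal sketch with the 256-bit unmasked intrinsic above (assumes <immintrin.h> and FMA plus AVX code generation, for example gcc -mfma): even lanes receive a*b+c, odd lanes a*b-c:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);
    __m256 r = _mm256_fmsubadd_ps(a, b, c); /* even lanes 7.0, odd lanes 5.0 */
    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("lane %d: %f\n", i, out[i]);
    return 0;
}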

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VFNMADD132PD/VFNMADD213PD/VFNMADD231PD—Fused Negative Multiply-Add of Packed
Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description

VEX.128.66.0F38.W1 9C /r VFNMADD132PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.128.66.0F38.W1 AC /r VFNMADD213PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W1 BC /r VFNMADD231PD xmm1, xmm2, xmm3/m128
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
VEX.256.66.0F38.W1 9C /r VFNMADD132PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and add to ymm2 and put result in ymm1.
VEX.256.66.0F38.W1 AC /r VFNMADD213PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W1 BC /r VFNMADD231PD ymm1, ymm2, ymm3/m256
  Op/En: A; 64/32: V/V; CPUID: FMA
  Multiply packed double precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.128.66.0F38.W1 9C /r VFNMADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.128.66.0F38.W1 AC /r VFNMADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/m128/m64bcst and put result in xmm1.
EVEX.128.66.0F38.W1 BC /r VFNMADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.256.66.0F38.W1 9C /r VFNMADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, negate the multiplication result and add to ymm2 and put result in ymm1.
EVEX.256.66.0F38.W1 AC /r VFNMADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/m256/m64bcst and put result in ymm1.
EVEX.256.66.0F38.W1 BC /r VFNMADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
  Op/En: B; 64/32: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.512.66.0F38.W1 9C /r VFNMADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, negate the multiplication result and add to zmm2 and put result in zmm1.
EVEX.512.66.0F38.W1 AC /r VFNMADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm1 and zmm2, negate the multiplication result and add to zmm3/m512/m64bcst and put result in zmm1.
EVEX.512.66.0F38.W1 BC /r VFNMADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
  Op/En: B; 64/32: V/V; CPUID: AVX512F OR AVX10.1 (Note 1)
  Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, negate the multiplication result and add to zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
VFNMADD132PD: Multiplies the two, four or eight packed double precision floating-point values from the first
source operand to the two, four or eight packed double precision floating-point values in the third source operand,
adds the negated infinite precision intermediate result to the two, four or eight packed double precision floating-
point values in the second source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFNMADD213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand,
adds the negated infinite precision intermediate result to the two, four or eight packed double precision floating-
point values in the third source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFNMADD231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the third source operand,
adds the negated infinite precision intermediate result to the two, four or eight packed double precision floating-
point values in the first source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in
reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
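Before the formal operation below, the following illustrative sketch (not part of the specification) shows the negated multiply-add semantics with the corresponding FMA-form intrinsic; it assumes <immintrin.h> and FMA code generation (for example, gcc -mfma). Each element is computed as -(a*b)+c with a single rounding:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(4.0, 2.0);
    __m128d b = _mm_set_pd(5.0, 3.0);
    __m128d c = _mm_set_pd(30.0, 10.0);
    __m128d r = _mm_fnmadd_pd(a, b, c);  /* -(a*b)+c per element */
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);   /* 4.0 (=10-6), 10.0 (=30-20) */
    return 0;
}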

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFNMADD132PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(-(DEST[n+63:n]*SRC3[n+63:n]) + SRC2[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD213PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(-(SRC2[n+63:n]*DEST[n+63:n]) + SRC3[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD231PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR(-(SRC2[n+63:n]*SRC3[n+63:n]) + DEST[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFNMADDxxxPD __m512d _mm512_fnmadd_pd(__m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFNMADDxxxPD __m256d _mm256_mask_fnmadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_maskz_fnmadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_mask3_fnmadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_mask_fnmadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_maskz_fnmadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_mask3_fnmadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);
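
Example (non-normative): the {er} forms allow a per-instruction rounding override. A minimal sketch, assuming an
AVX-512F toolchain; the rounding-control macros are the standard ones from immintrin.h:

#include <immintrin.h>

__m512d fnmadd_toward_zero(__m512d a, __m512d b, __m512d c)
{
    /* EVEX.RC static rounding: round toward zero and suppress all SIMD
       floating-point exceptions for this one instruction; MXCSR.RC is
       ignored. Requires the 512-bit register form. */
    return _mm512_fnmadd_round_pd(a, b, c,
                                  _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}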

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VF[,N]MADD[132,213,231]PH—Fused Multiply-Add of Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.MAP6.W0 98 /r VFMADD132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, add to xmm2, and store the result in xmm1.
EVEX.256.66.MAP6.W0 98 /r VFMADD132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, add to ymm2, and store the result in ymm1.
EVEX.512.66.MAP6.W0 98 /r VFMADD132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, add to zmm2, and store the result in zmm1.
EVEX.128.66.MAP6.W0 A8 /r VFMADD213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm1 and xmm2, add to xmm3/m128/m16bcst, and store the result in xmm1.
EVEX.256.66.MAP6.W0 A8 /r VFMADD213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm1 and ymm2, add to ymm3/m256/m16bcst, and store the result in ymm1.
EVEX.512.66.MAP6.W0 A8 /r VFMADD213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm1 and zmm2, add to zmm3/m512/m16bcst, and store the result in zmm1.
EVEX.128.66.MAP6.W0 B8 /r VFMADD231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, add to xmm1, and store the result in xmm1.
EVEX.256.66.MAP6.W0 B8 /r VFMADD231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, add to ymm1, and store the result in ymm1.
EVEX.512.66.MAP6.W0 B8 /r VFMADD231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, add to zmm1, and store the result in zmm1.
EVEX.128.66.MAP6.W0 9C /r VFNMADD132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, and negate the value. Add this value to xmm2, and store the result in xmm1.
EVEX.256.66.MAP6.W0 9C /r VFNMADD132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, and negate the value. Add this value to ymm2, and store the result in ymm1.
EVEX.512.66.MAP6.W0 9C /r VFNMADD132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, and negate the value. Add this value to zmm2, and store the result in zmm1.
EVEX.128.66.MAP6.W0 AC /r VFNMADD213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm1 and xmm2, and negate the value. Add this value to xmm3/m128/m16bcst, and store the result in xmm1.
EVEX.256.66.MAP6.W0 AC /r VFNMADD213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm1 and ymm2, and negate the value. Add this value to ymm3/m256/m16bcst, and store the result in ymm1.
EVEX.512.66.MAP6.W0 AC /r VFNMADD213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm1 and zmm2, and negate the value. Add this value to zmm3/m512/m16bcst, and store the result in zmm1.
EVEX.128.66.MAP6.W0 BC /r VFNMADD231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, and negate the value. Add this value to xmm1, and store the result in xmm1.
EVEX.256.66.MAP6.W0 BC /r VFNMADD231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1^1 | Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, and negate the value. Add this value to ymm1, and store the result in ymm1.
EVEX.512.66.MAP6.W0 BC /r VFNMADD231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, and negate the value. Add this value to zmm1, and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
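
Example (non-normative): a sketch of the run-time check this note describes, assuming GCC/Clang's <cpuid.h>.
Software should first confirm AVX10 support itself (CPUID.(EAX=07H,ECX=01H):EDX[19]); the EBX bit positions
used below are taken from the Intel AVX10 architecture specification and should be verified against it:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Leaf 24H, sub-leaf 0: the Intel AVX10 Converged Vector ISA leaf. */
    if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 24H not available");
        return 1;
    }
    printf("AVX10 converged version: %u\n", ebx & 0xFF);  /* EBX[7:0] */
    printf("256-bit vectors: %s\n", (ebx >> 17) & 1 ? "yes" : "no");
    printf("512-bit vectors: %s\n", (ebx >> 18) & 1 ? "yes" : "no");
    return 0;
}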

Instruction Operand Encoding


Op/En | Tuple | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
This instruction performs a packed multiply-add or negated multiply-add computation on FP16 values using three
source operands and writes the results in the destination operand. The destination operand is also the first source
operand. The “N” (negated) forms of this instruction add the negated infinite precision intermediate product to the
corresponding remaining operand. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B
+ C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-5.
The destination elements are updated according to the writemask.

Table 5-5. VF[,N]MADD[132,213,231]PH Notation for Operands


Notation | Operands
132 | dest = ±(dest*src3) + src2
231 | dest = ±(src2*src3) + dest
213 | dest = ±(src2*dest) + src3
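
Example (non-normative, values chosen only for illustration): if every element of dest is 2.0, of src2 is 3.0, and of
src3 is 4.0, the negated (VFNMADD) forms produce, per element:
132: -(dest*src3) + src2 = -(2.0*4.0) + 3.0 = -5.0
213: -(src2*dest) + src3 = -(3.0*2.0) + 4.0 = -2.0
231: -(src2*src3) + dest = -(3.0*4.0) + 2.0 = -10.0
The non-negated (VFMADD) forms produce 11.0, 10.0, and 14.0, respectively.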



Operation
VF[,N]MADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VF[,N]MADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VF[,N]MADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFMADD132PH, VFMADD213PH, and VFMADD231PH:
__m128h _mm_fmadd_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmadd_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmadd_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmadd_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fmadd_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fmadd_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fmadd_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fmadd_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fmadd_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fmadd_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fmadd_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fmadd_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fmadd_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fmadd_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fmadd_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fmadd_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);

VFNMADD132PH, VFNMADD213PH, and VFNMADD231PH:


__m128h _mm_fnmadd_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fnmadd_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fnmadd_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fnmadd_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fnmadd_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fnmadd_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fnmadd_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fnmadd_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fnmadd_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fnmadd_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fnmadd_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fnmadd_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fnmadd_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fnmadd_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fnmadd_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fnmadd_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);
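
Example (non-normative): a minimal sketch, assuming a toolchain with AVX512-FP16 support (e.g., gcc or clang
with -mavx512fp16) where __m512h holds 32 _Float16 elements:

#include <immintrin.h>

/* Each of the 32 FP16 result elements is -(a*b) + c, computed with a
   single rounding of the infinite precision intermediate product. */
__m512h fnmadd_fp16(__m512h a, __m512h b, __m512h c)
{
    return _mm512_fnmadd_ph(a, b, c);
}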

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VFNMADD132PS/VFNMADD213PS/VFNMADD231PS—Fused Negative Multiply-Add of Packed
Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 9C /r VFNMADD132PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.128.66.0F38.W0 AC /r VFNMADD213PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W0 BC /r VFNMADD231PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
VEX.256.66.0F38.W0 9C /r VFNMADD132PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and add to ymm2 and put result in ymm1.
VEX.256.66.0F38.W0 AC /r VFNMADD213PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W0 BC /r VFNMADD231PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.128.66.0F38.W0 9C /r VFNMADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.128.66.0F38.W0 AC /r VFNMADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/m128/m32bcst and put result in xmm1.
EVEX.128.66.0F38.W0 BC /r VFNMADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.256.66.0F38.W0 9C /r VFNMADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, negate the multiplication result and add to ymm2 and put result in ymm1.
EVEX.256.66.0F38.W0 AC /r VFNMADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/m256/m32bcst and put result in ymm1.
EVEX.256.66.0F38.W0 BC /r VFNMADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1^1 | Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.512.66.0F38.W0 9C /r VFNMADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm2 and put result in zmm1.
EVEX.512.66.0F38.W0 AC /r VFNMADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply packed single precision floating-point values from zmm1 and zmm2, negate the multiplication result and add to zmm3/m512/m32bcst and put result in zmm1.
EVEX.512.66.0F38.W0 BC /r VFNMADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Full | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the second source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the third source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory
location or a 512/256/128-bit vector broadcast from a 32-bit memory location. The destination operand is
conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in
reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFNMADD132PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(- (DEST[n+31:n]*SRC3[n+31:n]) + SRC2[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD213PS DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(- (SRC2[n+31:n]*DEST[n+31:n]) + SRC3[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD231PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR(- (SRC2[n+31:n]*SRC3[n+31:n]) + DEST[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
FI;

ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[31:0])

ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFNMADDxxxPS __m512 _mm512_fnmadd_ps(__m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFNMADDxxxPS __m256 _mm256_mask_fnmadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_maskz_fnmadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_mask3_fnmadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_mask_fnmadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_maskz_fnmadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_mask3_fnmadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);
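
Example (non-normative): a sketch contrasting merging masking and zeroing masking with the 512-bit forms
above, assuming an AVX-512F toolchain:

#include <immintrin.h>

void fnmadd_masked(__m512 a, __m512 b, __m512 c, __mmask16 k,
                   __m512 *merged, __m512 *zeroed)
{
    /* Lanes with k[j] = 0 keep the value of a (merging masking). */
    *merged = _mm512_mask_fnmadd_ps(a, k, b, c);
    /* Lanes with k[j] = 0 are set to zero (zeroing masking). */
    *zeroed = _mm512_maskz_fnmadd_ps(k, a, b, c);
}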

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFNMADD132SD/VFNMADD213SD/VFNMADD231SD—Fused Negative Multiply-Add of Scalar
Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W1 9D /r VFNMADD132SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W1 AD /r VFNMADD213SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.LIG.66.0F38.W1 BD /r VFNMADD231SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 9D /r VFNMADD132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 AD /r VFNMADD213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m64 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 BD /r VFNMADD231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and add to xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Tuple1 Scalar | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMADD132SD: Multiplies the low packed double precision floating-point value from the first source operand to
the low packed double precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed double precision floating-point value in the second source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFNMADD213SD: Multiplies the low packed double precision floating-point value from the second source operand
to the low packed double precision floating-point value in the first source operand, adds the negated infinite preci-
sion intermediate result to the low packed double precision floating-point value in the third source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFNMADD231SD: Multiplies the low packed double precision floating-point value from the second source operand
to the low packed double precision floating-point value in the third source operand, adds the negated infinite inter-
mediate precision result to the low packed double precision floating-point value in the first source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFNMADD132SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(DEST[63:0]*SRC3[63:0]) + SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMADD213SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*DEST[63:0]) + SRC3[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMADD231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMADD132SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (DEST[63:0]*SRC3[63:0]) + SRC2[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMADD213SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (SRC2[63:0]*DEST[63:0]) + SRC3[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMADD231SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFNMADDxxxSD __m128d _mm_fnmadd_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_mask_fnmadd_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMADDxxxSD __m128d _mm_maskz_fnmadd_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMADDxxxSD __m128d _mm_mask3_fnmadd_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMADDxxxSD __m128d _mm_mask_fnmadd_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_maskz_fnmadd_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_mask3_fnmadd_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFNMADDxxxSD __m128d _mm_fnmadd_sd (__m128d a, __m128d b, __m128d c);
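
Example (non-normative): a sketch of the scalar semantics, assuming an FMA-capable compiler and target (e.g.,
gcc -mfma). Only element 0 is computed; element 1 passes through from the first operand:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(99.0, 2.0);   /* element 1 = 99.0, element 0 = 2.0 */
    __m128d b = _mm_set_pd(88.0, 4.0);
    __m128d c = _mm_set_pd(77.0, 3.0);

    /* Low element: -(2.0*4.0) + 3.0 = -5.0; high element copied from a. */
    __m128d r = _mm_fnmadd_sd(a, b, c);

    double out[2];
    _mm_storeu_pd(out, r);
    printf("low=%f high=%f\n", out[0], out[1]);   /* low=-5.0 high=99.0 */
    return 0;
}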

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VF[,N]MADD[132,213,231]SH—Fused Multiply-Add of Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.66.MAP6.W0 99 /r VFMADD132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm1 and xmm3/m16, add to xmm2, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 A9 /r VFMADD213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm1 and xmm2, add to xmm3/m16, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 B9 /r VFMADD231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm2 and xmm3/m16, add to xmm1, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 9D /r VFNMADD132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm1 and xmm3/m16, and negate the value. Add this value to xmm2, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 AD /r VFNMADD213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm1 and xmm2, and negate the value. Add this value to xmm3/m16, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 BD /r VFNMADD231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1^1 | Multiply FP16 values from xmm2 and xmm3/m16, and negate the value. Add this value to xmm1, and store the result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Scalar | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
Performs a scalar multiply-add or negated multiply-add computation on the low FP16 values using three source
operands and writes the result in the destination operand. The destination operand is also the first source operand.
The “N” (negated) forms of this instruction add the negated infinite precision intermediate product to the corre-
sponding remaining operand. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B + C,
where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-6.
Bits 127:16 of the destination operand are preserved. Bits MAXVL-1:128 of the destination operand are zeroed. The
low FP16 element of the destination is updated according to the writemask.

Table 5-6. VF[,N]MADD[132,213,231]SH Notation for Operands


Notation | Operands
132 | dest = ±(dest*src3) + src2
231 | dest = ±(src2*src3) + dest
213 | dest = ±(src2*dest) + src3



Operation
VF[,N]MADD132SH DEST, SRC2, SRC3 (EVEX encoded versions)
IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-DEST.fp16[0]*SRC3.fp16[0] + SRC2.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(DEST.fp16[0]*SRC3.fp16[0] + SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0

VF[,N]MADD213SH DEST, SRC2, SRC3 (EVEX encoded versions)


IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-SRC2.fp16[0]*DEST.fp16[0] + SRC3.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(SRC2.fp16[0]*DEST.fp16[0] + SRC3.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0

VF[,N]MADD231SH DEST, SRC2, SRC3 (EVEX encoded versions)


IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-SRC2.fp16[0]*SRC3.fp16[0] + DEST.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(SRC2.fp16[0]*SRC3.fp16[0] + DEST.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFMADD132SH, VFMADD213SH, and VFMADD231SH:
__m128h _mm_fmadd_round_sh (__m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask_fmadd_round_sh (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask3_fmadd_round_sh (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
__m128h _mm_maskz_fmadd_round_sh (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_fmadd_sh (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmadd_sh (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmadd_sh (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmadd_sh (__mmask8 k, __m128h a, __m128h b, __m128h c);

VFNMADD132SH, VFNMADD213SH, and VFNMADD231SH:


__m128h _mm_fnmadd_round_sh (__m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask_fnmadd_round_sh (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask3_fnmadd_round_sh (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
__m128h _mm_maskz_fnmadd_round_sh (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_fnmadd_sh (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fnmadd_sh (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fnmadd_sh (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fnmadd_sh (__mmask8 k, __m128h a, __m128h b, __m128h c);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VFNMADD132SS/VFNMADD213SS/VFNMADD231SS—Fused Negative Multiply-Add of Scalar
Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W0 9D /r VFNMADD132SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W0 AD /r VFNMADD213SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m32 and put result in xmm1.
VEX.LIG.66.0F38.W0 BD /r VFNMADD231SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 9D /r VFNMADD132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 AD /r VFNMADD213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar single precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m32 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 BD /r VFNMADD231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1^1 | Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and add to xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Tuple1 Scalar | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMADD132SS: Multiplies the low packed single precision floating-point value from the first source operand to
the low packed single precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the second source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFNMADD213SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the first source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the third source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFNMADD231SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the first source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).

VFNMADD132SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(DEST[31:0]*SRC3[31:0]) + SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMADD213SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*DEST[31:0]) + SRC3[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMADD231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMADD132SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (DEST[31:0]*SRC3[31:0]) + SRC2[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMADD213SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (SRC2[31:0]*DEST[31:0]) + SRC3[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMADD231SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFNMADDxxxSS __m128 _mm_fnmadd_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_mask_fnmadd_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMADDxxxSS __m128 _mm_maskz_fnmadd_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMADDxxxSS __m128 _mm_mask3_fnmadd_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMADDxxxSS __m128 _mm_mask_fnmadd_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_maskz_fnmadd_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_mask3_fnmadd_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFNMADDxxxSS __m128 _mm_fnmadd_ss (__m128 a, __m128 b, __m128 c);
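
Example (non-normative): a common use of the negated multiply-add is the Newton-Raphson step
x1 = x0*(2 - d*x0) that refines a reciprocal estimate, since (2 - d*x0) is exactly -(d*x0) + 2. A sketch assuming
an FMA-capable toolchain; only the low element of the result is meaningful:

#include <immintrin.h>

__m128 recip_refined(__m128 d)
{
    __m128 x0  = _mm_rcp_ss(d);              /* ~12-bit estimate of 1/d */
    __m128 two = _mm_set_ss(2.0f);
    __m128 e   = _mm_fnmadd_ss(d, x0, two);  /* low: -(d*x0) + 2        */
    return _mm_mul_ss(x0, e);                /* low: x0*(2 - d*x0)      */
}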

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of
Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
VEX.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and subtract ymm2 and put result in ymm1.
VEX.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and subtract ymm1 and put result in ymm1.
EVEX.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m128/m64bcst and put result in xmm1.
EVEX.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm2 and put result in ymm1.
EVEX.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/m256/m64bcst and put result in ymm1.
EVEX.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm1 and put result in ymm1.



EVEX.512.66.0F38.W1 9E /r VFNMSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, negate the multiplication result and subtract zmm2 and put result in zmm1.
EVEX.512.66.0F38.W1 AE /r VFNMSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed double precision floating-point values from zmm1 and zmm2, negate the multiplication result and subtract zmm3/m512/m64bcst and put result in zmm1.
EVEX.512.66.0F38.W1 BE /r VFNMSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, negate the multiplication result and subtract zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
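The note above implies a run-time dispatch step. A minimal sketch in C, assuming GCC/Clang's <cpuid.h> and the EBX bit layout published for AVX10.1 (EBX[7:0] = AVX10 version, EBX[18] = 512-bit vector support); verify these bit positions against the current CPUID documentation before relying on them:

#include <cpuid.h>
#include <stdbool.h>

/* Sketch: report whether the AVX10 converged vector ISA leaf (24H)
   is present and enumerates 512-bit vector support. */
static bool avx10_has_512bit_vectors(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx))
        return false;                      /* leaf 24H not enumerated */
    return (ebx & 0xFF) >= 1               /* assumed: AVX10 version field */
        && ((ebx >> 18) & 1);              /* assumed: 512-bit vector bit */
}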

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Full | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMSUB132PD: Multiplies the two, four or eight packed double precision floating-point values from the first
source operand to the two, four or eight packed double precision floating-point values in the third source operand.
From negated infinite precision intermediate results, subtracts the two, four or eight packed double precision
floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight
packed double precision floating-point values to the destination operand (first source operand).
VFNMSUB213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand.
From negated infinite precision intermediate results, subtracts the two, four or eight packed double precision
floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight
packed double precision floating-point values to the destination operand (first source operand).
VFNMSUB231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand. From
negated infinite precision intermediate results, subtracts the two, four or eight packed double precision floating-
point values in the first source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
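As an informal model (not the manual's reference pseudocode): because −(a*b) − c = −(a*b + c), each element of VFNMSUB can be expressed with the C99 fma() library routine; under the default round-to-nearest mode, negation is exact and commutes with rounding, so this matches the instruction's round-once result per element.

#include <math.h>

/* One double element per form; matches VFNMSUB under round-to-nearest,
   not under directed rounding modes. */
double vfnmsub132_elem(double dest, double src2, double src3)
{ return -fma(dest, src3, src2); }    /* -(dest*src3) - src2 */
double vfnmsub213_elem(double dest, double src2, double src3)
{ return -fma(src2, dest, src3); }    /* -(src2*dest) - src3 */
double vfnmsub231_elem(double dest, double src2, double src3)
{ return -fma(src2, src3, dest); }    /* -(src2*src3) - dest */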



Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFNMSUB132PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR( - (DEST[n+63:n]*SRC3[n+63:n]) - SRC2[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMSUB213PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR( - (SRC2[n+63:n]*DEST[n+63:n]) - SRC3[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMSUB231PD DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 64*i;
DEST[n+63:n] := RoundFPControl_MXCSR( - (SRC2[n+63:n]*SRC3[n+63:n]) - DEST[n+63:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI



VFNMSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(DEST[i+63:i]*SRC3[i+63:i]) - SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) - SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VFNMSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VFNMSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*SRC3[i+63:i]) - DEST[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) - DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFNMSUBxxxPD __m512d _mm512_fnmsub_pd(__m512d a, __m512d b, __m512d c);
VFNMSUBxxxPD __m512d _mm512_fnmsub_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFNMSUBxxxPD __m512d _mm512_mask_fnmsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFNMSUBxxxPD __m512d _mm512_maskz_fnmsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFNMSUBxxxPD __m512d _mm512_mask3_fnmsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFNMSUBxxxPD __m512d _mm512_mask_fnmsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFNMSUBxxxPD __m512d _mm512_maskz_fnmsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFNMSUBxxxPD __m512d _mm512_mask3_fnmsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFNMSUBxxxPD __m256d _mm256_mask_fnmsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFNMSUBxxxPD __m256d _mm256_maskz_fnmsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFNMSUBxxxPD __m256d _mm256_mask3_fnmsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFNMSUBxxxPD __m128d _mm_mask_fnmsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMSUBxxxPD __m128d _mm_maskz_fnmsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMSUBxxxPD __m128d _mm_mask3_fnmsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMSUBxxxPD __m128d _mm_fnmsub_pd (__m128d a, __m128d b, __m128d c);
VFNMSUBxxxPD __m256d _mm256_fnmsub_pd (__m256d a, __m256d b, __m256d c);
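A minimal usage sketch for the unmasked 128-bit intrinsic (an illustration, not part of the manual; assumes an FMA-capable target, e.g., compiled with -mfma):

#include <immintrin.h>

/* dst[i] = -(a[i]*b[i]) - c[i] for the two packed doubles. */
__m128d neg_mul_sub_pd(__m128d a, __m128d b, __m128d c)
{
    return _mm_fnmsub_pd(a, b, c);
}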

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VF[,N]MSUB[132,213,231]PH—Fused Multiply-Subtract of Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.MAP6.W0 9A /r VFMSUB132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, subtract xmm2, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 9A /r VFMSUB132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, subtract ymm2, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 9A /r VFMSUB132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, subtract zmm2, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 AA /r VFMSUB213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm2, subtract xmm3/m128/m16bcst, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 AA /r VFMSUB213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm2, subtract ymm3/m256/m16bcst, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 AA /r VFMSUB213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm2, subtract zmm3/m512/m16bcst, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 BA /r VFMSUB231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, subtract xmm1, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 BA /r VFMSUB231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, subtract ymm1, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 BA /r VFMSUB231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, subtract zmm1, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 9E /r VFNMSUB132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, and negate the value. Subtract xmm2 from this value, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 9E /r VFNMSUB132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, and negate the value. Subtract ymm2 from this value, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 9E /r VFNMSUB132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, and negate the value. Subtract zmm2 from this value, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 AE /r VFNMSUB213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm2, and negate the value. Subtract xmm3/m128/m16bcst from this value, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 AE /r VFNMSUB213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm2, and negate the value. Subtract ymm3/m256/m16bcst from this value, and store the result in ymm1 subject to writemask k1.



EVEX.512.66.MAP6.W0 AE /r VFNMSUB213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm2, and negate the value. Subtract zmm3/m512/m16bcst from this value, and store the result in zmm1 subject to writemask k1.
EVEX.128.66.MAP6.W0 BE /r VFNMSUB231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, and negate the value. Subtract xmm1 from this value, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP6.W0 BE /r VFNMSUB231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, and negate the value. Subtract ymm1 from this value, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP6.W0 BE /r VFNMSUB231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, and negate the value. Subtract zmm1 from this value, and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
This instruction performs a packed multiply-subtract or a negated multiply-subtract computation on FP16 values
using three source operands and writes the results in the destination operand. The destination operand is also the
first source operand. The “N” (negated) forms of this instruction subtract the remaining operand from the negated
infinite precision intermediate product. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B − C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-7.
The destination elements are updated according to the writemask.

Table 5-7. VF[,N]MSUB[132,213,231]PH Notation for Operands


Notation | Operands
132 | dest = ± dest*src3 - src2
231 | dest = ± src2*src3 - dest
213 | dest = ± src2*dest - src3
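The three forms exist so that whichever source register is dead can be overwritten; the C intrinsics listed later expose a single (a, b, c) ordering computing a*b − c (or −(a*b) − c for the “N” forms), and a form is typically chosen during code generation. A hedged sketch, assuming an AVX512-FP16 target:

#include <immintrin.h>

/* Whichever 132/213/231 encoding is emitted, the result is
   a*b - c per FP16 element, rounded once. */
__m512h fused_msub_ph(__m512h a, __m512h b, __m512h c)
{
    return _mm512_fmsub_ph(a, b, c);
}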



Operation
VF[,N]MSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VF[,N]MSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VF[,N]MSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

VF[,N]MSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFMSUB132PH, VFMSUB213PH, and VFMSUB231PH:
__m128h _mm_fmsub_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmsub_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmsub_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmsub_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fmsub_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fmsub_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fmsub_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fmsub_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fmsub_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fmsub_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fmsub_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fmsub_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fmsub_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fmsub_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fmsub_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fmsub_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);

VFNMSUB132PH, VFNMSUB213PH, and VFNMSUB231PH:


__m128h _mm_fnmsub_ph (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fnmsub_ph (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fnmsub_ph (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fnmsub_ph (__mmask8 k, __m128h a, __m128h b, __m128h c);
__m256h _mm256_fnmsub_ph (__m256h a, __m256h b, __m256h c);
__m256h _mm256_mask_fnmsub_ph (__m256h a, __mmask16 k, __m256h b, __m256h c);
__m256h _mm256_mask3_fnmsub_ph (__m256h a, __m256h b, __m256h c, __mmask16 k);
__m256h _mm256_maskz_fnmsub_ph (__mmask16 k, __m256h a, __m256h b, __m256h c);
__m512h _mm512_fnmsub_ph (__m512h a, __m512h b, __m512h c);
__m512h _mm512_mask_fnmsub_ph (__m512h a, __mmask32 k, __m512h b, __m512h c);
__m512h _mm512_mask3_fnmsub_ph (__m512h a, __m512h b, __m512h c, __mmask32 k);
__m512h _mm512_maskz_fnmsub_ph (__mmask32 k, __m512h a, __m512h b, __m512h c);
__m512h _mm512_fnmsub_round_ph (__m512h a, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask_fnmsub_round_ph (__m512h a, __mmask32 k, __m512h b, __m512h c, const int rounding);
__m512h _mm512_mask3_fnmsub_round_ph (__m512h a, __m512h b, __m512h c, __mmask32 k, const int rounding);
__m512h _mm512_maskz_fnmsub_round_ph (__mmask32 k, __m512h a, __m512h b, __m512h c, const int rounding);
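A short usage sketch of the zeroing-masked negated form (an illustration, not from the manual; assumes an AVX512-FP16 target):

#include <immintrin.h>

/* Lanes with the corresponding k bit set receive -(a*b) - c; all
   other destination lanes are zeroed, matching {z} above. */
__m512h masked_neg_msub_ph(__mmask32 k, __m512h a, __m512h b, __m512h c)
{
    return _mm512_maskz_fnmsub_ph(k, a, b, c);
}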

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of
Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 9E /r VFNMSUB132PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.128.66.0F38.W0 AE /r VFNMSUB213PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W0 BE /r VFNMSUB231PS xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed single precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
VEX.256.66.0F38.W0 9E /r VFNMSUB132PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and subtract ymm2 and put result in ymm1.
VEX.256.66.0F38.W0 AE /r VFNMSUB213PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W0 BE /r VFNMSUB231PS ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed single precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and subtract ymm1 and put result in ymm1.
EVEX.128.66.0F38.W0 9E /r VFNMSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from xmm1 and xmm3/m128/m32bcst, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.128.66.0F38.W0 AE /r VFNMSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m128/m32bcst and put result in xmm1.
EVEX.128.66.0F38.W0 BE /r VFNMSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from xmm2 and xmm3/m128/m32bcst, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.256.66.0F38.W0 9E /r VFNMSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from ymm1 and ymm3/m256/m32bcst, negate the multiplication result and subtract ymm2 and put result in ymm1.
EVEX.256.66.0F38.W0 AE /r VFNMSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/m256/m32bcst and put result in ymm1.
EVEX.256.66.0F38.W0 BE /r VFNMSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from ymm2 and ymm3/m256/m32bcst, negate the multiplication result and subtract ymm1 and put result in ymm1.

EVEX.512.66.0F38.W0 9E /r VFNMSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm2 and put result in zmm1.
EVEX.512.66.0F38.W0 AE /r VFNMSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm1 and zmm2, negate the multiplication result and subtract zmm3/m512/m32bcst and put result in zmm1.
EVEX.512.66.0F38.W0 BE /r VFNMSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm1 and put result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Full | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single
precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight
or sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single
precision floating-point values in the third source operand, performs rounding and stores the resulting four, eight
or sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source to the four, eight or sixteen packed single precision floating-point values in the third source operand. From
negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single precision floating-
point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFNMSUB132PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR( - (DEST[n+31:n]*SRC3[n+31:n]) - SRC2[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMSUB213PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR( - (SRC2[n+31:n]*DEST[n+31:n]) - SRC3[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMSUB231PS DEST, SRC2, SRC3 (VEX encoded version)


IF (VEX.128) THEN
MAXNUM := 2
ELSEIF (VEX.256)
MAXNUM := 4
FI
For i = 0 to MAXNUM-1 {
n := 32*i;
DEST[n+31:n] := RoundFPControl_MXCSR( - (SRC2[n+31:n]*SRC3[n+31:n]) - DEST[n+31:n])
}
IF (VEX.128) THEN
DEST[MAXVL-1:128] := 0
ELSEIF (VEX.256)
DEST[MAXVL-1:256] := 0
FI

VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VFNMSUBxxxPS __m512 _mm512_fnmsub_ps(__m512 a, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFNMSUBxxxPS __m256 _mm256_mask_fnmsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFNMSUBxxxPS __m256 _mm256_maskz_fnmsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFNMSUBxxxPS __m256 _mm256_mask3_fnmsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFNMSUBxxxPS __m128 _mm_mask_fnmsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMSUBxxxPS __m128 _mm_maskz_fnmsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMSUBxxxPS __m128 _mm_mask3_fnmsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMSUBxxxPS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c);
VFNMSUBxxxPS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c);
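A minimal array-processing sketch using the 256-bit unmasked intrinsic (an illustration, not from the manual; assumes an FMA-capable target, e.g., -mfma, and a length that is a multiple of 8):

#include <immintrin.h>
#include <stddef.h>

/* r[i] = -(a[i]*b[i]) - c[i], eight floats per iteration. */
void neg_mul_sub_arrays(float *r, const float *a, const float *b,
                        const float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(r + i, _mm256_fnmsub_ps(va, vb, vc));
    }
}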

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD—Fused Negative Multiply-Subtract of
Scalar Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.LIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m64 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and subtract xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Tuple1 Scalar | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMSUB132SD: Multiplies the low packed double precision floating-point value from the first source operand to
the low packed double precision floating-point value in the third source operand. From negated infinite precision
intermediate result, subtracts the low double precision floating-point value in the second source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VFNMSUB213SD: Multiplies the low packed double precision floating-point value from the second source operand
to the low packed double precision floating-point value in the first source operand. From negated infinite precision
intermediate result, subtracts the low double precision floating-point value in the third source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VFNMSUB231SD: Multiplies the low packed double precision floating-point value from the second source to the low
packed double precision floating-point value in the third source operand. From negated infinite precision interme-
diate result, subtracts the low double precision floating-point value in the first source operand, performs rounding
and stores the resulting packed double precision floating-point value to the destination operand (first source
operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
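As an informal numerical cross-check (a sketch under stated assumptions, not the manual's reference model): since −(a*b) − c = −(a*b + c) and negation commutes with round-to-nearest, the low result of the unmasked intrinsic should equal -fma(a, b, c) from <math.h> under the default rounding mode. Assumes an FMA-capable target (e.g., -mfma):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.5, b = -2.25, c = 0.125;
    __m128d r = _mm_fnmsub_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    printf("%g vs %g\n", _mm_cvtsd_f64(r), -fma(a, b, c)); /* should match */
    return 0;
}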

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFNMSUB132SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(DEST[63:0]*SRC3[63:0]) - SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMSUB213SD DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*DEST[63:0]) - SRC3[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMSUB231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) - DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMSUB132SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (DEST[63:0]*SRC3[63:0]) - SRC2[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMSUB213SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (SRC2[63:0]*DEST[63:0]) - SRC3[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

VFNMSUB231SD DEST, SRC2, SRC3 (VEX encoded version)


DEST[63:0] := RoundFPControl_MXCSR(- (SRC2[63:0]*SRC3[63:0]) - DEST[63:0])
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFNMSUBxxxSD __m128d _mm_fnmsub_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFNMSUBxxxSD __m128d _mm_mask_fnmsub_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMSUBxxxSD __m128d _mm_maskz_fnmsub_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMSUBxxxSD __m128d _mm_mask3_fnmsub_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMSUBxxxSD __m128d _mm_mask_fnmsub_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFNMSUBxxxSD __m128d _mm_maskz_fnmsub_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFNMSUBxxxSD __m128d _mm_mask3_fnmsub_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFNMSUBxxxSD __m128d _mm_fnmsub_sd (__m128d a, __m128d b, __m128d c);

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VF[,N]MSUB[132,213,231]SH—Fused Multiply-Subtract of Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.66.MAP6.W0 9B /r VFMSUB132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm3/m16, subtract xmm2, and store the result in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 AB /r VFMSUB213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm2, subtract xmm3/m16, and store the result in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 BB /r VFMSUB231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm2 and xmm3/m16, subtract xmm1, and store the result in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 9F /r VFNMSUB132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm3/m16, and negate the value. Subtract xmm2 from this value, and store the result in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 AF /r VFNMSUB213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm2, and negate the value. Subtract xmm3/m16 from this value, and store the result in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 BF /r VFNMSUB231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm2 and xmm3/m16, and negate the value. Subtract xmm1 from this value, and store the result in xmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Scalar | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
This instruction performs a scalar multiply-subtract or negated multiply-subtract computation on the low FP16
values using three source operands and writes the result in the destination operand. The destination operand is also
the first source operand. The “N” (negated) forms of this instruction subtract the remaining operand from the
negated infinite precision intermediate product. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B − C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-8.
Bits 127:16 of the destination operand are preserved. Bits MAXVL-1:128 of the destination operand are zeroed. The
low FP16 element of the destination is updated according to the writemask.

Table 5-8. VF[,N]MSUB[132,213,231]SH Notation for Operands


Notation | Operands
132 | dest = ± dest*src3 - src2
231 | dest = ± src2*src3 - dest
213 | dest = ± src2*dest - src3



Operation
VF[,N]MSUB132SH DEST, SRC2, SRC3 (EVEX encoded versions)
IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-DEST.fp16[0]*SRC3.fp16[0] - SRC2.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(DEST.fp16[0]*SRC3.fp16[0] - SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0

VF[,N]MSUB213SH DEST, SRC2, SRC3 (EVEX encoded versions)


IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-SRC2.fp16[0]*DEST.fp16[0] - SRC3.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(SRC2.fp16[0]*DEST.fp16[0] - SRC3.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0

VF[,N]MSUB231SH DEST, SRC2, SRC3 (EVEX encoded versions)


IF EVEX.b = 1 and SRC3 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


IF *negative form*:
DEST.fp16[0] := RoundFPControl(-SRC2.fp16[0]*SRC3.fp16[0] - DEST.fp16[0])
ELSE:
DEST.fp16[0] := RoundFPControl(SRC2.fp16[0]*SRC3.fp16[0] - DEST.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

//DEST[127:16] remains unchanged


DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VFMSUB132SH, VFMSUB213SH, and VFMSUB231SH:
__m128h _mm_fmsub_round_sh (__m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask_fmsub_round_sh (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask3_fmsub_round_sh (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
__m128h _mm_maskz_fmsub_round_sh (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_fmsub_sh (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fmsub_sh (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fmsub_sh (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fmsub_sh (__mmask8 k, __m128h a, __m128h b, __m128h c);

VFNMSUB132SH, VFNMSUB213SH, and VFNMSUB231SH:


__m128h _mm_fnmsub_round_sh (__m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask_fnmsub_round_sh (__m128h a, __mmask8 k, __m128h b, __m128h c, const int rounding);
__m128h _mm_mask3_fnmsub_round_sh (__m128h a, __m128h b, __m128h c, __mmask8 k, const int rounding);
__m128h _mm_maskz_fnmsub_round_sh (__mmask8 k, __m128h a, __m128h b, __m128h c, const int rounding);
__m128h _mm_fnmsub_sh (__m128h a, __m128h b, __m128h c);
__m128h _mm_mask_fnmsub_sh (__m128h a, __mmask8 k, __m128h b, __m128h c);
__m128h _mm_mask3_fnmsub_sh (__m128h a, __m128h b, __m128h c, __mmask8 k);
__m128h _mm_maskz_fnmsub_sh (__mmask8 k, __m128h a, __m128h b, __m128h c);
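A minimal usage sketch of the unmasked scalar form (an illustration, not from the manual; assumes an AVX512-FP16 target):

#include <immintrin.h>

/* The low FP16 lane receives -(a0*b0) - c0; lanes 127:16 are copied
   from a, per the scalar semantics described above. */
__m128h neg_msub_low_fp16(__m128h a, __m128h b, __m128h c)
{
    return _mm_fnmsub_sh(a, b, c);
}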

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS—Fused Negative Multiply-Subtract of
Scalar Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W0 9F /r VFNMSUB132SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W0 AF /r VFNMSUB213SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m32 and put result in xmm1.
VEX.LIG.66.0F38.W0 BF /r VFNMSUB231SS xmm1, xmm2, xmm3/m32 | A | V/V | FMA | Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 9F /r VFNMSUB132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar single precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 AF /r VFNMSUB213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar single precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m32 and put result in xmm1.
EVEX.LLIG.66.0F38.W0 BF /r VFNMSUB231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar single precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and subtract xmm1 and put result in xmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | N/A | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | N/A
B | Tuple1 Scalar | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | N/A

Description
VFNMSUB132SS: Multiplies the low packed single precision floating-point value from the first source operand to the low packed single precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low single precision floating-point value in the second source operand, performs rounding and stores the resulting packed single precision floating-point value to the destination operand (first source operand).
VFNMSUB213SS: Multiplies the low packed single precision floating-point value from the second source operand to the low packed single precision floating-point value in the first source operand. From negated infinite precision intermediate result, subtracts the low single precision floating-point value in the third source operand, performs rounding and stores the resulting packed single precision floating-point value to the destination operand (first source operand).
VFNMSUB231SS: Multiplies the low packed single precision floating-point value from the second source to the low packed single precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low single precision floating-point value in the first source operand, performs rounding and stores the resulting packed single precision floating-point value to the destination operand (first source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).

VFNMSUB132SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(DEST[31:0]*SRC3[31:0]) - SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMSUB213SS DEST, SRC2, SRC3 (EVEX encoded version)


IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*DEST[31:0]) - SRC3[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMSUB231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMSUB132SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (DEST[31:0]*SRC3[31:0]) - SRC2[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMSUB213SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (SRC2[31:0]*DEST[31:0]) - SRC3[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

VFNMSUB231SS DEST, SRC2, SRC3 (VEX encoded version)


DEST[31:0] := RoundFPControl_MXCSR(- (SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFNMSUBxxxSS __m128 _mm_fnmsub_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_mask_fnmsub_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMSUBxxxSS __m128 _mm_maskz_fnmsub_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMSUBxxxSS __m128 _mm_mask3_fnmsub_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMSUBxxxSS __m128 _mm_mask_fnmsub_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_maskz_fnmsub_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_mask3_fnmsub_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFNMSUBxxxSS __m128 _mm_fnmsub_ss (__m128 a, __m128 b, __m128 c);
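
The following informative sketch (not SDM text) shows the arithmetic these intrinsics perform on the low element; the same value can be computed with the C99 fma() library function, since -(a*b) - c = fma(-a, b, -c):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(2.0f), b = _mm_set_ss(3.0f), c = _mm_set_ss(4.0f);
    float r = _mm_cvtss_f32(_mm_fnmsub_ss(a, b, c)); /* requires FMA support */
    printf("%f %f\n", r, fma(-2.0, 3.0, -4.0));      /* both print -10.000000 */
    return 0;
}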

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.

Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”

EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VFPCLASSPD—Tests Types of Packed Float64 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, xmm2/m128/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.256.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, ymm2/m256/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.512.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, zmm2/m512/m64bcst, imm8 | A | V/V | AVX512DQ OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
The FPCLASSPD instruction checks the packed double precision floating-point values for special categories, speci-
fied by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the
input data element is classified against. The classified results of all specified categories of an input value are ORed
together to form the final boolean result for the input element. The result of each element is written to the corre-
sponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:8/4/2] of the destination are
cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.

[Figure: imm8 bit assignments, bit 7 = SNaN, bit 6 = Neg. Finite, bit 5 = Denormal, bit 4 = Neg. INF, bit 3 = +INF, bit 2 = Neg. 0, bit 1 = +0, bit 0 = QNaN]

Figure 5-13. Imm8 Byte Specifier of Special Case Floating-Point Values for VFPCLASSPD/SD/PS/SS



Table 5-11. Classifier Operations for VFPCLASSPD/SD/PS/SS
Bits Category Classifier
imm8[0] QNAN Checks for QNaN
imm8[1] PosZero Checks for +0
imm8[2] NegZero Checks for -0
imm8[3] PosINF Checks for +INF
imm8[4] NegINF Checks for -INF
imm8[5] Denormal Checks for Denormal
imm8[6] Negative Checks for Negative finite
imm8[7] SNAN Checks for SNaN

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 64-bit memory location.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Operation
CheckFPClassDP (tsrc[63:0], imm8[7:0]){

//* Start checking the source operand for special type *//
NegNum := tsrc[63];
IF (tsrc[62:52]=07FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[62:52]=0h) Then ExpAllZeros := 1; FI;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[51:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[51];

sNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN


qNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
Pzero_res := NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
Nzero_res := NegNum AND ExpAllZeros AND MantAllZeros; // -0
PInf_res := NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
NInf_res := NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
Denorm_res := ExpAllZeros AND NOT(MantAllZeros); // denorm
FinNeg_res := NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite

bResult := ( imm8[0] AND qNaN_res ) OR (imm8[1] AND Pzero_res ) OR


( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
Return bResult;
} //* end of CheckFPClassDP() *//
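
An informative C model of CheckFPClassDP is sketched below (this helper is not part of the SDM; it assumes MXCSR.DAZ = 0 and operates on the raw IEEE-754 encoding):

#include <stdint.h>
#include <string.h>

static int check_fp_class_dp(double x, uint8_t imm8)
{
    uint64_t b;
    memcpy(&b, &x, sizeof b);                     /* raw IEEE-754 bits */
    int neg = (int)(b >> 63);
    int exp_ones = ((b >> 52) & 0x7FF) == 0x7FF;
    int exp_zeros = ((b >> 52) & 0x7FF) == 0;
    uint64_t frac = b & 0xFFFFFFFFFFFFFull;
    int signaling = (int)((frac >> 51) & 1);      /* MSB of the fraction */
    int zero = exp_zeros && (frac == 0);
    int res = 0;
    res |= (imm8 & 0x01) && (exp_ones && frac && signaling);   /* QNaN     */
    res |= (imm8 & 0x02) && (!neg && zero);                    /* +0       */
    res |= (imm8 & 0x04) && (neg && zero);                     /* -0       */
    res |= (imm8 & 0x08) && (!neg && exp_ones && !frac);       /* +Inf     */
    res |= (imm8 & 0x10) && (neg && exp_ones && !frac);        /* -Inf     */
    res |= (imm8 & 0x20) && (exp_zeros && frac);               /* denormal */
    res |= (imm8 & 0x40) && (neg && !exp_ones && !zero);       /* -finite  */
    res |= (imm8 & 0x80) && (exp_ones && frac && !signaling);  /* SNaN     */
    return res != 0;
}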



VFPCLASSPD (EVEX Encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC *is memory*)
THEN
DEST[j] := CheckFPClassDP(SRC1[63:0], imm8[7:0]);
ELSE
DEST[j] := CheckFPClassDP(SRC1[i+63:i], imm8[7:0]);
FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSPD __mmask8 _mm512_fpclass_pd_mask( __m512d a, int c);
VFPCLASSPD __mmask8 _mm512_mask_fpclass_pd_mask( __mmask8 m, __m512d a, int c)
VFPCLASSPD __mmask8 _mm256_fpclass_pd_mask( __m256d a, int c)
VFPCLASSPD __mmask8 _mm256_mask_fpclass_pd_mask( __mmask8 m, __m256d a, int c)
VFPCLASSPD __mmask8 _mm_fpclass_pd_mask( __m128d a, int c)
VFPCLASSPD __mmask8 _mm_mask_fpclass_pd_mask( __mmask8 m, __m128d a, int c)
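
For illustration only (not SDM text), a typical use is to build a NaN mask; imm8 = 81H sets bit 0 (QNaN) and bit 7 (SNaN) of Table 5-11:

#include <immintrin.h>

/* Requires AVX512DQ; returns one mask bit per double in v. */
__mmask8 nan_lanes(__m512d v)
{
    return _mm512_fpclass_pd_mask(v, 0x81);
}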

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VFPCLASSPH—Test Types of Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.NP.0F3A.W0 66 /r /ib VFPCLASSPH k1{k2}, xmm1/m128/m16bcst, imm8 | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Test the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.256.NP.0F3A.W0 66 /r /ib VFPCLASSPH k1{k2}, ymm1/m256/m16bcst, imm8 | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Test the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.512.NP.0F3A.W0 66 /r /ib VFPCLASSPH k1{k2}, zmm1/m512/m16bcst, imm8 | A | V/V | AVX512-FP16 OR AVX10.1¹ | Test the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 (r) N/A

Description
This instruction checks the packed FP16 values in the source operand for special categories, specified by the set
bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element
is classified against; see Table 5-12 for the categories. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the corre-
sponding bits in the destination mask register according to the writemask.

Table 5-12. Classifier Operations for VFPCLASSPH/VFPCLASSSH


Bits Category Classifier
imm8[0] QNAN Checks for QNAN
imm8[1] PosZero Checks for +0
imm8[2] NegZero Checks for -0
imm8[3] PosINF Checks for +∞
imm8[4] NegINF Checks for −∞
imm8[5] Denormal Checks for Denormal
imm8[6] Negative Checks for Negative finite
imm8[7] SNAN Checks for SNAN



Operation
def check_fp_class_fp16(tsrc, imm8):
negative := tsrc[15]
exponent_all_ones := (tsrc[14:10] == 0x1F)
exponent_all_zeros := (tsrc[14:10] == 0)
mantissa_all_zeros := (tsrc[9:0] == 0)
zero := exponent_all_zeros and mantissa_all_zeros
signaling_bit := tsrc[9]

snan := exponent_all_ones and not(mantissa_all_zeros) and not(signaling_bit)


qnan := exponent_all_ones and not(mantissa_all_zeros) and signaling_bit
positive_zero := not(negative) and zero
negative_zero := negative and zero
positive_infinity := not(negative) and exponent_all_ones and mantissa_all_zeros
negative_infinity := negative and exponent_all_ones and mantissa_all_zeros
denormal := exponent_all_zeros and not(mantissa_all_zeros)
finite_negative := negative and not(exponent_all_ones) and not(zero)

return (imm8[0] and qnan) OR


(imm8[1] and positive_zero) OR
(imm8[2] and negative_zero) OR
(imm8[3] and positive_infinity) OR
(imm8[4] and negative_infinity) OR
(imm8[5] and denormal) OR
(imm8[6] and finite_negative) OR
(imm8[7] and snan)

VFPCLASSPH dest{k2}, src, imm8


VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k2[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := SRC.fp16[0]
ELSE:
tsrc := SRC.fp16[i]
DEST.bit[i] := check_fp_class_fp16(tsrc, imm8)
ELSE:
DEST.bit[i] := 0

DEST[MAXKL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSPH __mmask8 _mm_fpclass_ph_mask (__m128h a, int imm8);
VFPCLASSPH __mmask8 _mm_mask_fpclass_ph_mask (__mmask8 k1, __m128h a, int imm8);
VFPCLASSPH __mmask16 _mm256_fpclass_ph_mask (__m256h a, int imm8);
VFPCLASSPH __mmask16 _mm256_mask_fpclass_ph_mask (__mmask16 k1, __m256h a, int imm8);
VFPCLASSPH __mmask32 _mm512_fpclass_ph_mask (__m512h a, int imm8);
VFPCLASSPH __mmask32 _mm512_mask_fpclass_ph_mask (__mmask32 k1, __m512h a, int imm8);
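
For illustration only (not SDM text), the packed form can flag denormal FP16 lanes with imm8 = 20H (bit 5 of Table 5-12); the __m512h type requires compiler support for AVX512-FP16:

#include <immintrin.h>

__mmask32 denormal_fp16_lanes(__m512h v)
{
    return _mm512_fpclass_ph_mask(v, 0x20);
}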

SIMD Floating-Point Exceptions


None.



Other Exceptions
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”



VFPCLASSPS—Tests Types of Packed Float32 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, xmm2/m128/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.256.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, ymm2/m256/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.512.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, zmm2/m512/m32bcst, imm8 | A | V/V | AVX512DQ OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
The FPCLASSPS instruction checks the packed single precision floating-point values for special categories, specified
by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input
data element is classified against. The classified results of all specified categories of an input value are ORed
together to form the final boolean result for the input element. The result of each element is written to the corre-
sponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:16/8/4] of the destination are
cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 32-bit memory location.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]){

//* Start checking the source operand for special type *//
NegNum := tsrc[31];
IF (tsrc[30:23]=0FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[30:23]=0h) Then ExpAllZeros := 1; FI;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[22:0]=0h) Then



MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[22];

sNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN


qNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
Pzero_res := NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
Nzero_res := NegNum AND ExpAllZeros AND MantAllZeros; // -0
PInf_res := NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
NInf_res := NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
Denorm_res := ExpAllZeros AND NOT(MantAllZeros); // denorm
FinNeg_res := NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite

bResult := ( imm8[0] AND qNaN_res ) OR (imm8[1] AND Pzero_res ) OR


( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
Return bResult;
} //* end of CheckFPClassSP() *//

VFPCLASSPS (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC *is memory*)
THEN
DEST[j] := CheckFPClassSP(SRC1[31:0], imm8[7:0]);
ELSE
DEST[j] := CheckFPClassSP(SRC1[i+31:i], imm8[7:0]);
FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSPS __mmask16 _mm512_fpclass_ps_mask( __m512 a, int c);
VFPCLASSPS __mmask16 _mm512_mask_fpclass_ps_mask( __mmask16 m, __m512 a, int c)
VFPCLASSPS __mmask8 _mm256_fpclass_ps_mask( __m256 a, int c)
VFPCLASSPS __mmask8 _mm256_mask_fpclass_ps_mask( __mmask8 m, __m256 a, int c)
VFPCLASSPS __mmask8 _mm_fpclass_ps_mask( __m128 a, int c)
VFPCLASSPS __mmask8 _mm_mask_fpclass_ps_mask( __mmask8 m, __m128 a, int c)
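
For illustration only (not SDM text), a common pattern is to flush denormal lanes to zero by inverting the classification mask; imm8 = 20H selects the denormal test of Table 5-11:

#include <immintrin.h>

/* Requires AVX512DQ; lanes classified as denormal are zeroed. */
__m512 flush_denormals(__m512 v)
{
    __mmask16 den = _mm512_fpclass_ps_mask(v, 0x20);
    return _mm512_maskz_mov_ps(_knot_mask16(den), v);
}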

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VFPCLASSSD—Tests Type of a Scalar Float64 Value
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.66.0F3A.W1 67 /r ib VFPCLASSSD k2 {k1}, xmm2/m64, imm8 | A | V/V | AVX512DQ OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
The FPCLASSSD instruction checks the low double precision floating-point value in the source operand for special
categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point
values that the input data element is classified against. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the low bit
in a mask register k2 according to the writemask k1. Bits MAX_KL-1: 1 of the destination are cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Operation
CheckFPClassDP (tsrc[63:0], imm8[7:0]){

NegNum := tsrc[63];
IF (tsrc[62:52]=07FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[62:52]=0h) Then ExpAllZeros := 1; FI;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[51:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[51];

sNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN


qNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
Pzero_res := NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
Nzero_res := NegNum AND ExpAllZeros AND MantAllZeros; // -0
PInf_res := NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
NInf_res := NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
Denorm_res := ExpAllZeros AND NOT(MantAllZeros); // denorm
FinNeg_res := NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite



bResult := ( imm8[0] AND qNaN_res ) OR (imm8[1] AND Pzero_res ) OR
( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
Return bResult;
} //* end of CheckFPClassDP() *//

VFPCLASSSD (EVEX encoded version)


IF k1[0] OR *no writemask*
THEN DEST[0] :=
CheckFPClassDP(SRC1[63:0], imm8[7:0])
ELSE DEST[0] := 0 ; zeroing-masking only
FI;
DEST[MAX_KL-1:1] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSSD __mmask8 _mm_fpclass_sd_mask( __m128d a, int c)
VFPCLASSSD __mmask8 _mm_mask_fpclass_sd_mask( __mmask8 m, __m128d a, int c)
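
For illustration only (not SDM text), the scalar form can test a single double for infinity; imm8 = 18H sets bits 3 (+INF) and 4 (-INF) of Table 5-11:

#include <immintrin.h>

/* Requires AVX512DQ; returns 1 if x is +INF or -INF. */
int is_infinite_f64(double x)
{
    return (int)_mm_fpclass_sd_mask(_mm_set_sd(x), 0x18);
}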

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VFPCLASSSH—Test Types of Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.NP.0F3A.W0 67 /r /ib VFPCLASSSH k1{k2}, xmm1/m16, imm8 | A | V/V | AVX512-FP16 OR AVX10.1¹ | Test the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) imm8 (r) N/A

Description
This instruction checks the low FP16 value in the source operand for special categories, specified by the set bits in
the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is clas-
sified against; see Table 5-12 for the categories. The classified results of all specified categories of an input value
are ORed together to form the final boolean result for the input element. The result is written to the low bit in the
destination mask register according to the writemask. The other bits in the destination mask register are zeroed.

Operation
VFPCLASSSH dest{k2}, src, imm8
IF k2[0] or *no writemask*:
DEST.bit[0] := check_fp_class_fp16(src.fp16[0], imm8) // see VFPCLASSPH
ELSE:
DEST.bit[0] := 0

DEST[MAXKL-1:1] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSSH __mmask8 _mm_fpclass_sh_mask (__m128h a, int imm8);
VFPCLASSSH __mmask8 _mm_mask_fpclass_sh_mask (__mmask8 k1, __m128h a, int imm8);
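
For illustration only (not SDM text), the scalar FP16 form can screen a value for NaN before arithmetic; imm8 = 81H combines the QNaN and SNaN tests of Table 5-12:

#include <immintrin.h>

/* Requires AVX512-FP16; tests the low FP16 element of v. */
int fp16_is_nan(__m128h v)
{
    return (int)_mm_fpclass_sh_mask(v, 0x81);
}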

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”



VFPCLASSSS—Tests Type of a Scalar Float32 Value
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.66.0F3A.W0 67 /r ib VFPCLASSSS k2 {k1}, xmm2/m32, imm8 | A | V/V | AVX512DQ OR AVX10.1¹ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
The FPCLASSSS instruction checks the low single precision floating-point value in the source operand for special
categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point
values that the input data element is classified against. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the low bit
in a mask register k2 according to the writemask k1. Bits MAX_KL-1: 1 of the destination are cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]){

//* Start checking the source operand for special type *//
NegNum := tsrc[31];
IF (tsrc[30:23]=0FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[30:23]=0h) Then ExpAllZeros := 1; FI;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[22:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[22];

sNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN


qNaN_res := ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
Pzero_res := NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
Nzero_res := NegNum AND ExpAllZeros AND MantAllZeros; // -0
PInf_res := NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
NInf_res := NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
Denorm_res := ExpAllZeros AND NOT(MantAllZeros); // denorm
FinNeg_res := NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite



bResult := ( imm8[0] AND qNaN_res ) OR (imm8[1] AND Pzero_res ) OR
( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
Return bResult;
} //* end of CheckFPClassSP() *//

VFPCLASSSS (EVEX encoded version)


IF k1[0] OR *no writemask*
THEN DEST[0] :=
CheckFPClassSP(SRC1[31:0], imm8[7:0])
ELSE DEST[0] := 0 ; zeroing-masking only
FI;
DEST[MAX_KL-1:1] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VFPCLASSSS __mmask8 _mm_fpclass_ss_mask( __m128 a, int c)
VFPCLASSSS __mmask8 _mm_mask_fpclass_ss_mask( __mmask8 m, __m128 a, int c)
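
For illustration only (not SDM text), the scalar form can test a float for either signed zero; imm8 = 06H sets bits 1 (+0) and 2 (-0) of Table 5-11:

#include <immintrin.h>

/* Requires AVX512DQ; returns 1 if x is +0.0f or -0.0f. */
int is_signed_zero(float x)
{
    return (int)_mm_fpclass_ss_mask(_mm_set_ss(x), 0x06);
}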

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword Indices
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 92 /vsib VGATHERDPS xmm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.256.66.0F38.W0 92 /vsib VGATHERDPS ymm1 {k1}, vm32y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.512.66.0F38.W0 92 /vsib VGATHERDPS zmm1 {k1}, vm32z | A | V/V | AVX512F OR AVX10.1¹ | Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.128.66.0F38.W1 92 /vsib VGATHERDPD xmm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed dword indices, gather float64 vector into float64 vector xmm1 using k1 as completion mask.
EVEX.256.66.0F38.W1 92 /vsib VGATHERDPD ymm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed dword indices, gather float64 vector into float64 vector ymm1 using k1 as completion mask.
EVEX.512.66.0F38.W1 92 /vsib VGATHERDPD zmm1 {k1}, vm32y | A | V/V | AVX512F OR AVX10.1¹ | Using signed dword indices, gather float64 vector into float64 vector zmm1 using k1 as completion mask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) BaseReg (R): VSIB:base, VectorReg(R): VSIB:index N/A N/A

Description
A set of single precision/double precision floating-point memory locations pointed to by base address BASE_ADDR
and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements
are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only
be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of
the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it
triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.

• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a vector register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VGATHERDPS (EVEX encoded version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j]
THEN DEST[i+31:i] :=
MEM[BASE_ADDR +
SignExtend(VINDEX[i+31:i]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+31:i] := remains unchanged*
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL] := 0

VGATHERDPD (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j]
THEN DEST[i+63:i] := MEM[BASE_ADDR +
SignExtend(VINDEX[k+31:k]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+63:i] := remains unchanged*
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGATHERDPD __m512d _mm512_i32gather_pd( __m256i vdx, void * base, int scale);
VGATHERDPD __m512d _mm512_mask_i32gather_pd(__m512d s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERDPD __m256d _mm256_mmask_i32gather_pd(__m256d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERDPD __m128d _mm_mmask_i32gather_pd(__m128d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERDPS __m512 _mm512_i32gather_ps( __m512i vdx, void * base, int scale);
VGATHERDPS __m512 _mm512_mask_i32gather_ps(__m512 s, __mmask16 k, __m512i vdx, void * base, int scale);
VGATHERDPS __m256 _mm256_mmask_i32gather_ps(__m256 s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERDPS __m128 _mm_mmask_i32gather_ps(__m128 s, __mmask8 k, __m128i vdx, void * base, int scale);
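
For illustration only (not SDM text), a simple table-lookup gather; the scale argument is 4 because each gathered element is a 4-byte float:

#include <immintrin.h>

/* Requires AVX512F; returns table[idx[i]] for i = 0..15. */
__m512 gather16_floats(const float *table, __m512i idx)
{
    return _mm512_i32gather_ps(idx, table, 4);
}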

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VGATHERQPS/VGATHERQPD—Gather Packed Single, Packed Double with Signed Qword Indices
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 93 /vsib VGATHERQPS xmm1 {k1}, vm64x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.256.66.0F38.W0 93 /vsib VGATHERQPS xmm1 {k1}, vm64y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.512.66.0F38.W0 93 /vsib VGATHERQPS ymm1 {k1}, vm64z | A | V/V | AVX512F OR AVX10.1¹ | Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.128.66.0F38.W1 93 /vsib VGATHERQPD xmm1 {k1}, vm64x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed qword indices, gather float64 vector into float64 vector xmm1 using k1 as completion mask.
EVEX.256.66.0F38.W1 93 /vsib VGATHERQPD ymm1 {k1}, vm64y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Using signed qword indices, gather float64 vector into float64 vector ymm1 using k1 as completion mask.
EVEX.512.66.0F38.W1 93 /vsib VGATHERQPD zmm1 {k1}, vm64z | A | V/V | AVX512F OR AVX10.1¹ | Using signed qword indices, gather float64 vector into float64 vector zmm1 using k1 as completion mask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) BaseReg (R): VSIB:base, VectorReg(R): VSIB:index N/A N/A

Description
A set of 8 single precision/double precision floating-point memory locations pointed to by base address BASE_ADDR
and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements
are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only
be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of
the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it
triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.

• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VGATHERQPS (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1

i := j * 32
k := j * 64
IF k1[j]
THEN DEST[i+31:i] :=
MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+31:i] := remains unchanged*
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL/2] := 0

VGATHERQPD (EVEX encoded version)

(KL, VL) = (2, 128), (4, 256), (8, 512)


FOR j := 0 TO KL-1
i := j * 64
IF k1[j]
THEN DEST[i+63:i] := MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+63:i] := remains unchanged*
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGATHERQPD __m512d _mm512_i64gather_pd( __m512i vdx, void * base, int scale);
VGATHERQPD __m512d _mm512_mask_i64gather_pd(__m512d s, __mmask8 k, __m512i vdx, void * base, int scale);
VGATHERQPD __m256d _mm256_mmask_i64gather_pd(__m256d s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPD __m128d _mm_mmask_i64gather_pd(__m128d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_i64gather_ps( __m512i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_mask_i64gather_ps(__m256 s, __mmask8 k, __m512i vdx, void * base, int scale);
VGATHERQPS __m128 _mm256_mmask_i64gather_ps(__m128 s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPS __m128 _mm_mmask_i64gather_ps(__m128 s, __mmask8 k, __m128i vdx, void * base, int scale);
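
For illustration only (not SDM text): with qword indices the single-precision result is half-width, so eight 64-bit indices yield a __m256 of eight floats:

#include <immintrin.h>

/* Requires AVX512F. */
__m256 gather8_floats(const float *table, __m512i idx64)
{
    return _mm512_i64gather_ps(idx64, table, 4);
}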

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VGETEXPPD—Convert Exponents of Packed Double Precision Floating-Point Values to Double
Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 42 /r VGETEXPPD xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert the exponent of packed double precision floating-point values in the source operand to double precision floating-point results representing unbiased integer exponents and stores the results in the destination register.
EVEX.256.66.0F38.W1 42 /r VGETEXPPD ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert the exponent of packed double precision floating-point values in the source operand to double precision floating-point results representing unbiased integer exponents and stores the results in the destination register.
EVEX.512.66.0F38.W1 42 /r VGETEXPPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} | A | V/V | AVX512F OR AVX10.1¹ | Convert the exponent of packed double precision floating-point values in the source operand to double precision floating-point results representing unbiased integer exponents and stores the results in the destination under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Extracts the biased exponents from the normalized double precision floating-point representation of each qword
data element of the source operand (the second operand) as unbiased signed integer value, or convert the
denormal representation of input data to unbiased negative integer values. Each integer value of the unbiased
exponent is converted to double precision floating-point value and written to the corresponding qword elements of
the destination operand (the first operand) as double precision floating-point numbers.
The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can
be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from
a 64-bit memory location.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-13.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for the greatest integer not exceeding real number x.

Table 5-13. VGETEXPPD/SD Special Cases
Input Operand | Result | Comments
src1 = NaN | QNaN(src1) | If (SRC = SNaN), then #IE.
0 < |src1| < INF | floor(log2(|src1|)) | If (SRC = denormal), then #DE.
|src1| = +INF | +INF |
|src1| = 0 | -INF |

Operation
NormalizeExpTinyDPFP(SRC[63:0])
{
// Jbit is the hidden integral bit of a floating-point number. In case of denormal number it has the value of ZERO.
Src.Jbit := 0;
Dst.exp := 1;
Dst.fraction := SRC[51:0];
WHILE(Src.Jbit = 0)
{
Src.Jbit := Dst.fraction[51]; // Get the fraction MSB
Dst.fraction := Dst.fraction << 1 ; // One bit shift left
Dst.exp-- ; // Decrement the exponent
}
Dst.fraction := 0; // zero out fraction bits
Dst.sign := 1; // Return negative sign
TMP[63:0] := MXCSR.DAZ? 0 : (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction) ;
Return (TMP[63:0]);
}

ConvertExpDPFP(SRC[63:0])
{
Src.sign := 0; // Zero out sign bit
Src.exp := SRC[62:52];
Src.fraction := SRC[51:0];
// Check for NaN
IF (SRC = NaN)
{
IF ( SRC = SNAN ) SET IE;
Return QNAN(SRC);
}
// Check for +INF
IF (Src = +INF) RETURN (Src);

// Check if zero operand
IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1)))
{
Return (-INF);
}
ELSE // check if denormal operand (notice that MXCSR.DAZ = 0)
{
IF ((Src.exp = 0) AND (Src.fraction != 0))
{
TMP[63:0] := NormalizeExpTinyDPFP(SRC[63:0]) ; // Get Normalized Exponent
Set #DE
}
ELSE // exponent value is correct
{
TMP[63:0] := (Src.sign << 63) OR (Src.exp << 52) OR (Src.fraction) ;
}
TMP := SAR(TMP, 52) ; // Shift Arithmetic Right
TMP := TMP - 1023; // Subtract Bias
Return CvtI2D(TMP); // Convert INT to double precision floating-point number
}
}

VGETEXPPD (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN
DEST[i+63:i] :=
ConvertExpDPFP(SRC[63:0])
ELSE
DEST[i+63:i] :=
ConvertExpDPFP(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETEXPPD __m512d _mm512_getexp_pd(__m512d a);
VGETEXPPD __m512d _mm512_mask_getexp_pd(__m512d s, __mmask8 k, __m512d a);
VGETEXPPD __m512d _mm512_maskz_getexp_pd( __mmask8 k, __m512d a);
VGETEXPPD __m512d _mm512_getexp_round_pd(__m512d a, int sae);
VGETEXPPD __m512d _mm512_mask_getexp_round_pd(__m512d s, __mmask8 k, __m512d a, int sae);
VGETEXPPD __m512d _mm512_maskz_getexp_round_pd( __mmask8 k, __m512d a, int sae);
VGETEXPPD __m256d _mm256_getexp_pd(__m256d a);
VGETEXPPD __m256d _mm256_mask_getexp_pd(__m256d s, __mmask8 k, __m256d a);
VGETEXPPD __m256d _mm256_maskz_getexp_pd( __mmask8 k, __m256d a);
VGETEXPPD __m128d _mm_getexp_pd(__m128d a);
VGETEXPPD __m128d _mm_mask_getexp_pd(__m128d s, __mmask8 k, __m128d a);
VGETEXPPD __m128d _mm_maskz_getexp_pd( __mmask8 k, __m128d a);
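
For illustration only (not SDM text), GETEXP of 8.0 is floor(log2(8.0)) = 3.0 in every lane:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d e = _mm512_getexp_pd(_mm512_set1_pd(8.0)); /* requires AVX512F */
    double out[8];
    _mm512_storeu_pd(out, e);
    printf("%f\n", out[0]);                            /* prints 3.000000 */
    return 0;
}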

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VGETEXPPH—Convert Exponents of Packed FP16 Values to FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.MAP6.W0 42 /r VGETEXPPH xmm1{k1}{z}, xmm2/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Convert the exponent of FP16 values in the source operand to FP16 results representing unbiased integer exponents and stores the results in the destination register subject to writemask k1.
EVEX.256.66.MAP6.W0 42 /r VGETEXPPH ymm1{k1}{z}, ymm2/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Convert the exponent of FP16 values in the source operand to FP16 results representing unbiased integer exponents and stores the results in the destination register subject to writemask k1.
EVEX.512.66.MAP6.W0 42 /r VGETEXPPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Convert the exponent of FP16 values in the source operand to FP16 results representing unbiased integer exponents and stores the results in the destination register subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction extracts the biased exponents from the normalized FP16 representation of each word element of
the source operand (the second operand) as unbiased signed integer value, or convert the denormal representa-
tion of input data to unbiased negative integer values. Each integer value of the unbiased exponent is converted to
an FP16 value and written to the corresponding word elements of the destination operand (the first operand) as
FP16 numbers.
The destination elements are updated according to the writemask.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-14.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Software usage of VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP operation
and the GETMANT operation (see VGETMANTPH). Thus, the VGETEXPPH instruction does not require software to
handle SIMD floating-point exceptions.

Table 5-14. VGETEXPPH/VGETEXPSH Special Cases

Input Operand | Result | Comments
src1 = NaN | QNaN(src1) | If (SRC = SNaN), then #IE.
0 < |src1| < INF | floor(log2(|src1|)) | If (SRC = denormal), then #DE.
|src1| = +INF | +INF |
|src1| = 0 | -INF |



Operation
def normalize_exponent_tiny_fp16(src):
jbit := 0
// src & dst are FP16 numbers with sign(1b), exp(5b) and fraction (10b) fields
dst.exp := 1 // write bits 14:10
dst.fraction := src.fraction // copy bits 9:0
while jbit == 0:
jbit := dst.fraction[9] // msb of the fraction
dst.fraction := dst.fraction << 1
dst.exp := dst.exp - 1
dst.fraction := 0
return dst

def getexp_fp16(src):
src.sign := 0 // make positive
exponent_all_ones := (src[14:10] == 0x1F)
exponent_all_zeros := (src[14:10] == 0)
mantissa_all_zeros := (src[9:0] == 0)
zero := exponent_all_zeros and mantissa_all_zeros
signaling_bit := src[9]

nan := exponent_all_ones and not(mantissa_all_zeros)


snan := nan and not(signaling_bit)
qnan := nan and signaling_bit
positive_infinity := exponent_all_ones and mantissa_all_zeros // sign was cleared above, so src is non-negative
denormal := exponent_all_zeros and not(mantissa_all_zeros)

if nan:
if snan:
MXCSR.IE := 1
return qnan(src) // convert snan to a qnan
if positive_infinity:
return src
if zero:
return -INF
if denormal:
tmp := normalize_exponent_tiny_fp16(src)
MXCSR.DE := 1
else:
tmp := src
tmp := SAR(tmp, 10) // shift arithmetic right
tmp := tmp - 15 // subtract bias
return convert_integer_to_fp16(tmp)



VGETEXPPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := getexp_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETEXPPH __m128h _mm_getexp_ph (__m128h a);
VGETEXPPH __m128h _mm_mask_getexp_ph (__m128h src, __mmask8 k, __m128h a);
VGETEXPPH __m128h _mm_maskz_getexp_ph (__mmask8 k, __m128h a);
VGETEXPPH __m256h _mm256_getexp_ph (__m256h a);
VGETEXPPH __m256h _mm256_mask_getexp_ph (__m256h src, __mmask16 k, __m256h a);
VGETEXPPH __m256h _mm256_maskz_getexp_ph (__mmask16 k, __m256h a);
VGETEXPPH __m512h _mm512_getexp_ph (__m512h a);
VGETEXPPH __m512h _mm512_mask_getexp_ph (__m512h src, __mmask32 k, __m512h a);
VGETEXPPH __m512h _mm512_maskz_getexp_ph (__mmask32 k, __m512h a);
VGETEXPPH __m512h _mm512_getexp_round_ph (__m512h a, const int sae);
VGETEXPPH __m512h _mm512_mask_getexp_round_ph (__m512h src, __mmask32 k, __m512h a, const int sae);
VGETEXPPH __m512h _mm512_maskz_getexp_round_ph (__mmask32 k, __m512h a, const int sae);
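
For illustration only (not SDM text), the unmasked form returns the unbiased exponent of each FP16 lane as an FP16 value; the __m512h type requires AVX512-FP16 compiler support:

#include <immintrin.h>

__m512h fp16_lane_exponents(__m512h v)
{
    return _mm512_getexp_ph(v);
}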

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VGETEXPPS—Convert Exponents of Packed Single Precision Floating-Point Values to Single
Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 42 /r VGETEXPPS xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert the exponent of packed single-precision floating-point values in the source operand to single-precision floating-point results representing unbiased integer exponents and stores the results in the destination register.
EVEX.256.66.0F38.W0 42 /r VGETEXPPS ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Convert the exponent of packed single-precision floating-point values in the source operand to single-precision floating-point results representing unbiased integer exponents and stores the results in the destination register.
EVEX.512.66.0F38.W0 42 /r VGETEXPPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae} | A | V/V | AVX512F OR AVX10.1¹ | Convert the exponent of packed single-precision floating-point values in the source operand to single-precision floating-point results representing unbiased integer exponents and stores the results in the destination register.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Extracts the biased exponent from the normalized single precision floating-point representation of each dword
element of the source operand (the second operand) as an unbiased signed integer value, or converts the denormal
representation of input data to an unbiased negative integer value. Each unbiased exponent integer is converted
to a single precision floating-point value and written to the corresponding dword element of the destination
operand (the first operand) as a single precision floating-point number.
The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can
be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from
a 32-bit memory location.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input values in
denormal representation). Special cases of input values are listed in Table 5-15.
The formula is:
GETEXP(x) = floor(log2(|x|))
The notation floor(x) stands for the maximal integer not exceeding the real number x.
Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP
operation and the GETMANT operation (see VGETMANTPD). Thus the VGETEXPxx instructions do not require software
to handle SIMD floating-point exceptions.

Table 5-15. VGETEXPPS/SS Special Cases
Input Operand        Result                   Comments
src1 = NaN           QNaN(src1)               If (SRC = SNaN) then #IE
0 < |src1| < INF     floor(log2(|src1|))      If (SRC = denormal) then #DE
|src1| = +INF        +INF
|src1| = 0           -INF

Figure 5-14 illustrates the VGETEXPPS functionality on input values with normalized representation.

[Figure 5-14 shows the single-precision bit layout (bit 31 = s, bits 30:23 = exp, bits 22:0 = fraction) and walks
Src = 2^1 through the operation: SAR Src, 23 yields 080h (the biased exponent); adding -Bias (0FFFFFF81h, i.e.,
subtracting 127) gives Tmp - Bias = 1; Cvt_PI2PS(01h) then produces the single-precision result 2^0 = 1.0.]

Figure 5-14. VGETEXPPS Functionality On Normal Input Values
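The same walk-through can be reproduced in C with plain bit manipulation. This is a minimal sketch of the figure's three steps (mask the sign, arithmetic shift right by 23, subtract the bias, convert to float), not the instruction itself:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        float src = 2.0f;                     /* biased exponent field = 080h */
        uint32_t bits;
        memcpy(&bits, &src, sizeof bits);
        int32_t tmp = (int32_t)(bits & 0x7FFFFFFFu) >> 23;  /* SAR Src, 23 */
        tmp -= 127;                                         /* subtract bias */
        printf("%f\n", (float)tmp);           /* Cvt_PI2PS: prints 1.000000 */
        return 0;
    }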

Operation
NormalizeExpTinySPFP(SRC[31:0])
{
    // Jbit is the hidden integral bit of a floating-point number. In case of denormal number it has the value of ZERO.
    Src.Jbit := 0;
    Dst.exp := 1;
    Dst.fraction := SRC[22:0];
    WHILE (Src.Jbit = 0)
    {
        Src.Jbit := Dst.fraction[22]; // Get the fraction MSB
        Dst.fraction := Dst.fraction << 1; // One bit shift left
        Dst.exp--; // Decrement the exponent
    }
    Dst.fraction := 0; // zero out fraction bits
    Dst.sign := 1; // Return negative sign
    TMP[31:0] := MXCSR.DAZ ? 0 : (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction);
    Return (TMP[31:0]);
}
ConvertExpSPFP(SRC[31:0])
{
    Src.sign := 0; // Zero out sign bit
    Src.exp := SRC[30:23];
    Src.fraction := SRC[22:0];
    // Check for NaN
    IF (SRC = NaN)
    {
        IF (SRC = SNAN) SET IE;
        Return QNAN(SRC);
    }
    // Check for +INF
    IF (Src = +INF) Return (Src);
    // Check for zero operand
    IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1)))
    {
        Return (-INF);
    }
    ELSE
    {
        IF ((Src.exp = 0) AND (Src.fraction != 0)) // denormal operand (notice that MXCSR.DAZ = 0)
        {
            TMP[31:0] := NormalizeExpTinySPFP(SRC[31:0]); // Get Normalized Exponent
            Set #DE;
        }
        ELSE // exponent value is correct
        {
            TMP[31:0] := (Src.sign << 31) OR (Src.exp << 23) OR (Src.fraction);
        }
        TMP := SAR(TMP, 23); // Shift Arithmetic Right
        TMP := TMP - 127; // Subtract Bias
        Return CvtI2S(TMP); // Convert INT to single precision floating-point number
    }
}

VGETEXPPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
    i := j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] := ConvertExpSPFP(SRC[31:0])
                ELSE DEST[i+31:i] := ConvertExpSPFP(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] := 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VGETEXPPS __m512 _mm512_getexp_ps( __m512 a);
VGETEXPPS __m512 _mm512_mask_getexp_ps(__m512 s, __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_maskz_getexp_ps( __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_getexp_round_ps( __m512 a, int sae);
VGETEXPPS __m512 _mm512_mask_getexp_round_ps(__m512 s, __mmask16 k, __m512 a, int sae);
VGETEXPPS __m512 _mm512_maskz_getexp_round_ps( __mmask16 k, __m512 a, int sae);
VGETEXPPS __m256 _mm256_getexp_ps(__m256 a);
VGETEXPPS __m256 _mm256_mask_getexp_ps(__m256 s, __mmask8 k, __m256 a);
VGETEXPPS __m256 _mm256_maskz_getexp_ps( __mmask8 k, __m256 a);
VGETEXPPS __m128 _mm_getexp_ps(__m128 a);
VGETEXPPS __m128 _mm_mask_getexp_ps(__m128 s, __mmask8 k, __m128 a);
VGETEXPPS __m128 _mm_maskz_getexp_ps( __mmask8 k, __m128 a);
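As a usage illustration, a minimal sketch (assuming an AVX512F-capable processor and compiler; not from the SDM):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512 v = _mm512_set1_ps(6.0f);      /* 6.0 = 1.5 * 2^2 */
        float out[16];
        _mm512_storeu_ps(out, _mm512_getexp_ps(v));
        printf("%f\n", out[0]);               /* prints 2.000000 */
        return 0;
    }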

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VGETEXPSD—Convert Exponents of Scalar Double Precision Floating-Point Value to Double Precision Floating-Point Value

EVEX.LLIG.66.0F38.W1 43 /r
VGETEXPSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Convert the biased exponent (bits 62:52) of the low double precision floating-point value in xmm3/m64 to a double precision floating-point value representing the unbiased integer exponent. Store the result in the low 64 bits of xmm1 under the writemask k1 and merge with the other elements of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Extracts the biased exponent from the normalized double precision floating-point representation of the low qword
data element of the source operand (the third operand) as an unbiased signed integer value, or converts the
denormal representation of input data to an unbiased negative integer value. The integer value of the unbiased
exponent is converted to a double precision floating-point value and written to the destination operand (the first
operand) as a double precision floating-point number. Bits (127:64) of the XMM register destination are copied from
the corresponding bits in the first source operand.
The destination must be an XMM register; the source operand can be an XMM register or a float64 memory location.
If writemasking is used, the low quadword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low quadword element of the destination
operand is unconditionally updated.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input values in
denormal representation). Special cases of input values are listed in Table 5-13.
The formula is:
GETEXP(x) = floor(log2(|x|))
The notation floor(x) stands for the maximal integer not exceeding the real number x.

Operation
// NormalizeExpTinyDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD

// ConvertExpDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD

VGETEXPSD (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[63:0] := ConvertExpDPFP(SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] := 0
        FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETEXPSD __m128d _mm_getexp_sd( __m128d a, __m128d b);
VGETEXPSD __m128d _mm_mask_getexp_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VGETEXPSD __m128d _mm_maskz_getexp_sd( __mmask8 k, __m128d a, __m128d b);
VGETEXPSD __m128d _mm_getexp_round_sd( __m128d a, __m128d b, int sae);
VGETEXPSD __m128d _mm_mask_getexp_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int sae);
VGETEXPSD __m128d _mm_maskz_getexp_round_sd( __mmask8 k, __m128d a, __m128d b, int sae);
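A minimal usage sketch (assuming AVX512F support); note that the upper element is merged from the first source, as described above:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128d a = _mm_set1_pd(99.0);        /* supplies the upper element */
        __m128d b = _mm_set1_pd(0.375);       /* 0.375 = 1.5 * 2^-2 */
        double out[2];
        _mm_storeu_pd(out, _mm_getexp_sd(a, b));
        printf("%f %f\n", out[0], out[1]);    /* prints -2.000000 99.000000 */
        return 0;
    }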

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VGETEXPSH—Convert Exponents of Scalar FP16 Values to FP16 Values
EVEX.LLIG.66.MAP6.W0 43 /r
VGETEXPSH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
    Convert the exponent of the FP16 value in the low word of the source operand to an FP16 result representing the unbiased integer exponent, and store the result in the low word of the destination register subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction extracts the biased exponent from the normalized FP16 representation of the low word element of
the source operand (the second operand) as an unbiased signed integer value, or converts the denormal
representation of input data to an unbiased negative integer value. The integer value of the unbiased exponent is
converted to an FP16 value and written to the low word element of the destination operand (the first operand) as
an FP16 number.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input values in
denormal representation). Special cases of input values are listed in Table 5-14.
The formula is:
GETEXP(x) = floor(log2(|x|))
The notation floor(x) stands for the maximal integer not exceeding the real number x.
Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP
operation and the GETMANT operation (see VGETMANTSH). Thus, the VGETEXPSH instruction does not require
software to handle SIMD floating-point exceptions.

Operation
VGETEXPSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
    DEST.fp16[0] := getexp_fp16(src2.fp16[0]) // see VGETEXPPH
ELSE IF *zeroing*:
    DEST.fp16[0] := 0
// else DEST.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VGETEXPSH __m128h _mm_getexp_round_sh (__m128h a, __m128h b, const int sae);
VGETEXPSH __m128h _mm_mask_getexp_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, const int sae);
VGETEXPSH __m128h _mm_maskz_getexp_round_sh (__mmask8 k, __m128h a, __m128h b, const int sae);
VGETEXPSH __m128h _mm_getexp_sh (__m128h a, __m128h b);
VGETEXPSH __m128h _mm_mask_getexp_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VGETEXPSH __m128h _mm_maskz_getexp_sh (__mmask8 k, __m128h a, __m128h b);
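A minimal usage sketch; this assumes a compiler and CPU with AVX512-FP16 support (e.g., the _Float16 type and an -mavx512fp16 style option on recent GCC/Clang), which is an assumption about the toolchain, not an SDM requirement:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128h a = _mm_set1_ph((_Float16)0.0f);  /* supplies bits 127:16 */
        __m128h b = _mm_set1_ph((_Float16)6.0f);  /* 6.0 = 1.5 * 2^2 */
        __m128h r = _mm_getexp_sh(a, b);
        printf("%f\n", (float)_mm_cvtsh_h(r));    /* prints 2.000000 */
        return 0;
    }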

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VGETEXPSS—Convert Exponents of Scalar Single Precision Floating-Point Value to Single Precision Floating-Point Value

EVEX.LLIG.66.0F38.W0 43 /r
VGETEXPSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Convert the biased exponent (bits 30:23) of the low single-precision floating-point value in xmm3/m32 to a single-precision floating-point value representing the unbiased integer exponent. Store the result in xmm1 under the writemask k1 and merge with the other elements of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Extracts the biased exponent from the normalized single precision floating-point representation of the low
doubleword data element of the source operand (the third operand) as an unbiased signed integer value, or converts
the denormal representation of input data to an unbiased negative integer value. The integer value of the unbiased
exponent is converted to a single precision floating-point value and written to the destination operand (the first
operand) as a single precision floating-point number. Bits (127:32) of the XMM register destination are copied from
the corresponding bits in the first source operand.
The destination must be an XMM register; the source operand can be an XMM register or a float32 memory location.
If writemasking is used, the low doubleword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low doubleword element of the destination
operand is unconditionally updated.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input values in
denormal representation). Special cases of input values are listed in Table 5-15.
The formula is:
GETEXP(x) = floor(log2(|x|))
The notation floor(x) stands for the maximal integer not exceeding the real number x.
Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP
operation and the GETMANT operation (see VGETMANTPD). Thus the VGETEXPxx instructions do not require software
to handle SIMD floating-point exceptions.

Operation
// NormalizeExpTinySPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS
// ConvertExpSPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS

VGETEXPSS (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[31:0] := ConvertExpSPFP(SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] := 0
        FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETEXPSS __m128 _mm_getexp_ss( __m128 a, __m128 b);
VGETEXPSS __m128 _mm_mask_getexp_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_maskz_getexp_ss( __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_getexp_round_ss( __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_mask_getexp_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_maskz_getexp_round_ss( __mmask8 k, __m128 a, __m128 b, int sae);
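A minimal sketch of one special case from Table 5-15 (GETEXP of a zero input returns -INF), assuming AVX512F support:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set1_ps(1.0f);
        __m128 b = _mm_setzero_ps();
        float out[4];
        _mm_storeu_ps(out, _mm_getexp_ss(a, b));
        printf("%f\n", out[0]);               /* prints -inf */
        return 0;
    }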

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VGETMANTPD—Extract Float64 Vector of Normalized Mantissas From Float64 Vector
EVEX.128.66.0F3A.W1 26 /r ib
VGETMANTPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
    Get the normalized mantissa from the float64 vector xmm2/m128/m64bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.256.66.0F3A.W1 26 /r ib
VGETMANTPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
    Get the normalized mantissa from the float64 vector ymm2/m256/m64bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.512.66.0F3A.W1 26 /r ib
VGETMANTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Get the normalized mantissa from the float64 vector zmm2/m512/m64bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Converts the double precision floating-point values in the source operand (the second operand) to double precision
floating-point values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15.
The converted results are written to the destination operand (the first operand) using writemask k1. The normalized
mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.
The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.

imm8[7:4]: Must be zero.
imm8[3:2], Sign Control (SC):
    00b: sign(SRC)
    01b: 0
    1xb: qNaN_Indefinite if sign(SRC) != 0, regardless of imm8[2]
imm8[1:0], Normalization Interval:
    00b: interval is [1, 2)
    01b: interval is [1/2, 2)
    10b: interval is [1/2, 1)
    11b: interval is [3/4, 3/2)

Figure 5-15. Imm8 Controls for VGETMANTPD/SD/PS/SS

For each input double precision floating-point value x, the conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 <= |x.significand| < 2.

Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
Each converted double precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their
previous values.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.

Table 5-16. GetMant() Special Float Values Behavior

Input      Result                                        Exceptions / Comments
NaN        QNaN(SRC)                                     Ignore interv. If (SRC = SNaN) then #IE.
+∞         1.0                                           Ignore interv.
+0         1.0                                           Ignore interv.
-0         IF (SC[0]) THEN +1.0 ELSE -1.0                Ignore interv.
-∞         IF (SC[1]) THEN {QNaN_Indefinite}             Ignore interv. If (SC[1]) then #IE.
           ELSE {IF (SC[0]) THEN +1.0 ELSE -1.0}
negative   SC[1] ? QNaN_Indefinite : GetMant(SRC)¹       If (SC[1]) then #IE.

NOTES:
1. In case SC[1] == 0, the sign of GetMant(SRC) is determined by SC[0].

Operation
def getmant_fp64(src, sign_control, normalization_interval):
    bias := 1023
    dst.sign := sign_control[0] ? 0 : src.sign
    signed_one := sign_control[0] ? +1.0 : -1.0
    dst.exp := src.exp
    dst.fraction := src.fraction

    zero := (dst.exp = 0) and ((dst.fraction = 0) or (MXCSR.DAZ = 1))
    denormal := (dst.exp = 0) and (dst.fraction != 0) and (MXCSR.DAZ = 0)
    infinity := (dst.exp = 0x7FF) and (dst.fraction = 0)
    nan := (dst.exp = 0x7FF) and (dst.fraction != 0)
    src_signaling := src.fraction[51]
    snan := nan and (src_signaling = 0)
    positive := (src.sign = 0)
    negative := (src.sign = 1)

    if nan:
        if snan:
            MXCSR.IE := 1
        return qnan(src)
    if positive and (zero or infinity):
        return 1.0
    if negative:
        if zero:
            return signed_one
        if infinity:
            if sign_control[1]:
                MXCSR.IE := 1
                return QNaN_Indefinite
            return signed_one
        if sign_control[1]:
            MXCSR.IE := 1
            return QNaN_Indefinite

    if denormal:
        jbit := 0
        dst.exp := bias
        while jbit = 0:
            jbit := dst.fraction[51]
            dst.fraction := dst.fraction << 1
            dst.exp := dst.exp - 1
        MXCSR.DE := 1

    unbiased_exp := dst.exp - bias
    odd_exp := unbiased_exp[0]
    signaling_bit := dst.fraction[51]
    if normalization_interval = 0b00:
        dst.exp := bias
    else if normalization_interval = 0b01:
        dst.exp := odd_exp ? bias-1 : bias
    else if normalization_interval = 0b10:
        dst.exp := bias-1
    else if normalization_interval = 0b11:
        dst.exp := signaling_bit ? bias-1 : bias
    return dst

VGETMANTPD (EVEX Encoded Versions)
VGETMANTPD dest{k1}, src, imm8
VL = 128, 256, or 512
KL := VL / 64
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        IF SRC is memory and (EVEX.b = 1):
            tsrc := src.double[0]
        ELSE:
            tsrc := src.double[i]
        DEST.double[i] := getmant_fp64(tsrc, sign_control, normalization_interval)
    ELSE IF *zeroing*:
        DEST.double[i] := 0
    // else DEST.double[i] remains unchanged

DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETMANTPD __m512d _mm512_getmant_pd( __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_mask_getmant_pd(__m512d s, __mmask8 k, __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_maskz_getmant_pd( __mmask8 k, __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_getmant_round_pd( __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m512d _mm512_mask_getmant_round_pd(__m512d s, __mmask8 k, __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m512d _mm512_maskz_getmant_round_pd( __mmask8 k, __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m256d _mm256_getmant_pd( __m256d a, enum intv, enum sgn);
VGETMANTPD __m256d _mm256_mask_getmant_pd(__m256d s, __mmask8 k, __m256d a, enum intv, enum sgn);
VGETMANTPD __m256d _mm256_maskz_getmant_pd( __mmask8 k, __m256d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_getmant_pd( __m128d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_mask_getmant_pd(__m128d s, __mmask8 k, __m128d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_maskz_getmant_pd( __mmask8 k, __m128d a, enum intv, enum sgn);
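The usage pattern mentioned above (combining GETMANT with GETEXP) decomposes a finite nonzero value as x = mant * 2^exp. A minimal sketch, assuming AVX512F support:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512d x = _mm512_set1_pd(-24.0);    /* -24 = -1.5 * 2^4 */
        __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
        __m512d e = _mm512_getexp_pd(x);
        double mm[8], ee[8];
        _mm512_storeu_pd(mm, m);
        _mm512_storeu_pd(ee, e);
        printf("%f * 2^%f\n", mm[0], ee[0]);  /* -1.500000 * 2^4.000000 */
        return 0;
    }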

SIMD Floating-Point Exceptions


Denormal, Invalid.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VGETMANTPH—Extract FP16 Vector of Normalized Mantissas from FP16 Vector
EVEX.128.NP.0F3A.W0 26 /r /ib
VGETMANTPH xmm1{k1}{z}, xmm2/m128/m16bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
    Get the normalized mantissa from the FP16 vector xmm2/m128/m16bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.

EVEX.256.NP.0F3A.W0 26 /r /ib
VGETMANTPH ymm1{k1}{z}, ymm2/m256/m16bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
    Get the normalized mantissa from the FP16 vector ymm2/m256/m16bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.

EVEX.512.NP.0F3A.W0 26 /r /ib
VGETMANTPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
    Get the normalized mantissa from the FP16 vector zmm2/m512/m16bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 (r) N/A

Description
This instruction converts the FP16 values in the source operand (the second operand) to FP16 values with the
mantissa normalization and sign control specified by the imm8 byte, see Table 5-17. The converted results are
written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by
interv (imm8[1:0]) and the sign control (SC) is specified by bits 3:2 of the immediate byte.
The destination elements are updated according to the writemask.

Table 5-17. imm8 Controls for VGETMANTPH/VGETMANTSH

imm8 Bits    Definition
imm8[7:4]    Must be zero.
imm8[3:2]    Sign Control (SC)
             0b00: Sign(SRC)
             0b01: 0
             0b1x: QNaN_Indefinite if sign(SRC) != 0
imm8[1:0]    Interv
             0b00: Interval is [1, 2)
             0b01: Interval is [1/2, 2)
             0b10: Interval is [1/2, 1)
             0b11: Interval is [3/4, 3/2)

For each input FP16 value x, the conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 ≤ |x.significand| < 2.

Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is
even or odd. The sign of the final result is determined by the sign control and the source sign and the leading frac-
tion bit.
The encoded value of imm8[1:0] and sign control are shown in Table 5-17.
Each converted FP16 result is encoded according to the sign control, the unbiased exponent k (adding bias) and a
mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-18 when dealing with floating-point special numbers.

Table 5-18. GetMant() Special Float Values Behavior

Input      Result                                        Exceptions / Comments
NaN        QNaN(SRC)                                     Ignore interv. If (SRC = SNaN), then #IE.
+∞         1.0                                           Ignore interv.
+0         1.0                                           Ignore interv.
-0         IF (SC[0]) THEN +1.0 ELSE -1.0                Ignore interv.
-∞         IF (SC[1]) THEN {QNaN_Indefinite}             Ignore interv. If (SC[1]), then #IE.
           ELSE {IF (SC[0]) THEN +1.0 ELSE -1.0}
negative   SC[1] ? QNaN_Indefinite : GetMant(SRC)¹       If (SC[1]), then #IE.

NOTES:
1. In case SC[1] == 0, the sign of GetMant(SRC) is determined by SC[0].

Operation
def getmant_fp16(src, sign_control, normalization_interval):
    bias := 15
    dst.sign := sign_control[0] ? 0 : src.sign
    signed_one := sign_control[0] ? +1.0 : -1.0
    dst.exp := src.exp
    dst.fraction := src.fraction

    zero := (dst.exp = 0) and (dst.fraction = 0)
    denormal := (dst.exp = 0) and (dst.fraction != 0)
    infinity := (dst.exp = 0x1F) and (dst.fraction = 0)
    nan := (dst.exp = 0x1F) and (dst.fraction != 0)
    src_signaling := src.fraction[9]
    snan := nan and (src_signaling = 0)
    positive := (src.sign = 0)
    negative := (src.sign = 1)

    if nan:
        if snan:
            MXCSR.IE := 1
        return qnan(src)
    if positive and (zero or infinity):
        return 1.0
    if negative:
        if zero:
            return signed_one
        if infinity:
            if sign_control[1]:
                MXCSR.IE := 1
                return QNaN_Indefinite
            return signed_one
        if sign_control[1]:
            MXCSR.IE := 1
            return QNaN_Indefinite

    if denormal:
        jbit := 0
        dst.exp := bias // set exponent to bias value
        while jbit = 0:
            jbit := dst.fraction[9]
            dst.fraction := dst.fraction << 1
            dst.exp := dst.exp - 1
        MXCSR.DE := 1

    unbiased_exp := dst.exp - bias
    odd_exp := unbiased_exp[0]
    signaling_bit := dst.fraction[9]
    if normalization_interval = 0b00:
        dst.exp := bias
    else if normalization_interval = 0b01:
        dst.exp := odd_exp ? bias-1 : bias
    else if normalization_interval = 0b10:
        dst.exp := bias-1
    else if normalization_interval = 0b11:
        dst.exp := signaling_bit ? bias-1 : bias
    return dst
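The final exponent-selection step above is small enough to restate as a self-contained C helper. A minimal sketch (the function name pick_exp is hypothetical, not SDM pseudocode):

    #include <stdio.h>

    /* Result exponent per normalization interval, as in the pseudocode:
       00b -> bias; 01b -> bias-1 on odd exponents; 10b -> bias-1;
       11b -> bias-1 when the leading fraction bit is set. */
    static int pick_exp(int interval, int odd_exp, int signaling_bit, int bias) {
        switch (interval) {
        case 0:  return bias;                             /* [1, 2)     */
        case 1:  return odd_exp ? bias - 1 : bias;        /* [1/2, 2)   */
        case 2:  return bias - 1;                         /* [1/2, 1)   */
        default: return signaling_bit ? bias - 1 : bias;  /* [3/4, 3/2) */
        }
    }

    int main(void) {
        /* FP16 24.0 = 1.5 * 2^4: even exponent, leading fraction bit set.
           Interval 11b gives unbiased exponent -1, so mant = 1.5 * 2^-1
           = 0.75, which lies in [3/4, 3/2). */
        printf("%d\n", pick_exp(3, 0, 1, 15) - 15);  /* prints -1 */
        return 0;
    }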

VGETMANTPH dest{k1}, src, imm8
VL = 128, 256 or 512
KL := VL/16

sign_control := imm8[3:2]
normalization_interval := imm8[1:0]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        IF SRC is memory and (EVEX.b = 1):
            tsrc := src.fp16[0]
        ELSE:
            tsrc := src.fp16[i]
        DEST.fp16[i] := getmant_fp16(tsrc, sign_control, normalization_interval)
    ELSE IF *zeroing*:
        DEST.fp16[i] := 0
    // else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VGETMANTPH __m128h _mm_getmant_ph (__m128h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m128h _mm_mask_getmant_ph (__m128h src, __mmask8 k, __m128h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m128h _mm_maskz_getmant_ph (__mmask8 k, __m128h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_getmant_ph (__m256h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_mask_getmant_ph (__m256h src, __mmask16 k, __m256h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_maskz_getmant_ph (__mmask16 k, __m256h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_getmant_ph (__m512h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_mask_getmant_ph (__m512h src, __mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_maskz_getmant_ph (__mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_getmant_round_ph (__m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTPH __m512h _mm512_mask_getmant_round_ph (__m512h src, __mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM
norm, _MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTPH __m512h _mm512_maskz_getmant_round_ph (__mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”

VGETMANTPS—Extract Float32 Vector of Normalized Mantissas From Float32 Vector
EVEX.128.66.0F3A.W0 26 /r ib
VGETMANTPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
    Get the normalized mantissa from the float32 vector xmm2/m128/m32bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.256.66.0F3A.W0 26 /r ib
VGETMANTPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
    Get the normalized mantissa from the float32 vector ymm2/m256/m32bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.512.66.0F3A.W0 26 /r ib
VGETMANTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Get the normalized mantissa from the float32 vector zmm2/m512/m32bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Converts the single precision floating-point values in the source operand (the second operand) to single precision
floating-point values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15.
The converted results are written to the destination operand (the first operand) using writemask k1. The normalized
mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.
The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-
bit memory location.
For each input single precision floating-point value x, the conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 <= |x.significand| < 2.

Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
Each converted single precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their
previous values.
Note: EVEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.

Operation
def getmant_fp32(src, sign_control, normalization_interval):
    bias := 127
    dst.sign := sign_control[0] ? 0 : src.sign
    signed_one := sign_control[0] ? +1.0 : -1.0
    dst.exp := src.exp
    dst.fraction := src.fraction

    zero := (dst.exp = 0) and ((dst.fraction = 0) or (MXCSR.DAZ = 1))
    denormal := (dst.exp = 0) and (dst.fraction != 0) and (MXCSR.DAZ = 0)
    infinity := (dst.exp = 0xFF) and (dst.fraction = 0)
    nan := (dst.exp = 0xFF) and (dst.fraction != 0)
    src_signaling := src.fraction[22]
    snan := nan and (src_signaling = 0)
    positive := (src.sign = 0)
    negative := (src.sign = 1)

    if nan:
        if snan:
            MXCSR.IE := 1
        return qnan(src)
    if positive and (zero or infinity):
        return 1.0
    if negative:
        if zero:
            return signed_one
        if infinity:
            if sign_control[1]:
                MXCSR.IE := 1
                return QNaN_Indefinite
            return signed_one
        if sign_control[1]:
            MXCSR.IE := 1
            return QNaN_Indefinite

    if denormal:
        jbit := 0
        dst.exp := bias
        while jbit = 0:
            jbit := dst.fraction[22]
            dst.fraction := dst.fraction << 1
            dst.exp := dst.exp - 1
        MXCSR.DE := 1

    unbiased_exp := dst.exp - bias
    odd_exp := unbiased_exp[0]
    signaling_bit := dst.fraction[22]
    if normalization_interval = 0b00:
        dst.exp := bias
    else if normalization_interval = 0b01:
        dst.exp := odd_exp ? bias-1 : bias
    else if normalization_interval = 0b10:
        dst.exp := bias-1
    else if normalization_interval = 0b11:
        dst.exp := signaling_bit ? bias-1 : bias
    return dst

VGETMANTPS (EVEX encoded versions)
VGETMANTPS dest{k1}, src, imm8
VL = 128, 256, or 512
KL := VL / 32
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]

FOR i := 0 to KL-1:
    IF k1[i] or *no writemask*:
        IF SRC is memory and (EVEX.b = 1):
            tsrc := src.float[0]
        ELSE:
            tsrc := src.float[i]
        DEST.float[i] := getmant_fp32(tsrc, sign_control, normalization_interval)
    ELSE IF *zeroing*:
        DEST.float[i] := 0
    // else DEST.float[i] remains unchanged

DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETMANTPS __m512 _mm512_getmant_ps( __m512 a, enum intv, enum sgn);
VGETMANTPS __m512 _mm512_mask_getmant_ps(__m512 s, __mmask16 k, __m512 a, enum intv, enum sgn;
VGETMANTPS __m512 _mm512_maskz_getmant_ps(__mmask16 k, __m512 a, enum intv, enum sgn);
VGETMANTPS __m512 _mm512_getmant_round_ps( __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m512 _mm512_mask_getmant_round_ps(__m512 s, __mmask16 k, __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m512 _mm512_maskz_getmant_round_ps(__mmask16 k, __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m256 _mm256_getmant_ps( __m256 a, enum intv, enum sgn);
VGETMANTPS __m256 _mm256_mask_getmant_ps(__m256 s, __mmask8 k, __m256 a, enum intv, enum sgn);
VGETMANTPS __m256 _mm256_maskz_getmant_ps( __mmask8 k, __m256 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_getmant_ps( __m128 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_mask_getmant_ps(__m128 s, __mmask8 k, __m128 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_maskz_getmant_ps( __mmask8 k, __m128 a, enum intv, enum sgn);
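A minimal usage sketch of the interval control (interv = 10b forces the mantissa into [1/2, 1)), assuming AVX512F support:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512 x = _mm512_set1_ps(24.0f);     /* 24 = 0.75 * 2^5 */
        float out[16];
        _mm512_storeu_ps(out,
            _mm512_getmant_ps(x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_zero));
        printf("%f\n", out[0]);               /* prints 0.750000 */
        return 0;
    }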

SIMD Floating-Point Exceptions


Denormal, Invalid.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VGETMANTSD—Extract Float64 of Normalized Mantissa From Float64 Scalar
EVEX.LLIG.66.0F3A.W1 27 /r ib
VGETMANTSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Extract the normalized mantissa of the low float64 element in xmm3/m64 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts the double precision floating-point value in the low quadword element of the second source operand (the
third operand) to a double precision floating-point value with the mantissa normalization and sign control specified
by the imm8 byte, see Figure 5-15. The converted result is written to the low quadword element of the destination
operand (the first operand) using writemask k1. Bits (127:64) of the XMM register destination are copied from the
corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the
sign control (sc) is specified by bits 3:2 of the immediate byte.
The conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 <= |x.significand| < 2.

Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
The converted double precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
If writemasking is used, the low quadword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low quadword element of the destination
operand is unconditionally updated.



Operation
// getmant_fp64(src, sign_control, normalization_interval) is defined in the Operation section of VGETMANTPD

VGETMANTSD (EVEX encoded version)
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]
IF k1[0] OR *no writemask*
    THEN DEST[63:0] := getmant_fp64(SRC2[63:0], sign_control, normalization_interval)
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] := 0
        FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETMANTSD __m128d _mm_getmant_sd( __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_mask_getmant_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_maskz_getmant_sd( __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_getmant_round_sd( __m128d a, __m128d b, enum intv, enum sgn, int r);
VGETMANTSD __m128d _mm_mask_getmant_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r);
VGETMANTSD __m128d _mm_maskz_getmant_round_sd( __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r);

SIMD Floating-Point Exceptions


Denormal, Invalid

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VGETMANTSH—Extract FP16 of Normalized Mantissa from FP16 Scalar
EVEX.LLIG.NP.0F3A.W0 27 /r /ib
VGETMANTSH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
    Extract the normalized mantissa of the low FP16 element in xmm3/m16 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 subject to writemask k1 and merge with the other elements of xmm2. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
This instruction converts the FP16 value in the low element of the second source operand to FP16 values with the
mantissa normalization and sign control specified by the imm8 byte, see Table 5-17. The converted result is written
to the low element of the destination operand using writemask k1. The normalized mantissa is specified by interv
(imm8[1:0]) and the sign control (SC) is specified by bits 3:2 of the immediate byte.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
For each input FP16 value x, the conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 ≤ |x.significand| < 2.
Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is
even or odd. The sign of the final result is determined by the sign control and the source sign and the leading frac-
tion bit.
The encoded value of imm8[1:0] and sign control are shown in Table 5-17.
Each converted FP16 result is encoded according to the sign control, the unbiased exponent k (adding bias) and a
mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-18 when dealing with floating-point special numbers.



Operation
VGETMANTSH dest{k1}, src1, src2, imm8
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]

IF k1[0] or *no writemask*:
    dest.fp16[0] := getmant_fp16(src2.fp16[0], sign_control, normalization_interval) // see VGETMANTPH
ELSE IF *zeroing*:
    dest.fp16[0] := 0
// else dest.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETMANTSH __m128h _mm_getmant_round_sh (__m128h a, __m128h b, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTSH __m128h _mm_mask_getmant_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b,
_MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTSH __m128h _mm_maskz_getmant_round_sh (__mmask8 k, __m128h a, __m128h b, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTSH __m128h _mm_getmant_sh (__m128h a, __m128h b, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTSH __m128h _mm_mask_getmant_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, _MM_MANTISSA_NORM_ENUM
norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTSH __m128h _mm_maskz_getmant_sh (__mmask8 k, __m128h a, __m128h b, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VGETMANTSS—Extract Float32 Vector of Normalized Mantissa From Float32 Scalar
EVEX.LLIG.66.0F3A.W0 27 /r ib
VGETMANTSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Extract the normalized mantissa from the low float32 element of xmm3/m32 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Converts the single precision floating-point value in the low doubleword element of the second source operand (the
third operand) to a single precision floating-point value with the mantissa normalization and sign control specified
by the imm8 byte, see Figure 5-15. The converted result is written to the low doubleword element of the destination
operand (the first operand) using writemask k1. Bits (127:32) of the XMM register destination are copied from the
corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the
sign control (sc) is specified by bits 3:2 of the immediate byte.
The conversion operation is:

GetMant(x) = ±2^k * |x.significand|

where 1 <= |x.significand| < 2.

Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
The converted single precision floating-point result is encoded according to the sign control, the unbiased exponent
k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
If writemasking is used, the low doubleword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low doubleword element of the destination
operand is unconditionally updated.

Operation
// getmant_fp32(src, sign_control, normalization_interval) is defined in the Operation section of VGETMANTPS

VGETMANTSS (EVEX encoded version)
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]
IF k1[0] OR *no writemask*
    THEN DEST[31:0] := getmant_fp32(SRC2[31:0], sign_control, normalization_interval)
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] := 0
        FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VGETMANTSS __m128 _mm_getmant_ss( __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_mask_getmant_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_maskz_getmant_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_getmant_round_ss( __m128 a, __m128 b, enum intv, enum sgn, int r);
VGETMANTSS __m128 _mm_mask_getmant_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r);
VGETMANTSS __m128 _mm_maskz_getmant_round_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r);
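A minimal sketch of the sign control (sc = 01b clears the sign of the result), assuming AVX512F support:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set1_ps(0.0f);
        __m128 b = _mm_set1_ps(-24.0f);       /* -24 = -1.5 * 2^4 */
        __m128 r = _mm_getmant_ss(a, b, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
        printf("%f\n", _mm_cvtss_f32(r));     /* prints 1.500000 */
        return 0;
    }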

SIMD Floating-Point Exceptions


Denormal, Invalid

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values

VEX.256.66.0F3A.W0 18 /r ib
VINSERTF128 ymm1, ymm2, xmm3/m128, imm8
    Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX.
    Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1.

EVEX.256.66.0F3A.W0 18 /r ib
VINSERTF32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
    Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
    Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.512.66.0F3A.W0 18 /r ib
VINSERTF32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
    Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.256.66.0F3A.W1 18 /r ib
VINSERTF64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
    Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1¹.
    Insert 128 bits of packed double precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.512.66.0F3A.W1 18 /r ib
VINSERTF64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
    Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
    Insert 128 bits of packed double precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.512.66.0F3A.W0 1A /r ib
VINSERTF32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
    Op/En: D. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
    Insert 256 bits of packed single-precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.512.66.0F3A.W1 1A /r ib
VINSERTF64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
    Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
    Insert 256 bits of packed double precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
B Tuple2 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple4 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8
D Tuple8 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
VINSERTF128/VINSERTF32x4 and VINSERTF64x2 insert 128 bits of packed floating-point values from the second
source operand (the third operand) into the destination operand (the first operand) at a 128-bit granular offset
specified by imm8[0] (for 256-bit destinations) or imm8[1:0] (for 512-bit destinations). The remaining portions of
the destination operand are copied from the corresponding fields of the first source operand (the second operand).
The second source operand can be either an XMM register or a 128-bit memory location. The destination and first
source operands are vector registers.
VINSERTF32x4: The destination operand is a ZMM/YMM register and updated at 32-bit granularity according to the
writemask. The high 6/7 bits of the immediate are ignored.
VINSERTF64x2: The destination operand is a ZMM/YMM register and updated at 64-bit granularity according to the
writemask. The high 6/7 bits of the immediate are ignored.
VINSERTF32x8 and VINSERTF64x4 insert 256 bits of packed floating-point values from the second source operand
(the third operand) into the destination operand (the first operand) at a 256-bit granular offset specified by
imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source
operand (the second operand). The second source operand can be either a YMM register or a 256-bit memory
location. The high 7 bits of the immediate are ignored. The destination operand is a ZMM register and updated at
32/64-bit granularity according to the writemask.

Operation
VINSERTF32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] := SRC2[127:0]
        1: TMP_DEST[255:128] := SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] := SRC2[127:0]
        01: TMP_DEST[255:128] := SRC2[127:0]
        10: TMP_DEST[383:256] := SRC2[127:0]
        11: TMP_DEST[511:384] := SRC2[127:0]
    ESAC.
FI;
FOR j := 0 TO KL-1
    i := j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] := TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] := 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VINSERTF64x2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] := SRC2[127:0]
        1: TMP_DEST[255:128] := SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] := SRC2[127:0]
        01: TMP_DEST[255:128] := SRC2[127:0]
        10: TMP_DEST[383:256] := SRC2[127:0]
        11: TMP_DEST[511:384] := SRC2[127:0]
    ESAC.
FI;
FOR j := 0 TO KL-1
    i := j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] := TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] := 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VINSERTF32x8 (EVEX.U1.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC2[255:0]
1: TMP_DEST[511:256] := SRC2[255:0]
ESAC.

FOR j := 0 TO 15
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VINSERTF64x4 (EVEX.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC2[255:0]
1: TMP_DEST[511:256] := SRC2[255:0]
ESAC.

FOR j := 0 TO 7
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VINSERTF128 (VEX encoded version)


TEMP[255:0] := SRC1[255:0]
CASE (imm8[0]) OF
0: TEMP[127:0] := SRC2[127:0]
1: TEMP[255:128] := SRC2[127:0]
ESAC
DEST := TEMP

Intel C/C++ Compiler Intrinsic Equivalent


VINSERTF32x4 __m512 _mm512_insertf32x4( __m512 a, __m128 b, int imm);
VINSERTF32x4 __m512 _mm512_mask_insertf32x4(__m512 s, __mmask16 k, __m512 a, __m128 b, int imm);
VINSERTF32x4 __m512 _mm512_maskz_insertf32x4( __mmask16 k, __m512 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_insertf32x4( __m256 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_mask_insertf32x4(__m256 s, __mmask8 k, __m256 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_maskz_insertf32x4( __mmask8 k, __m256 a, __m128 b, int imm);
VINSERTF32x8 __m512 _mm512_insertf32x8( __m512 a, __m256 b, int imm);
VINSERTF32x8 __m512 _mm512_mask_insertf32x8(__m512 s, __mmask16 k, __m512 a, __m256 b, int imm);
VINSERTF32x8 __m512 _mm512_maskz_insertf32x8( __mmask16 k, __m512 a, __m256 b, int imm);
VINSERTF64x2 __m512d _mm512_insertf64x2( __m512d a, __m128d b, int imm);
VINSERTF64x2 __m512d _mm512_mask_insertf64x2(__m512d s, __mmask8 k, __m512d a, __m128d b, int imm);
VINSERTF64x2 __m512d _mm512_maskz_insertf64x2( __mmask8 k, __m512d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_insertf64x2( __m256d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_mask_insertf64x2(__m256d s, __mmask8 k, __m256d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_maskz_insertf64x2( __mmask8 k, __m256d a, __m128d b, int imm);
VINSERTF64x4 __m512d _mm512_insertf64x4( __m512d a, __m256d b, int imm);
VINSERTF64x4 __m512d _mm512_mask_insertf64x4(__m512d s, __mmask8 k, __m512d a, __m256d b, int imm);
VINSERTF64x4 __m512d _mm512_maskz_insertf64x4( __mmask8 k, __m512d a, __m256d b, int imm);
VINSERTF128 __m256 _mm256_insertf128_ps (__m256 a, __m128 b, int offset);
VINSERTF128 __m256d _mm256_insertf128_pd (__m256d a, __m128d b, int offset);
VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);



SIMD Floating-Point Exceptions
None

Other Exceptions
VEX-encoded instruction, see Table 2-23, “Type 6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
EVEX-encoded instruction, see Table 2-56, “Type E6NF Class Exception Conditions.”



VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed
Integer Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
VEX.256.66.0F3A.W0 38 /r ib A V/V AVX2 Insert 128 bits of integer data from xmm3/m128
VINSERTI128 ymm1, ymm2, and the remaining values from ymm2 into ymm1.
xmm3/m128, imm8
EVEX.256.66.0F3A.W0 38 /r ib C V/V (AVX512VL AND Insert 128 bits of packed doubleword integer
VINSERTI32X4 ymm1 {k1}{z}, ymm2, AVX512F) OR values from xmm3/m128 and the remaining
xmm3/m128, imm8 AVX10.11 values from ymm2 into ymm1 under writemask
k1.
EVEX.512.66.0F3A.W0 38 /r ib C V/V AVX512F Insert 128 bits of packed doubleword integer
VINSERTI32X4 zmm1 {k1}{z}, zmm2, OR AVX10.11 values from xmm3/m128 and the remaining
xmm3/m128, imm8 values from zmm2 into zmm1 under writemask
k1.
EVEX.256.66.0F3A.W1 38 /r ib B V/V (AVX512VL AND Insert 128 bits of packed quadword integer
VINSERTI64X2 ymm1 {k1}{z}, ymm2, AVX512DQ) OR values from xmm3/m128 and the remaining
xmm3/m128, imm8 AVX10.11 values from ymm2 into ymm1 under writemask
k1.
EVEX.512.66.0F3A.W1 38 /r ib B V/V AVX512DQ OR Insert 128 bits of packed quadword integer
VINSERTI64X2 zmm1 {k1}{z}, zmm2, AVX10.11 values from xmm3/m128 and the remaining
xmm3/m128, imm8 values from zmm2 into zmm1 under writemask
k1.
EVEX.512.66.0F3A.W0 3A /r ib D V/V AVX512DQ OR Insert 256 bits of packed doubleword integer
VINSERTI32X8 zmm1 {k1}{z}, zmm2, AVX10.11 values from ymm3/m256 and the remaining
ymm3/m256, imm8 values from zmm2 into zmm1 under writemask
k1.
EVEX.512.66.0F3A.W1 3A /r ib C V/V AVX512F Insert 256 bits of packed quadword integer
VINSERTI64X4 zmm1 {k1}{z}, zmm2, OR AVX10.11 values from ymm3/m256 and the remaining
ymm3/m256, imm8 values from zmm2 into zmm1 under writemask
k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8
B Tuple2 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8
C Tuple4 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8
D Tuple8 ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
VINSERTI32x4 and VINSERTI64x2 insert 128 bits of packed integer values from the second source operand (the
third operand) into the destination operand (the first operand) at the 128-bit granular offset selected by imm8[0]
(256-bit destination) or imm8[1:0] (512-bit destination). The remaining portions of the destination are copied from
the corresponding fields of the first source operand (the second operand). The second source operand can be either
an XMM register or a 128-bit memory location. The high 6/7 bits of the immediate are ignored. The destination
operand is a ZMM/YMM register and is updated at 32-bit or 64-bit granularity, respectively, according to the writemask.



VINSERTI32x8 and VINSERTI64x4 insert 256 bits of packed integer values from the second source operand (the
third operand) into the destination operand (the first operand) at the 256-bit granular offset selected by imm8[0].
The remaining portions of the destination are copied from the corresponding fields of the first source operand (the
second operand). The second source operand can be either a YMM register or a 256-bit memory location. The
high 7 bits of the immediate are ignored. The destination operand is a ZMM register and is updated at 32-bit or
64-bit granularity, respectively, according to the writemask.
VINSERTI128 inserts 128 bits of packed integer data from the second source operand (the third operand) into the
destination operand (the first operand) at the 128-bit offset selected by imm8[0]. The remaining portions of the
destination are copied from the corresponding fields of the first source operand (the second operand). The second
source operand can be either an XMM register or a 128-bit memory location. The high 7 bits of the immediate are
ignored. VEX.L must be 1; attempting to execute this instruction with VEX.L = 0 causes #UD.
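
As an illustrative, non-normative C sketch (assuming <immintrin.h> and AVX-512F; the function name is ours), VINSERTI32x4 can replace a single 128-bit lane of a 512-bit vector, with imm8[1:0] selecting the lane at VL = 512:

#include <immintrin.h>

/* Overwrite lane 2 (bits 383:256) of v with x; the other three lanes are
   copied unchanged from v. */
__m512i replace_lane2(__m512i v, __m128i x)
{
    return _mm512_inserti32x4(v, x, 2);
}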

Operation
VINSERTI32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC2[127:0]
1: TMP_DEST[255:128] := SRC2[127:0]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC2[127:0]
01: TMP_DEST[255:128] := SRC2[127:0]
10: TMP_DEST[383:256] := SRC2[127:0]
11: TMP_DEST[511:384] := SRC2[127:0]
ESAC.
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VINSERTI64x2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC2[127:0]
1: TMP_DEST[255:128] := SRC2[127:0]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC2[127:0]
01: TMP_DEST[255:128] := SRC2[127:0]
10: TMP_DEST[383:256] := SRC2[127:0]
11: TMP_DEST[511:384] := SRC2[127:0]
ESAC.
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VINSERTI32x8 (EVEX.U1.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC2[255:0]
1: TMP_DEST[511:256] := SRC2[255:0]
ESAC.

FOR j := 0 TO 15
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VINSERTI64x4 (EVEX.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
CASE (imm8[0]) OF
0: TMP_DEST[255:0] := SRC2[255:0]
1: TMP_DEST[511:256] := SRC2[255:0]
ESAC.

FOR j := 0 TO 7
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VINSERTI128 (VEX encoded version)
TEMP[255:0] := SRC1[255:0]
CASE (imm8[0]) OF
0: TEMP[127:0] := SRC2[127:0]
1: TEMP[255:128] := SRC2[127:0]
ESAC
DEST := TEMP

Intel C/C++ Compiler Intrinsic Equivalent


VINSERTI32x4 __m512i _mm512_inserti32x4( __m512i a, __m128i b, int imm);
VINSERTI32x4 __m512i _mm512_mask_inserti32x4(__m512i s, __mmask16 k, __m512i a, __m128i b, int imm);
VINSERTI32x4 __m512i _mm512_maskz_inserti32x4( __mmask16 k, __m512i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_inserti32x4( __m256i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_mask_inserti32x4(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_maskz_inserti32x4( __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI32x8 __m512i _mm512_inserti32x8( __m512i a, __m256i b, int imm);
VINSERTI32x8 __m512i _mm512_mask_inserti32x8(__m512i s, __mmask16 k, __m512i a, __m256i b, int imm);
VINSERTI32x8 __m512i _mm512_maskz_inserti32x8( __mmask16 k, __m512i a, __m256i b, int imm);
VINSERTI64x2 __m512i _mm512_inserti64x2( __m512i a, __m128i b, int imm);
VINSERTI64x2 __m512i _mm512_mask_inserti64x2(__m512i s, __mmask8 k, __m512i a, __m128i b, int imm);
VINSERTI64x2 __m512i _mm512_maskz_inserti64x2( __mmask8 k, __m512i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_inserti64x2( __m256i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_mask_inserti64x2(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_maskz_inserti64x2( __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI64x4 __m512i _mm512_inserti64x4( __m512i a, __m256i b, int imm);
VINSERTI64x4 __m512i _mm512_mask_inserti64x4(__m512i s, __mmask8 k, __m512i a, __m256i b, int imm);
VINSERTI64x4 __m512i _mm512_maskz_inserti64x4( __mmask8 k, __m512i a, __m256i b, int imm);
VINSERTI128 __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, int offset);

SIMD Floating-Point Exceptions


None.



Other Exceptions
VEX-encoded instruction, see Table 2-23, “Type 6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
EVEX-encoded instruction, see Table 2-56, “Type E6NF Class Exception Conditions.”



VMAXPH—Return Maximum of Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 5F /r A V/V (AVX512-FP16 Return the maximum packed FP16 values
VMAXPH xmm1{k1}{z}, xmm2, AND AVX512VL) between xmm2 and xmm3/m128/m16bcst and
xmm3/m128/m16bcst OR AVX10.11 store the result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 5F /r A V/V (AVX512-FP16 Return the maximum packed FP16 values
VMAXPH ymm1{k1}{z}, ymm2, AND AVX512VL) between ymm2 and ymm3/m256/m16bcst and
ymm3/m256/m16bcst OR AVX10.11 store the result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 5F /r A V/V AVX512-FP16 Return the maximum packed FP16 values
VMAXPH zmm1{k1}{z}, zmm2, OR AVX10.11 between zmm2 and zmm3/m512/m16bcst and
zmm3/m512/m16bcst {sae} store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD compare of the packed FP16 values in the first source operand and the second
source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), whether a NaN
or a valid floating-point value, is written to the result. If it is instead required that the NaN source operand (from
either the first or second operand) be returned, the action of VMAXPH can be emulated using a sequence of
instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcast from a 16-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
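
As an illustrative, non-normative sketch of the emulation mentioned above (assuming <immintrin.h> and AVX512-FP16), a max that returns the NaN source operand can be built from a mask compare plus a blend, rather than the legacy AND/ANDN/OR sequence:

#include <immintrin.h>

/* Max that propagates a NaN found in the first operand; VMAXPH alone
   returns the second operand whenever either input is a NaN. */
__m512h max_propagate_nan(__m512h a, __m512h b)
{
    __m512h m = _mm512_max_ph(a, b);  /* NaN in a selects b */
    __mmask32 a_is_nan = _mm512_cmp_ph_mask(a, a, _CMP_UNORD_Q);
    return _mm512_mask_blend_ph(a_is_nan, m, a);  /* restore a's NaNs */
}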

Operation
def MAX(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 > SRC2):
DEST := SRC1
ELSE:
DEST := SRC2



VMAXPH dest, src1, src2
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.fp16[j] := MAX(SRC1.fp16[j], tsrc2)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMAXPH __m128h _mm_mask_max_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMAXPH __m128h _mm_maskz_max_ph (__mmask8 k, __m128h a, __m128h b);
VMAXPH __m128h _mm_max_ph (__m128h a, __m128h b);
VMAXPH __m256h _mm256_mask_max_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VMAXPH __m256h _mm256_maskz_max_ph (__mmask16 k, __m256h a, __m256h b);
VMAXPH __m256h _mm256_max_ph (__m256h a, __m256h b);
VMAXPH __m512h _mm512_mask_max_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VMAXPH __m512h _mm512_maskz_max_ph (__mmask32 k, __m512h a, __m512h b);
VMAXPH __m512h _mm512_max_ph (__m512h a, __m512h b);
VMAXPH __m512h _mm512_mask_max_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int sae);
VMAXPH __m512h _mm512_maskz_max_round_ph (__mmask32 k, __m512h a, __m512h b, int sae);
VMAXPH __m512h _mm512_max_round_ph (__m512h a, __m512h b, int sae);

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VMAXSH—Return Maximum of Scalar FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 5F /r A V/V AVX512-FP16 Return the maximum low FP16 value between
VMAXSH xmm1{k1}{z}, xmm2, OR AVX10.11 xmm3/m16 and xmm2 and store the result in
xmm3/m16 {sae} xmm1 subject to writemask k1. Bits 127:16 of
xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a compare of the low packed FP16 values in the first source operand and the second
source operand and returns the maximum value for the pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), whether a NaN
or a valid floating-point value, is written to the result. If it is instead required that the NaN source operand (from
either the first or second operand) be returned, the action of VMAXSH can be emulated using a sequence of
instructions, such as a comparison followed by AND, ANDN, and OR.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
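
As a brief, non-normative usage sketch (assuming <immintrin.h> and AVX512-FP16), the {sae} form is reached through the _round_ intrinsics listed below:

#include <immintrin.h>

/* Scalar FP16 max with floating-point exceptions suppressed ({sae});
   bits 127:16 of the result come from a. */
__m128h max_low_sae(__m128h a, __m128h b)
{
    return _mm_max_round_sh(a, b, _MM_FROUND_NO_EXC);
}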

Operation
def MAX(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 > SRC2):
DEST := SRC1
ELSE:
DEST := SRC2



VMAXSH dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp16[0] := MAX(SRC1.fp16[0], SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[j] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMAXSH __m128h _mm_mask_max_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int sae);
VMAXSH __m128h _mm_maskz_max_round_sh (__mmask8 k, __m128h a, __m128h b, int sae);
VMAXSH __m128h _mm_max_round_sh (__m128h a, __m128h b, int sae);
VMAXSH __m128h _mm_mask_max_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMAXSH __m128h _mm_maskz_max_sh (__mmask8 k, __m128h a, __m128h b);
VMAXSH __m128h _mm_max_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VMINPH—Return Minimum of Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 5D /r A V/V (AVX512-FP16 Return the minimum packed FP16 values between
VMINPH xmm1{k1}{z}, xmm2, AND AVX512VL) xmm2 and xmm3/m128/m16bcst and store the
xmm3/m128/m16bcst OR AVX10.11 result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 5D /r A V/V (AVX512-FP16 Return the minimum packed FP16 values between
VMINPH ymm1{k1}{z}, ymm2, AND AVX512VL) ymm2 and ymm3/m256/m16bcst and store the
ymm3/m256/m16bcst OR AVX10.11 result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 5D /r A V/V AVX512-FP16 Return the minimum packed FP16 values between
VMINPH zmm1{k1}{z}, zmm2, OR AVX10.11 zmm2 and zmm3/m512/m16bcst and store the
zmm3/m512/m16bcst {sae} result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD compare of the packed FP16 values in the first source operand and the second
source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), whether a NaN
or a valid floating-point value, is written to the result. If it is instead required that the NaN source operand (from
either the first or second operand) be returned, the action of VMINPH can be emulated using a sequence of
instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcast from a 16-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
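
As an illustrative, non-normative C sketch (assuming <immintrin.h>, AVX512-FP16, and compiler support for the _Float16 type; the helper name is ours), a masked upper clamp built on VMINPH:

#include <immintrin.h>

/* Clamp elements of v from above by bound where the mask bit is set;
   masked-off elements are zeroed (the {z} form). */
__m512h clamp_upper_maskz(__mmask32 k, __m512h v, _Float16 bound)
{
    __m512h b = _mm512_set1_ph(bound); /* may fold to an m16bcst operand */
    return _mm512_maskz_min_ph(k, v, b);
}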

Operation
def MIN(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 < SRC2):
DEST := SRC1
ELSE:
DEST := SRC2



VMINPH dest, src1, src2
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.fp16[j] := MIN(SRC1.fp16[j], tsrc2)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMINPH __m128h _mm_mask_min_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMINPH __m128h _mm_maskz_min_ph (__mmask8 k, __m128h a, __m128h b);
VMINPH __m128h _mm_min_ph (__m128h a, __m128h b);
VMINPH __m256h _mm256_mask_min_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VMINPH __m256h _mm256_maskz_min_ph (__mmask16 k, __m256h a, __m256h b);
VMINPH __m256h _mm256_min_ph (__m256h a, __m256h b);
VMINPH __m512h _mm512_mask_min_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VMINPH __m512h _mm512_maskz_min_ph (__mmask32 k, __m512h a, __m512h b);
VMINPH __m512h _mm512_min_ph (__m512h a, __m512h b);
VMINPH __m512h _mm512_mask_min_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int sae);
VMINPH __m512h _mm512_maskz_min_round_ph (__mmask32 k, __m512h a, __m512h b, int sae);
VMINPH __m512h _mm512_min_round_ph (__m512h a, __m512h b, int sae);

SIMD Floating-Point Exceptions


Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VMINSH—Return Minimum Scalar FP16 Value
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 5D /r A V/V AVX512-FP16 Return the minimum low FP16 value between
VMINSH xmm1{k1}{z}, xmm2, OR AVX10.11 xmm3/m16 and xmm2. Stores the result in
xmm3/m16 {sae} xmm1 subject to writemask k1. Bits 127:16 of
xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a compare of the low packed FP16 values in the first source operand and the second
source operand and returns the minimum value for the pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), whether a NaN
or a valid floating-point value, is written to the result. If it is instead required that the NaN source operand (from
either the first or second operand) be returned, the action of VMINSH can be emulated using a sequence of
instructions, such as a comparison followed by AND, ANDN, and OR.
The first source operand (the second operand) is an XMM register. The second source operand can be an XMM
register or a 16-bit memory location. The destination operand is an XMM register conditionally updated with
writemask k1.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.

Operation
def MIN(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 < SRC2):
DEST := SRC1
ELSE:
DEST := SRC2



VMINSH dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp16[0] := MIN(SRC1.fp16[0], SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[j] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMINSH __m128h _mm_mask_min_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int sae);
VMINSH __m128h _mm_maskz_min_round_sh (__mmask8 k, __m128h a, __m128h b, int sae);
VMINSH __m128h _mm_min_round_sh (__m128h a, __m128h b, int sae);
VMINSH __m128h _mm_mask_min_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMINSH __m128h _mm_maskz_min_sh (__mmask8 k, __m128h a, __m128h b);
VMINSH __m128h _mm_min_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VMOVSH—Move Scalar FP16 Value
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 10 /r A V/V AVX512-FP16 Move FP16 value from m16 to xmm1 subject to
VMOVSH xmm1{k1}{z}, m16 OR AVX10.11 writemask k1.
EVEX.LLIG.F3.MAP5.W0 11 /r B V/V AVX512-FP16 Move low FP16 value from xmm1 to m16 subject
VMOVSH m16{k1}, xmm1 OR AVX10.11 to writemask k1.
EVEX.LLIG.F3.MAP5.W0 10 /r C V/V AVX512-FP16 Move low FP16 values from xmm3 to xmm1
VMOVSH xmm1{k1}{z}, xmm2, xmm3 OR AVX10.11 subject to writemask k1. Bits 127:16 of xmm2
are copied to xmm1[127:16].
EVEX.LLIG.F3.MAP5.W0 11 /r D V/V AVX512-FP16 Move low FP16 values from xmm3 to xmm1
VMOVSH xmm1{k1}{z}, xmm2, xmm3 OR AVX10.11 subject to writemask k1. Bits 127:16 of xmm2
are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A
C N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
D N/A ModRM:r/m (w) VEX.vvvv (r) ModRM:reg (r) N/A

Description
This instruction moves an FP16 value to a register or memory location.
The two register-only forms are aliases and differ only in where their operands are encoded; this is a side effect of
the encodings selected.
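
As a brief, non-normative C sketch (assuming <immintrin.h>, AVX512-FP16, and _Float16 support), the load and store forms round-trip a single FP16 value:

#include <immintrin.h>

/* VMOVSH xmm, m16 zeroes bits MAXVL-1:16 of the destination register;
   VMOVSH m16, xmm stores only the low word. */
void copy_fp16(_Float16 *dst, const _Float16 *src)
{
    __m128h t = _mm_load_sh(src);
    _mm_store_sh(dst, t);
}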

Operation
VMOVSH dest, src (two operand load)
IF k1[0] or no writemask:
DEST.fp16[0] := SRC.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// ELSE DEST.fp16[0] remains unchanged

DEST[MAXVL-1:16] := 0

VMOVSH dest, src (two operand store)


IF k1[0] or no writemask:
DEST.fp16[0] := SRC.fp16[0]
// ELSE DEST.fp16[0] remains unchanged



VMOVSH dest, src1, src2 (three operand copy)
IF k1[0] or no writemask:
DEST.fp16[0] := SRC2.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// ELSE DEST.fp16[0] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMOVSH __m128h _mm_load_sh (void const* mem_addr);
VMOVSH __m128h _mm_mask_load_sh (__m128h src, __mmask8 k, void const* mem_addr);
VMOVSH __m128h _mm_maskz_load_sh (__mmask8 k, void const* mem_addr);
VMOVSH __m128h _mm_mask_move_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMOVSH __m128h _mm_maskz_move_sh (__mmask8 k, __m128h a, __m128h b);
VMOVSH __m128h _mm_move_sh (__m128h a, __m128h b);
VMOVSH void _mm_mask_store_sh (void * mem_addr, __mmask8 k, __m128h a);
VMOVSH void _mm_store_sh (void * mem_addr, __m128h a);

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Table 2-53, “Type E5 Class Exception Conditions.”



VMOVW—Move Word
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.MAP5.WIG 6E /r A V/V AVX512-FP16 Copy word from reg/m16 to xmm1.
VMOVW xmm1, reg/m16 OR AVX10.11
EVEX.128.66.MAP5.WIG 7E /r B V/V AVX512-FP16 Copy word from xmm1 to reg/m16.
VMOVW reg/m16, xmm1 OR AVX10.11

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
This instruction either (a) copies one word element from an XMM register to a general-purpose register or memory
location or (b) copies one word element from a general-purpose register or memory location to an XMM register.
When writing a general-purpose register, the lower 16 bits of the register contain the word value and the upper bits
of the register are written with zeros.
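
As a brief, non-normative C sketch (assuming <immintrin.h> and AVX512-FP16), a 16-bit round trip through an XMM register with the intrinsics listed below:

#include <immintrin.h>

/* VMOVW xmm, reg zeroes bits MAXVL-1:16 of the XMM destination;
   VMOVW reg, xmm zeroes the upper bits of the GPR destination. */
short roundtrip_word(short v)
{
    __m128i x = _mm_cvtsi16_si128(v);
    return _mm_cvtsi128_si16(x);
}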

Operation
VMOVW dest, src (two operand load)
DEST.word[0] := SRC.word[0]
DEST[MAXVL-1:16] := 0

VMOVW dest, src (two operand store)


DEST.word[0] := SRC.word[0]
// upper bits of GPR DEST are zeroed

Intel C/C++ Compiler Intrinsic Equivalent


VMOVW short _mm_cvtsi128_si16 (__m128i a);
VMOVW __m128i _mm_cvtsi16_si128 (short a);

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instructions, see Table 2-59, “Type E9NF Class Exception Conditions.”



VMULPH—Multiply Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 59 /r A V/V (AVX512-FP16 Multiply packed FP16 values from
VMULPH xmm1{k1}{z}, xmm2, AND AVX512VL) xmm3/m128/m16bcst to xmm2 and store the
xmm3/m128/m16bcst OR AVX10.11 result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 59 /r A V/V (AVX512-FP16 Multiply packed FP16 values from
VMULPH ymm1{k1}{z}, ymm2, AND AVX512VL) ymm3/m256/m16bcst to ymm2 and store the
ymm3/m256/m16bcst OR AVX10.11 result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 59 /r A V/V AVX512-FP16 Multiply packed FP16 values in
VMULPH zmm1{k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m16bcst with zmm2 and store the
zmm3/m512/m16bcst {er} result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction multiplies packed FP16 values from the source operands and stores the packed FP16 result in the
destination operand. The destination elements are updated according to the writemask.
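
As an illustrative, non-normative sketch of the embedded-rounding form (assuming <immintrin.h> and AVX512-FP16), {er} is available only at the 512-bit length and is selected through the _round_ intrinsics:

#include <immintrin.h>

/* Multiply with round-toward-zero and exceptions suppressed; without the
   _round_ form, the MXCSR rounding mode applies. */
__m512h mul_rtz(__m512h a, __m512h b)
{
    return _mm512_mul_round_ph(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}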

Operation
VMULPH (EVEX encoded versions) when src2 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VMULPH (EVEX encoded versions) when src2 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMULPH __m128h _mm_mask_mul_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMULPH __m128h _mm_maskz_mul_ph (__mmask8 k, __m128h a, __m128h b);
VMULPH __m128h _mm_mul_ph (__m128h a, __m128h b);
VMULPH __m256h _mm256_mask_mul_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VMULPH __m256h _mm256_maskz_mul_ph (__mmask16 k, __m256h a, __m256h b);
VMULPH __m256h _mm256_mul_ph (__m256h a, __m256h b);
VMULPH __m512h _mm512_mask_mul_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VMULPH __m512h _mm512_maskz_mul_ph (__mmask32 k, __m512h a, __m512h b);
VMULPH __m512h _mm512_mul_ph (__m512h a, __m512h b);
VMULPH __m512h _mm512_mask_mul_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int rounding);
VMULPH __m512h _mm512_maskz_mul_round_ph (__mmask32 k, __m512h a, __m512h b, int rounding);
VMULPH __m512h _mm512_mul_round_ph (__m512h a, __m512h b, int rounding);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal

Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”



VMULSH—Multiply Scalar FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 59 /r A V/V AVX512-FP16 Multiply the low FP16 value in xmm3/m16 by low
VMULSH xmm1{k1}{z}, xmm2, OR AVX10.11 FP16 value in xmm2, and store the result in
xmm3/m16 {er} xmm1 subject to writemask k1. Bits 127:16 of
xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction multiplies the low FP16 values from the source operands and stores the FP16 result in the
destination operand. Bits 127:16 of the destination operand are copied from the corresponding bits of the first
source operand. Bits MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination
is updated according to the writemask.
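
As a brief, non-normative usage sketch (assuming <immintrin.h> and AVX512-FP16):

#include <immintrin.h>

/* Multiply only the low FP16 elements; when k[0] = 0, the {z} form writes
   0.0 to the low element. Bits 127:16 of the result come from a. */
__m128h mul_low_maskz(__mmask8 k, __m128h a, __m128h b)
{
    return _mm_maskz_mul_sh(k, a, b);
}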

Operation
VMULSH (EVEX encoded versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


DEST.fp16[0] := SRC1.fp16[0] * SRC2.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged

DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VMULSH __m128h _mm_mask_mul_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int rounding);
VMULSH __m128h _mm_maskz_mul_round_sh (__mmask8 k, __m128h a, __m128h b, int rounding);
VMULSH __m128h _mm_mul_round_sh (__m128h a, __m128h b, int rounding);
VMULSH __m128h _mm_mask_mul_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VMULSH __m128h _mm_maskz_mul_sh (__mmask8 k, __m128h a, __m128h b);
VMULSH __m128h _mm_mul_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.



Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VPBLENDMB/VPBLENDMW—Blend Byte/Word Vectors Using an Opmask Control
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 66 /r A V/V (AVX512VL AND Blend byte integer vector xmm2 and byte vector
VPBLENDMB xmm1 {k1}{z}, AVX512BW) OR xmm3/m128 and store the result in xmm1, under
xmm2, xmm3/m128 AVX10.11 control mask.
EVEX.256.66.0F38.W0 66 /r A V/V (AVX512VL AND Blend byte integer vector ymm2 and byte vector
VPBLENDMB ymm1 {k1}{z}, AVX512BW) OR ymm3/m256 and store the result in ymm1, under
ymm2, ymm3/m256 AVX10.11 control mask.
EVEX.512.66.0F38.W0 66 /r A V/V AVX512BW Blend byte integer vector zmm2 and byte vector
VPBLENDMB zmm1 {k1}{z}, OR AVX10.11 zmm3/m512 and store the result in zmm1, under
zmm2, zmm3/m512 control mask.
EVEX.128.66.0F38.W1 66 /r A V/V (AVX512VL AND Blend word integer vector xmm2 and word vector
VPBLENDMW xmm1 {k1}{z}, AVX512BW) OR xmm3/m128 and store the result in xmm1, under
xmm2, xmm3/m128 AVX10.11 control mask.
EVEX.256.66.0F38.W1 66 /r A V/V (AVX512VL AND Blend word integer vector ymm2 and word vector
VPBLENDMW ymm1 {k1}{z}, AVX512BW) OR ymm3/m256 and store the result in ymm1, under
ymm2, ymm3/m256 AVX10.11 control mask.
EVEX.512.66.0F38.W1 66 /r A V/V AVX512BW Blend word integer vector zmm2 and word vector
VPBLENDMW zmm1 {k1}{z}, OR AVX10.11 zmm3/m512 and store the result in zmm1, under
zmm2, zmm3/m512 control mask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an element-by-element blending of byte/word elements between the first source operand and the second
source operand (a register or a memory operand), using the opmask register as the selector. The result is written
into the destination register.
The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The mask is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every
element of the destination is conditionally selected between first source or second source using the value of the
related mask bit (0 for first source, 1 for second source).
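
As an illustrative, non-normative C sketch of the selector semantics (assuming <immintrin.h> and AVX512BW), note that the mask picks per element between the two sources rather than gating a write:

#include <immintrin.h>

/* result.byte[i] = k[i] ? b.byte[i] : a.byte[i] */
__m512i select_bytes(__mmask64 k, __m512i a, __m512i b)
{
    return _mm512_mask_blend_epi8(k, a, b);
}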



Operation
VPBLENDMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC2[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+7:i] := SRC1[i+7:i]
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;

VPBLENDMW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC2[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+15:i] := SRC1[i+15:i]
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPBLENDMB __m512i _mm512_mask_blend_epi8(__mmask64 m, __m512i a, __m512i b);
VPBLENDMB __m256i _mm256_mask_blend_epi8(__mmask32 m, __m256i a, __m256i b);
VPBLENDMB __m128i _mm_mask_blend_epi8(__mmask16 m, __m128i a, __m128i b);
VPBLENDMW __m512i _mm512_mask_blend_epi16(__mmask32 m, __m512i a, __m512i b);
VPBLENDMW __m256i _mm256_mask_blend_epi16(__mmask16 m, __m256i a, __m256i b);
VPBLENDMW __m128i _mm_mask_blend_epi16(__mmask8 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions


None

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VPBLENDMD/VPBLENDMQ—Blend Int32/Int64 Vectors Using an OpMask Control
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 64 /r A V/V (AVX512VL AND Blend doubleword integer vector xmm2 and
VPBLENDMD xmm1 {k1}{z}, AVX512F) OR doubleword vector xmm3/m128/m32bcst and
xmm2, xmm3/m128/m32bcst AVX10.11 store the result in xmm1, under control mask.
EVEX.256.66.0F38.W0 64 /r A V/V (AVX512VL AND Blend doubleword integer vector ymm2 and
VPBLENDMD ymm1 {k1}{z}, ymm2, AVX512F) OR doubleword vector ymm3/m256/m32bcst and
ymm3/m256/m32bcst AVX10.11 store the result in ymm1, under control mask.
EVEX.512.66.0F38.W0 64 /r A V/V AVX512F Blend doubleword integer vector zmm2 and
VPBLENDMD zmm1 {k1}{z}, zmm2, OR AVX10.11 doubleword vector zmm3/m512/m32bcst and
zmm3/m512/m32bcst store the result in zmm1, under control mask.
EVEX.128.66.0F38.W1 64 /r A V/V (AVX512VL AND Blend quadword integer vector xmm2 and
VPBLENDMQ xmm1 {k1}{z}, AVX512F) OR quadword vector xmm3/m128/m64bcst and store
xmm2, xmm3/m128/m64bcst AVX10.11 the result in xmm1, under control mask.
EVEX.256.66.0F38.W1 64 /r A V/V (AVX512VL AND Blend quadword integer vector ymm2 and
VPBLENDMQ ymm1 {k1}{z}, AVX512F) OR quadword vector ymm3/m256/m64bcst and store
ymm2, ymm3/m256/m64bcst AVX10.11 the result in ymm1, under control mask.
EVEX.512.66.0F38.W1 64 /r A V/V AVX512F Blend quadword integer vector zmm2 and
VPBLENDMQ zmm1 {k1}{z}, zmm2, OR AVX10.11 quadword vector zmm3/m512/m64bcst and store
zmm3/m512/m64bcst the result in zmm1, under control mask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs an element-by-element blending of dword/qword elements between the first source operand (the second
operand) and the elements of the second source operand (the third operand) using an opmask register as select
control. The blended result is written into the destination.
The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a
32-bit (VPBLENDMD) or 64-bit (VPBLENDMQ) memory location.
The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element
selector: every element of the destination is conditionally selected between first source or second source using the
value of the related mask bit (0 for the first source operand, 1 for the second source operand).
If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.
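
As an illustrative, non-normative C sketch (assuming <immintrin.h> and AVX-512F), a compare-and-select idiom built from a mask compare plus VPBLENDMD; VPMAXSD computes the same result directly, so the point here is the blend pattern:

#include <immintrin.h>

/* Per-element signed-dword maximum: take b where a < b, otherwise a. */
__m512i max_epi32_blend(__m512i a, __m512i b)
{
    __mmask16 lt = _mm512_cmplt_epi32_mask(a, b);
    return _mm512_mask_blend_epi32(lt, a, b);
}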



Operation
VPBLENDMD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no controlmask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC2[31:0]
ELSE
DEST[i+31:i] := SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+31:i] := SRC1[i+31:i]
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;

VPBLENDMQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no controlmask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC2[63:0]
ELSE
DEST[i+63:i] := SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+63:i] := SRC1[i+63:i]
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPBLENDMD __m512i _mm512_mask_blend_epi32(__mmask16 k, __m512i a, __m512i b);
VPBLENDMD __m256i _mm256_mask_blend_epi32(__mmask8 m, __m256i a, __m256i b);
VPBLENDMD __m128i _mm_mask_blend_epi32(__mmask8 m, __m128i a, __m128i b);
VPBLENDMQ __m512i _mm512_mask_blend_epi64(__mmask8 k, __m512i a, __m512i b);
VPBLENDMQ __m256i _mm256_mask_blend_epi64(__mmask8 m, __m256i a, __m256i b);
VPBLENDMQ __m128i _mm_mask_blend_epi64(__mmask8 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions


None

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VPBROADCASTB/W/D/Q—Load With Broadcast Integer Data From General Purpose Register
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Flag
Mode
Support
EVEX.128.66.0F38.W0 7A /r A V/V (AVX512VL AND Broadcast an 8-bit value from a GPR to all bytes in
VPBROADCASTB xmm1 {k1}{z}, reg AVX512BW) OR the 128-bit destination subject to writemask k1.
AVX10.11
EVEX.256.66.0F38.W0 7A /r A V/V (AVX512VL AND Broadcast an 8-bit value from a GPR to all bytes in
VPBROADCASTB ymm1 {k1}{z}, reg AVX512BW) OR the 256-bit destination subject to writemask k1.
AVX10.11
EVEX.512.66.0F38.W0 7A /r A V/V AVX512BW Broadcast an 8-bit value from a GPR to all bytes in
VPBROADCASTB zmm1 {k1}{z}, reg OR AVX10.11 the 512-bit destination subject to writemask k1.
EVEX.128.66.0F38.W0 7B /r A V/V (AVX512VL AND Broadcast a 16-bit value from a GPR to all words in
VPBROADCASTW xmm1 {k1}{z}, reg AVX512BW) OR the 128-bit destination subject to writemask k1.
AVX10.11
EVEX.256.66.0F38.W0 7B /r A V/V (AVX512VL AND Broadcast a 16-bit value from a GPR to all words in
VPBROADCASTW ymm1 {k1}{z}, reg AVX512BW) OR the 256-bit destination subject to writemask k1.
AVX10.11
EVEX.512.66.0F38.W0 7B /r A V/V AVX512BW Broadcast a 16-bit value from a GPR to all words in
VPBROADCASTW zmm1 {k1}{z}, reg OR AVX10.11 the 512-bit destination subject to writemask k1.
EVEX.128.66.0F38.W0 7C /r A V/V (AVX512VL AND Broadcast a 32-bit value from a GPR to all
VPBROADCASTD xmm1 {k1}{z}, r32 AVX512F) OR doublewords in the 128-bit destination subject to
AVX10.11 writemask k1.
EVEX.256.66.0F38.W0 7C /r A V/V (AVX512VL AND Broadcast a 32-bit value from a GPR to all
VPBROADCASTD ymm1 {k1}{z}, r32 AVX512F) OR doublewords in the 256-bit destination subject to
AVX10.11 writemask k1.
EVEX.512.66.0F38.W0 7C /r A V/V AVX512F Broadcast a 32-bit value from a GPR to all
VPBROADCASTD zmm1 {k1}{z}, r32 OR AVX10.11 doublewords in the 512-bit destination subject to
writemask k1.
EVEX.128.66.0F38.W1 7C /r A V/N.E.1 (AVX512VL AND Broadcast a 64-bit value from a GPR to all
VPBROADCASTQ xmm1 {k1}{z}, r64 AVX512F) OR quadwords in the 128-bit destination subject to
AVX10.11 writemask k1.
EVEX.256.66.0F38.W1 7C /r A V/N.E.1 (AVX512VL AND Broadcast a 64-bit value from a GPR to all
VPBROADCASTQ ymm1 {k1}{z}, r64 AVX512F) OR quadwords in the 256-bit destination subject to
AVX10.11 writemask k1.
EVEX.512.66.0F38.W1 7C /r A V/N.E.2 AVX512F Broadcast a 64-bit value from a GPR to all
VPBROADCASTQ zmm1 {k1}{z}, r64 OR AVX10.11 quadwords in the 512-bit destination subject to
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such determines the set of instructions, listed in the above opcode table, that are available to the programmer.
2. EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Broadcasts an 8-bit, 16-bit, 32-bit, or 64-bit value from a general-purpose register (the second operand) to all the
locations in the destination vector register (the first operand), using the writemask k1.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
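
As a brief, non-normative C sketch (assuming <immintrin.h> and AVX-512F), broadcasting a GPR value under a zeroing writemask:

#include <immintrin.h>

/* Splat v to all 16 dword lanes; lanes whose mask bit is clear become 0. */
__m512i splat_maskz(__mmask16 k, int v)
{
    return _mm512_maskz_set1_epi32(k, v);
}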

Operation
VPBROADCASTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[7:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC[15:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPBROADCASTB __m512i _mm512_mask_set1_epi8(__m512i s, __mmask64 k, int a);
VPBROADCASTB __m512i _mm512_maskz_set1_epi8( __mmask64 k, int a);
VPBROADCASTB __m256i _mm256_mask_set1_epi8(__m256i s, __mmask32 k, int a);
VPBROADCASTB __m256i _mm256_maskz_set1_epi8( __mmask32 k, int a);
VPBROADCASTB __m128i _mm_mask_set1_epi8(__m128i s, __mmask16 k, int a);
VPBROADCASTB __m128i _mm_maskz_set1_epi8( __mmask16 k, int a);
VPBROADCASTD __m512i _mm512_mask_set1_epi32(__m512i s, __mmask16 k, int a);
VPBROADCASTD __m512i _mm512_maskz_set1_epi32( __mmask16 k, int a);
VPBROADCASTD __m256i _mm256_mask_set1_epi32(__m256i s, __mmask8 k, int a);
VPBROADCASTD __m256i _mm256_maskz_set1_epi32( __mmask8 k, int a);
VPBROADCASTD __m128i _mm_mask_set1_epi32(__m128i s, __mmask8 k, int a);
VPBROADCASTD __m128i _mm_maskz_set1_epi32( __mmask8 k, int a);
VPBROADCASTQ __m512i _mm512_mask_set1_epi64(__m512i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m512i _mm512_maskz_set1_epi64( __mmask8 k, __int64 a);
VPBROADCASTQ __m256i _mm256_mask_set1_epi64(__m256i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m256i _mm256_maskz_set1_epi64( __mmask8 k, __int64 a);
VPBROADCASTQ __m128i _mm_mask_set1_epi64(__m128i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m128i _mm_maskz_set1_epi64( __mmask8 k, __int64 a);
VPBROADCASTW __m512i _mm512_mask_set1_epi16(__m512i s, __mmask32 k, int a);
VPBROADCASTW __m512i _mm512_maskz_set1_epi16( __mmask32 k, int a);
VPBROADCASTW __m256i _mm256_mask_set1_epi16(__m256i s, __mmask16 k, int a);
VPBROADCASTW __m256i _mm256_maskz_set1_epi16( __mmask16 k, int a);
VPBROADCASTW __m128i _mm_mask_set1_epi16(__m128i s, __mmask8 k, int a);
VPBROADCASTW __m128i _mm_maskz_set1_epi16( __mmask8 k, int a);

Exceptions
EVEX-encoded instructions, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPBROADCAST—Load Integer and Broadcast
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8 | A | V/V | AVX2 | Broadcast a byte integer in the source operand to sixteen locations in xmm1.
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8 | A | V/V | AVX2 | Broadcast a byte integer in the source operand to thirty-two locations in ymm1.
EVEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1{k1}{z}, xmm2/m8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Broadcast a byte integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1{k1}{z}, xmm2/m8 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Broadcast a byte integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 78 /r VPBROADCASTB zmm1{k1}{z}, xmm2/m8 | B | V/V | AVX512BW OR AVX10.1¹ | Broadcast a byte integer in the source operand to 64 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1, xmm2/m16 | A | V/V | AVX2 | Broadcast a word integer in the source operand to eight locations in xmm1.
VEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1, xmm2/m16 | A | V/V | AVX2 | Broadcast a word integer in the source operand to sixteen locations in ymm1.
EVEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1{k1}{z}, xmm2/m16 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Broadcast a word integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1{k1}{z}, xmm2/m16 | B | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Broadcast a word integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 79 /r VPBROADCASTW zmm1{k1}{z}, xmm2/m16 | B | V/V | AVX512BW OR AVX10.1¹ | Broadcast a word integer in the source operand to 32 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1, xmm2/m32 | A | V/V | AVX2 | Broadcast a dword integer in the source operand to four locations in xmm1.
VEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1, xmm2/m32 | A | V/V | AVX2 | Broadcast a dword integer in the source operand to eight locations in ymm1.
EVEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1 {k1}{z}, xmm2/m32 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Broadcast a dword integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1 {k1}{z}, xmm2/m32 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Broadcast a dword integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 58 /r VPBROADCASTD zmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512F OR AVX10.1¹ | Broadcast a dword integer in the source operand to locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 59 /r VPBROADCASTQ xmm1, xmm2/m64 | A | V/V | AVX2 | Broadcast a qword element in source operand to two locations in xmm1.
VEX.256.66.0F38.W0 59 /r VPBROADCASTQ ymm1, xmm2/m64 | A | V/V | AVX2 | Broadcast a qword element in source operand to four locations in ymm1.
EVEX.128.66.0F38.W1 59 /r VPBROADCASTQ xmm1 {k1}{z}, xmm2/m64 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Broadcast a qword element in source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W1 59 /r VPBROADCASTQ ymm1 {k1}{z}, xmm2/m64 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Broadcast a qword element in source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W1 59 /r VPBROADCASTQ zmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512F OR AVX10.1¹ | Broadcast a qword element in source operand to locations in zmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 59 /r VBROADCASTI32x2 xmm1 {k1}{z}, xmm2/m64 | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Broadcast two dword elements in source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 59 /r VBROADCASTI32x2 ymm1 {k1}{z}, xmm2/m64 | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Broadcast two dword elements in source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 59 /r VBROADCASTI32x2 zmm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512DQ OR AVX10.1¹ | Broadcast two dword elements in source operand to locations in zmm1 subject to writemask k1.
VEX.256.66.0F38.W0 5A /r VBROADCASTI128 ymm1, m128 | A | V/V | AVX2 | Broadcast 128 bits of integer data in mem to low and high 128-bits in ymm1.
EVEX.256.66.0F38.W0 5A /r VBROADCASTI32X4 ymm1 {k1}{z}, m128 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Broadcast 128 bits of 4 doubleword integer data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 5A /r VBROADCASTI32X4 zmm1 {k1}{z}, m128 | D | V/V | AVX512F OR AVX10.1¹ | Broadcast 128 bits of 4 doubleword integer data in mem to locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W1 5A /r VBROADCASTI64X2 ymm1 {k1}{z}, m128 | C | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1¹ | Broadcast 128 bits of 2 quadword integer data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 5A /r VBROADCASTI64X2 zmm1 {k1}{z}, m128 | C | V/V | AVX512DQ OR AVX10.1¹ | Broadcast 128 bits of 2 quadword integer data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W0 5B /r VBROADCASTI32X8 zmm1 {k1}{z}, m256 | E | V/V | AVX512DQ OR AVX10.1¹ | Broadcast 256 bits of 8 doubleword integer data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W1 5B /r VBROADCASTI64X4 zmm1 {k1}{z}, m256 | D | V/V | AVX512F OR AVX10.1¹ | Broadcast 256 bits of 4 quadword integer data in mem to locations in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
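To illustrate the run-time check described in note 1 (an editorial sketch, not SDM text): the helper below uses the GCC/Clang __get_cpuid_count wrapper and assumes the AVX10 enumeration bits as documented in the Intel AVX10 architecture specification (CPUID.(EAX=07H,ECX=01H):EDX[19] for AVX10, and CPUID.(EAX=24H,ECX=0):EBX[18] for 512-bit vector support); these positions should be verified against current documentation before being relied upon.

#include <stdbool.h>
#include <cpuid.h>   /* GCC/Clang helper; MSVC would use __cpuidex instead */

/* Hedged sketch: returns true when AVX10 with 512-bit vectors is enumerated. */
static bool avx10_512_supported(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx))
        return false;
    if (!(edx & (1u << 19)))          /* AVX10 not enumerated */
        return false;
    if (!__get_cpuid_count(0x24, 0x00, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 18)) != 0;   /* 512-bit vector width available */
}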


Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Tuple2 ModRM:reg (w) ModRM:r/m (r) N/A N/A
D Tuple4 ModRM:reg (w) ModRM:r/m (r) N/A N/A
E Tuple8 ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Load integer data from the source operand (the second operand) and broadcast to all elements of the destination
operand (the first operand).
VEX.256-encoded VPBROADCASTB/W/D/Q: The source operand is an 8-bit, 16-bit, 32-bit, or 64-bit memory location, or the low 8-bit, 16-bit, 32-bit, or 64-bit data in an XMM register. The destination operand is a YMM register. VBROADCASTI128 supports only a 128-bit memory location as its source operand; register source encodings for VBROADCASTI128 are reserved and will #UD. Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX-encoded VPBROADCASTD/Q: The source operand is a 32-bit or 64-bit memory location, or the low 32-bit or 64-bit data in an XMM register. The destination operand is a ZMM/YMM/XMM register updated according to the writemask k1.
VBROADCASTI32X4 and VBROADCASTI64X4: The destination operand is a ZMM register updated according to the writemask k1. The source operand is a 128-bit or 256-bit memory location. Register source encodings for VBROADCASTI32X4 and VBROADCASTI64X4 are reserved and will #UD.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.
An attempt to execute VBROADCASTI128 encoded with VEX.L = 0 causes a #UD exception.

Figure 5-16. VPBROADCASTD Operation (VEX.256 encoded version). [Figure: the 32-bit source element X0 from m32 is replicated into all eight dword positions of DEST.]

Figure 5-17. VPBROADCASTD Operation (128-bit version). [Figure: X0 is replicated into the four low dword positions of DEST; the upper dwords are zeroed.]

Figure 5-18. VPBROADCASTQ Operation (256-bit version). [Figure: the 64-bit source element X0 from m64 is replicated into all four qword positions of DEST.]

Figure 5-19. VBROADCASTI128 Operation (256-bit version). [Figure: the 128-bit source X0 from m128 is copied into both 128-bit halves of DEST.]

Figure 5-20. VBROADCASTI256 Operation (512-bit version). [Figure: the 256-bit source X0 from m256 is copied into both 256-bit halves of DEST.]
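As a brief usage illustration (not part of the manual's normative text), the sketch below exercises the VEX and EVEX broadcast forms through the intrinsics listed later in this section; _mm256_set1_epi32 is the portable idiom that compilers typically lower to VPBROADCASTD.

#include <immintrin.h>

/* Minimal sketch. Compile with -mavx2; the masked form additionally
   assumes AVX512VL+AVX512F (or AVX10.1) support at run time. */
__m256i broadcast_plain(int x)
{
    return _mm256_set1_epi32(x);            /* typically VPBROADCASTD ymm, xmm/m32 */
}

__m256i broadcast_merge(__m256i src, __mmask8 k, __m128i x)
{
    /* EVEX merging form: lanes with k[j] = 0 keep their value from src. */
    return _mm256_mask_broadcastd_epi32(src, k, x);
}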


Operation
VPBROADCASTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[7:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC[15:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTD (128 bit version)


temp := SRC[31:0]
DEST[31:0] := temp
DEST[63:32] := temp
DEST[95:64] := temp
DEST[127:96] := temp
DEST[MAXVL-1:128] := 0

VPBROADCASTD (VEX.256 encoded version)


temp := SRC[31:0]
DEST[31:0] := temp
DEST[63:32] := temp
DEST[95:64] := temp
DEST[127:96] := temp
DEST[159:128] := temp
DEST[191:160] := temp
DEST[223:192] := temp
DEST[255:224] := temp
DEST[MAXVL-1:256] := 0



VPBROADCASTD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPBROADCASTQ (VEX.256 encoded version)


temp := SRC[63:0]
DEST[63:0] := temp
DEST[127:64] := temp
DEST[191:128] := temp
DEST[255:192] := temp
DEST[MAXVL-1:256] := 0

VPBROADCASTQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTI32x2 (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
n := (j mod 2) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR

DEST[MAXVL-1:VL] := 0

VBROADCASTI128 (VEX.256 encoded version)


temp := SRC[127:0]
DEST[127:0] := temp
DEST[255:128] := temp
DEST[MAXVL-1:256] := 0

VBROADCASTI32X4 (EVEX encoded versions)


(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j* 32
n := (j modulo 4) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTI64X2 (EVEX encoded versions)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
n := (j modulo 2) * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[n+63:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTI32X8 (EVEX.U1.512 encoded version)


FOR j := 0 TO 15
i := j * 32
n := (j modulo 8) * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[n+31:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

VBROADCASTI64X4 (EVEX.512 encoded version)


FOR j := 0 TO 7
i := j * 64
n := (j modulo 4) * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[n+63:n]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPBROADCASTB __m512i _mm512_broadcastb_epi8( __m128i a);
VPBROADCASTB __m512i _mm512_mask_broadcastb_epi8(__m512i s, __mmask64 k, __m128i a);
VPBROADCASTB __m512i _mm512_maskz_broadcastb_epi8( __mmask64 k, __m128i a);
VPBROADCASTB __m256i _mm256_broadcastb_epi8(__m128i a);
VPBROADCASTB __m256i _mm256_mask_broadcastb_epi8(__m256i s, __mmask32 k, __m128i a);
VPBROADCASTB __m256i _mm256_maskz_broadcastb_epi8( __mmask32 k, __m128i a);
VPBROADCASTB __m128i _mm_mask_broadcastb_epi8(__m128i s, __mmask16 k, __m128i a);
VPBROADCASTB __m128i _mm_maskz_broadcastb_epi8( __mmask16 k, __m128i a);
VPBROADCASTB __m128i _mm_broadcastb_epi8(__m128i a);
VPBROADCASTD __m512i _mm512_broadcastd_epi32( __m128i a);
VPBROADCASTD __m512i _mm512_mask_broadcastd_epi32(__m512i s, __mmask16 k, __m128i a);
VPBROADCASTD __m512i _mm512_maskz_broadcastd_epi32( __mmask16 k, __m128i a);
VPBROADCASTD __m256i _mm256_broadcastd_epi32( __m128i a);
VPBROADCASTD __m256i _mm256_mask_broadcastd_epi32(__m256i s, __mmask8 k, __m128i a);
VPBROADCASTD __m256i _mm256_maskz_broadcastd_epi32( __mmask8 k, __m128i a);
VPBROADCASTD __m128i _mm_broadcastd_epi32(__m128i a);
VPBROADCASTD __m128i _mm_mask_broadcastd_epi32(__m128i s, __mmask8 k, __m128i a);
VPBROADCASTD __m128i _mm_maskz_broadcastd_epi32( __mmask8 k, __m128i a);
VPBROADCASTQ __m512i _mm512_broadcastq_epi64( __m128i a);
VPBROADCASTQ __m512i _mm512_mask_broadcastq_epi64(__m512i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m512i _mm512_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTQ __m256i _mm256_broadcastq_epi64(__m128i a);
VPBROADCASTQ __m256i _mm256_mask_broadcastq_epi64(__m256i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m256i _mm256_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTQ __m128i _mm_broadcastq_epi64(__m128i a);
VPBROADCASTQ __m128i _mm_mask_broadcastq_epi64(__m128i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m128i _mm_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTW __m512i _mm512_broadcastw_epi16(__m128i a);
VPBROADCASTW __m512i _mm512_mask_broadcastw_epi16(__m512i s, __mmask32 k, __m128i a);
VPBROADCASTW __m512i _mm512_maskz_broadcastw_epi16( __mmask32 k, __m128i a);
VPBROADCASTW __m256i _mm256_broadcastw_epi16(__m128i a);
VPBROADCASTW __m256i _mm256_mask_broadcastw_epi16(__m256i s, __mmask16 k, __m128i a);
VPBROADCASTW __m256i _mm256_maskz_broadcastw_epi16( __mmask16 k, __m128i a);
VPBROADCASTW __m128i _mm_broadcastw_epi16(__m128i a);

VPBROADCASTW __m128i _mm_mask_broadcastw_epi16(__m128i s, __mmask8 k, __m128i a);
VPBROADCASTW __m128i _mm_maskz_broadcastw_epi16( __mmask8 k, __m128i a);
VBROADCASTI32x2 __m512i _mm512_broadcast_i32x2( __m128i a);
VBROADCASTI32x2 __m512i _mm512_mask_broadcast_i32x2(__m512i s, __mmask16 k, __m128i a);
VBROADCASTI32x2 __m512i _mm512_maskz_broadcast_i32x2( __mmask16 k, __m128i a);
VBROADCASTI32x2 __m256i _mm256_broadcast_i32x2( __m128i a);
VBROADCASTI32x2 __m256i _mm256_mask_broadcast_i32x2(__m256i s, __mmask8 k, __m128i a);
VBROADCASTI32x2 __m256i _mm256_maskz_broadcast_i32x2( __mmask8 k, __m128i a);
VBROADCASTI32x2 __m128i _mm_broadcast_i32x2(__m128i a);
VBROADCASTI32x2 __m128i _mm_mask_broadcast_i32x2(__m128i s, __mmask8 k, __m128i a);
VBROADCASTI32x2 __m128i _mm_maskz_broadcast_i32x2( __mmask8 k, __m128i a);
VBROADCASTI32x4 __m512i _mm512_broadcast_i32x4( __m128i a);
VBROADCASTI32x4 __m512i _mm512_mask_broadcast_i32x4(__m512i s, __mmask16 k, __m128i a);
VBROADCASTI32x4 __m512i _mm512_maskz_broadcast_i32x4( __mmask16 k, __m128i a);
VBROADCASTI32x4 __m256i _mm256_broadcast_i32x4( __m128i a);
VBROADCASTI32x4 __m256i _mm256_mask_broadcast_i32x4(__m256i s, __mmask8 k, __m128i a);
VBROADCASTI32x4 __m256i _mm256_maskz_broadcast_i32x4( __mmask8 k, __m128i a);
VBROADCASTI32x8 __m512i _mm512_broadcast_i32x8( __m256i a);
VBROADCASTI32x8 __m512i _mm512_mask_broadcast_i32x8(__m512i s, __mmask16 k, __m256i a);
VBROADCASTI32x8 __m512i _mm512_maskz_broadcast_i32x8( __mmask16 k, __m256i a);
VBROADCASTI64x2 __m512i _mm512_broadcast_i64x2( __m128i a);
VBROADCASTI64x2 __m512i _mm512_mask_broadcast_i64x2(__m512i s, __mmask8 k, __m128i a);
VBROADCASTI64x2 __m512i _mm512_maskz_broadcast_i64x2( __mmask8 k, __m128i a);
VBROADCASTI64x2 __m256i _mm256_broadcast_i64x2( __m128i a);
VBROADCASTI64x2 __m256i _mm256_mask_broadcast_i64x2(__m256i s, __mmask8 k, __m128i a);
VBROADCASTI64x2 __m256i _mm256_maskz_broadcast_i64x2( __mmask8 k, __m128i a);
VBROADCASTI64x4 __m512i _mm512_broadcast_i64x4( __m256i a);
VBROADCASTI64x4 __m512i _mm512_mask_broadcast_i64x4(__m512i s, __mmask8 k, __m256i a);
VBROADCASTI64x4 __m512i _mm512_maskz_broadcast_i64x4( __mmask8 k, __m256i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, syntax with reg/mem operand, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0 for VBROADCASTI128.
If EVEX.L’L = 0 for VBROADCASTI32X4/VBROADCASTI64X2.
If EVEX.L’L < 10b for VBROADCASTI32X8/VBROADCASTI64X4.



VPBROADCASTM—Broadcast Mask to Vector Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W1 2A /r VPBROADCASTMB2Q xmm1, k1 | RM | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Broadcast low byte value in k1 to two locations in xmm1.
EVEX.256.F3.0F38.W1 2A /r VPBROADCASTMB2Q ymm1, k1 | RM | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Broadcast low byte value in k1 to four locations in ymm1.
EVEX.512.F3.0F38.W1 2A /r VPBROADCASTMB2Q zmm1, k1 | RM | V/V | AVX512CD OR AVX10.1¹ | Broadcast low byte value in k1 to eight locations in zmm1.
EVEX.128.F3.0F38.W0 3A /r VPBROADCASTMW2D xmm1, k1 | RM | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Broadcast low word value in k1 to four locations in xmm1.
EVEX.256.F3.0F38.W0 3A /r VPBROADCASTMW2D ymm1, k1 | RM | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Broadcast low word value in k1 to eight locations in ymm1.
EVEX.512.F3.0F38.W0 3A /r VPBROADCASTMW2D zmm1, k1 | RM | V/V | AVX512CD OR AVX10.1¹ | Broadcast low word value in k1 to sixteen locations in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Broadcasts the zero-extended 64/32 bit value of the low byte/word of the source operand (the second operand) to
each 64/32 bit element of the destination operand (the first operand). The source operand is an opmask register.
The destination operand is a ZMM register (EVEX.512), YMM register (EVEX.256), or XMM register (EVEX.128).
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Operation
VPBROADCASTMB2Q
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
DEST[i+63:i] := ZeroExtend(SRC[7:0])
ENDFOR
DEST[MAXVL-1:VL] := 0



VPBROADCASTMW2D
(KL, VL) = (4, 128), (8, 256),(16, 512)
FOR j := 0 TO KL-1
i := j*32
DEST[i+31:i] := ZeroExtend(SRC[15:0])
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

VPBROADCASTMB2Q __m512i _mm512_broadcastmb_epi64( __mmask8);


VPBROADCASTMW2D __m512i _mm512_broadcastmw_epi32( __mmask16);
VPBROADCASTMB2Q __m256i _mm256_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m256i _mm256_broadcastmw_epi32( __mmask8);
VPBROADCASTMB2Q __m128i _mm_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m128i _mm_broadcastmw_epi32( __mmask8);
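As an illustration beyond the manual text, the sketch below pairs the 512-bit intrinsic with a scalar reference model of VPBROADCASTMB2Q that mirrors the Operation section; the helper names are chosen here for exposition.

#include <immintrin.h>
#include <stdint.h>

/* Scalar reference model: each 64-bit lane receives ZeroExtend(SRC[7:0]). */
static void broadcastmb2q_ref(uint64_t dst[8], uint8_t mask_low_byte)
{
    for (int j = 0; j < 8; j++)
        dst[j] = (uint64_t)mask_low_byte;
}

/* Intrinsic form; requires AVX512CD (or AVX10.1). */
__m512i broadcast_mask_bits(__mmask8 k)
{
    return _mm512_broadcastmb_epi64(k);      /* VPBROADCASTMB2Q zmm1, k1 */
}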

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Table 2-56, “Type E6NF Class Exception Conditions.”



VPCMPB/VPCMPUB—Compare Packed Byte Values Into Mask
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed signed byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed signed byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW OR AVX10.1¹ | Compare packed signed byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.128.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed unsigned byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed unsigned byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW OR AVX10.1¹ | Compare packed unsigned byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed byte values in the second source operand and the first source operand and
returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands.
The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPB performs a comparison between pairs of signed byte values.
VPCMPUB performs a comparison between pairs of unsigned byte values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask
register k1. Up to 64/32/16 comparisons are performed with results written to the destination operand under the
writemask k2.



The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.

Table 5-19. Pseudo-Op and VPCMP* Implementation

Pseudo-Op | PCMPM Implementation
VPCMPEQ* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 0
VPCMPLT* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 1
VPCMPLE* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 2
VPCMPNEQ* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 4
VPCMPNLT* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 5
VPCMPNLE* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 6
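To make the immediate encoding concrete (an editorial illustration, not SDM text), the two calls below request the same VPCMPB operation: one passes predicate 1 explicitly via the _MM_CMPINT_LT constant from immintrin.h, the other uses the compiler's pseudo-op intrinsic.

#include <immintrin.h>

/* Minimal sketch; requires AVX512BW (or AVX10.1). Each byte compare
   produces one bit in the 64-bit result mask. */
__mmask64 bytes_less_than(__m512i a, __m512i b)
{
    __mmask64 k_imm = _mm512_cmp_epi8_mask(a, b, _MM_CMPINT_LT); /* imm8 = 1 */
    __mmask64 k_pse = _mm512_cmplt_epi8_mask(a, b);              /* pseudo-op form */
    return k_imm & k_pse;   /* identical masks; combined only to use both */
}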

Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;

VPCMPB (EVEX encoded versions)


(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+7:i] OP SRC2[i+7:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0



VPCMPUB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+7:i] OP SRC2[i+7:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPCMPB __mmask64 _mm512_cmp_epi8_mask( __m512i a, __m512i b, int cmp);
VPCMPB __mmask64 _mm512_mask_cmp_epi8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPB __mmask32 _mm256_cmp_epi8_mask( __m256i a, __m256i b, int cmp);
VPCMPB __mmask32 _mm256_mask_cmp_epi8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPB __mmask16 _mm_cmp_epi8_mask( __m128i a, __m128i b, int cmp);
VPCMPB __mmask16 _mm_mask_cmp_epi8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m512i a, __m512i b);
VPCMPB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m256i a, __m256i b);
VPCMPB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m128i a, __m128i b);
VPCMPB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask16 m, __m128i a, __m128i b);
VPCMPUB __mmask64 _mm512_cmp_epu8_mask( __m512i a, __m512i b, int cmp);
VPCMPUB __mmask64 _mm512_mask_cmp_epu8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPUB __mmask32 _mm256_cmp_epu8_mask( __m256i a, __m256i b, int cmp);
VPCMPUB __mmask32 _mm256_mask_cmp_epu8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPUB __mmask16 _mm_cmp_epu8_mask( __m128i a, __m128i b, int cmp);
VPCMPUB __mmask16 _mm_mask_cmp_epu8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPUB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m512i a, __m512i b);
VPCMPUB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPUB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m256i a, __m256i b);
VPCMPUB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPUB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m128i a, __m128i b);
VPCMPUB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask16 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



VPCMPD/VPCMPUD—Compare Packed Integer Values Into Mask
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed signed doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed signed doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F OR AVX10.1¹ | Compare packed signed doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2.
EVEX.128.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed unsigned doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed unsigned doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F OR AVX10.1¹ | Compare packed unsigned doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed integer values in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPD/VPCMPUD performs a comparison between pairs of signed/unsigned doubleword integer values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 32-bit
memory location. The destination operand (first operand) is a mask register k1. Up to 16/8/4 comparisons are
performed with results written to the destination operand under the writemask k2.



The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.

Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;

VPCMPD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+31:i] OP SRC2[31:0];
ELSE CMP := SRC1[i+31:i] OP SRC2[i+31:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPCMPUD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+31:i] OP SRC2[31:0];
ELSE CMP := SRC1[i+31:i] OP SRC2[i+31:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPCMPD __mmask16 _mm512_cmp_epi32_mask( __m512i a, __m512i b, int imm);
VPCMPD __mmask16 _mm512_mask_cmp_epi32_mask(__mmask16 k, __m512i a, __m512i b, int imm);
VPCMPD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m512i a, __m512i b);
VPCMPD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPUD __mmask16 _mm512_cmp_epu32_mask( __m512i a, __m512i b, int imm);
VPCMPUD __mmask16 _mm512_mask_cmp_epu32_mask(__mmask16 k, __m512i a, __m512i b, int imm);
VPCMPUD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m512i a, __m512i b);
VPCMPUD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPD __mmask8 _mm256_cmp_epi32_mask( __m256i a, __m256i b, int imm);
VPCMPD __mmask8 _mm256_mask_cmp_epi32_mask(__mmask8 k, __m256i a, __m256i b, int imm);
VPCMPD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m256i a, __m256i b);
VPCMPD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPUD __mmask8 _mm256_cmp_epu32_mask( __m256i a, __m256i b, int imm);
VPCMPUD __mmask8 _mm256_mask_cmp_epu32_mask(__mmask8 k, __m256i a, __m256i b, int imm);
VPCMPUD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m256i a, __m256i b);
VPCMPUD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPD __mmask8 _mm_cmp_epi32_mask( __m128i a, __m128i b, int imm);
VPCMPD __mmask8 _mm_mask_cmp_epi32_mask(__mmask8 k, __m128i a, __m128i b, int imm);
VPCMPD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m128i a, __m128i b);
VPCMPD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPUD __mmask8 _mm_cmp_epu32_mask( __m128i a, __m128i b, int imm);
VPCMPUD __mmask8 _mm_mask_cmp_epu32_mask(__mmask8 k, __m128i a, __m128i b, int imm);
VPCMPUD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m128i a, __m128i b);
VPCMPUD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask8 k, __m128i a, __m128i b);
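As an illustration of the m32bcst form (editorial, not from the manual), comparing a vector against a single scalar is the typical use of embedded broadcast; compilers commonly fold the set1 below into a dword-broadcast memory operand with EVEX.b set.

#include <immintrin.h>

/* Minimal sketch; requires AVX512F (or AVX10.1). */
__mmask16 dwords_below_limit(__m512i v, int limit)
{
    return _mm512_cmplt_epi32_mask(v, _mm512_set1_epi32(limit));
}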

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPCMPQ/VPCMPUQ—Compare Packed Integer Values Into Mask
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed signed quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed signed quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F OR AVX10.1¹ | Compare packed signed quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.128.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed unsigned quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compare packed unsigned quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F OR AVX10.1¹ | Compare packed unsigned quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed integer values in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPQ/VPCMPUQ performs a comparison between pairs of signed/unsigned quadword integer values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 64-bit
memory location. The destination operand (first operand) is a mask register k1. Up to 8/4/2 comparisons are
performed with results written to the destination operand under the writemask k2.
The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.



Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;

VPCMPQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+63:i] OP SRC2[63:0];
ELSE CMP := SRC1[i+63:i] OP SRC2[i+63:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPCMPUQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k2[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN CMP := SRC1[i+63:i] OP SRC2[63:0];
ELSE CMP := SRC1[i+63:i] OP SRC2[i+63:i];
FI;
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPCMPQ __mmask8 _mm512_cmp_epi64_mask( __m512i a, __m512i b, int imm);
VPCMPQ __mmask8 _mm512_mask_cmp_epi64_mask(__mmask8 k, __m512i a, __m512i b, int imm);
VPCMPQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m512i a, __m512i b);
VPCMPQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPUQ __mmask8 _mm512_cmp_epu64_mask( __m512i a, __m512i b, int imm);
VPCMPUQ __mmask8 _mm512_mask_cmp_epu64_mask(__mmask8 k, __m512i a, __m512i b, int imm);
VPCMPUQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m512i a, __m512i b);
VPCMPUQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPQ __mmask8 _mm256_cmp_epi64_mask( __m256i a, __m256i b, int imm);
VPCMPQ __mmask8 _mm256_mask_cmp_epi64_mask(__mmask8 k, __m256i a, __m256i b, int imm);
VPCMPQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m256i a, __m256i b);
VPCMPQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPUQ __mmask8 _mm256_cmp_epu64_mask( __m256i a, __m256i b, int imm);
VPCMPUQ __mmask8 _mm256_mask_cmp_epu64_mask(__mmask8 k, __m256i a, __m256i b, int imm);
VPCMPUQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m256i a, __m256i b);
VPCMPUQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPQ __mmask8 _mm_cmp_epi64_mask( __m128i a, __m128i b, int imm);
VPCMPQ __mmask8 _mm_mask_cmp_epi64_mask(__mmask8 k, __m128i a, __m128i b, int imm);
VPCMPQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m128i a, __m128i b);
VPCMPQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPUQ __mmask8 _mm_cmp_epu64_mask( __m128i a, __m128i b, int imm);
VPCMPUQ __mmask8 _mm_mask_cmp_epu64_mask(__mmask8 k, __m128i a, __m128i b, int imm);
VPCMPUQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m128i a, __m128i b);
VPCMPUQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m128i a, __m128i b);
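The 3-bit predicate CASE in the Operation section can be modeled in plain C as follows (an editorial reference sketch for one signed-qword element; the function name is illustrative only):

#include <stdbool.h>
#include <stdint.h>

static bool vpcmpq_element(int64_t a, int64_t b, unsigned imm3)
{
    switch (imm3 & 7) {
    case 0:  return a == b;   /* EQ    */
    case 1:  return a <  b;   /* LT    */
    case 2:  return a <= b;   /* LE    */
    case 3:  return false;    /* FALSE */
    case 4:  return a != b;   /* NEQ   */
    case 5:  return a >= b;   /* NLT   */
    case 6:  return a >  b;   /* NLE   */
    default: return true;     /* TRUE  */
    }
}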

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPCMPW/VPCMPUW—Compare Packed Word Values Into Mask
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed signed word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed signed word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW OR AVX10.1¹ | Compare packed signed word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.128.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed unsigned word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.256.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Compare packed unsigned word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.512.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW OR AVX10.1¹ | Compare packed unsigned word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
Performs a SIMD compare of the packed integer word in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPW performs a comparison between pairs of signed word values.
VPCMPUW performs a comparison between pairs of unsigned word values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask
register k1. Up to 32/16/8 comparisons are performed with results written to the destination operand under the
writemask k2.
The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.



Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;

VPCMPW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+15:i] OP SRC2[i+15:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPCMPUW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k2[j] OR *no writemask*
THEN
CMP := SRC1[i+15:i] OP SRC2[i+15:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPCMPW __mmask32 _mm512_cmp_epi16_mask( __m512i a, __m512i b, int cmp);
VPCMPW __mmask32 _mm512_mask_cmp_epi16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPW __mmask16 _mm256_cmp_epi16_mask( __m256i a, __m256i b, int cmp);
VPCMPW __mmask16 _mm256_mask_cmp_epi16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPW __mmask8 _mm_cmp_epi16_mask( __m128i a, __m128i b, int cmp);
VPCMPW __mmask8 _mm_mask_cmp_epi16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m512i a, __m512i b);
VPCMPW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m256i a, __m256i b);
VPCMPW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m128i a, __m128i b);
VPCMPW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask8 m, __m128i a, __m128i b);
VPCMPUW __mmask32 _mm512_cmp_epu16_mask( __m512i a, __m512i b, int cmp);
VPCMPUW __mmask32 _mm512_mask_cmp_epu16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPUW __mmask16 _mm256_cmp_epu16_mask( __m256i a, __m256i b, int cmp);
VPCMPUW __mmask16 _mm256_mask_cmp_epu16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPUW __mmask8 _mm_cmp_epu16_mask( __m128i a, __m128i b, int cmp);
VPCMPUW __mmask8 _mm_mask_cmp_epu16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPUW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m512i a, __m512i b);
VPCMPUW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPUW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m256i a, __m256i b);
VPCMPUW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPUW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m128i a, __m128i b);
VPCMPUW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask8 m, __m128i a, __m128i b);
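As an illustration of the writemask operand k2 (editorial, not SDM text): a compare performed under a mask evaluates only the elements whose k2 bit is set and zeroes the remaining bits of k1, which makes predicate chaining cheap.

#include <immintrin.h>

/* Minimal sketch of a range test lo <= x < hi on packed signed words.
   Requires AVX512BW (or AVX10.1). */
__mmask32 words_in_range(__m512i x, __m512i lo, __m512i hi)
{
    __mmask32 k2 = _mm512_cmple_epi16_mask(lo, x);   /* lo <= x          */
    return _mm512_mask_cmplt_epi16_mask(k2, x, hi);  /* x < hi, under k2 */
}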

SIMD Floating-Point Exceptions


None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



VPCOMPRESSB/VPCOMPRESSW—Store Sparse Packed Byte/Word Integer Values Into Dense Memory/Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 63 /r VPCOMPRESSB m128{k1}, xmm1 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 128 bits of packed byte values from xmm1 to m128 with writemask k1.
EVEX.128.66.0F38.W0 63 /r VPCOMPRESSB xmm1{k1}{z}, xmm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 128 bits of packed byte values from xmm2 to xmm1 with writemask k1.
EVEX.256.66.0F38.W0 63 /r VPCOMPRESSB m256{k1}, ymm1 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 256 bits of packed byte values from ymm1 to m256 with writemask k1.
EVEX.256.66.0F38.W0 63 /r VPCOMPRESSB ymm1{k1}{z}, ymm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 256 bits of packed byte values from ymm2 to ymm1 with writemask k1.
EVEX.512.66.0F38.W0 63 /r VPCOMPRESSB m512{k1}, zmm1 | A | V/V | AVX512_VBMI2 OR AVX10.1¹ | Compress up to 512 bits of packed byte values from zmm1 to m512 with writemask k1.
EVEX.512.66.0F38.W0 63 /r VPCOMPRESSB zmm1{k1}{z}, zmm2 | B | V/V | AVX512_VBMI2 OR AVX10.1¹ | Compress up to 512 bits of packed byte values from zmm2 to zmm1 with writemask k1.
EVEX.128.66.0F38.W1 63 /r VPCOMPRESSW m128{k1}, xmm1 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 128 bits of packed word values from xmm1 to m128 with writemask k1.
EVEX.128.66.0F38.W1 63 /r VPCOMPRESSW xmm1{k1}{z}, xmm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 128 bits of packed word values from xmm2 to xmm1 with writemask k1.
EVEX.256.66.0F38.W1 63 /r VPCOMPRESSW m256{k1}, ymm1 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 256 bits of packed word values from ymm1 to m256 with writemask k1.
EVEX.256.66.0F38.W1 63 /r VPCOMPRESSW ymm1{k1}{z}, ymm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1¹ | Compress up to 256 bits of packed word values from ymm2 to ymm1 with writemask k1.
EVEX.512.66.0F38.W1 63 /r VPCOMPRESSW m512{k1}, zmm1 | A | V/V | AVX512_VBMI2 OR AVX10.1¹ | Compress up to 512 bits of packed word values from zmm1 to m512 with writemask k1.
EVEX.512.66.0F38.W1 63 /r VPCOMPRESSW zmm1{k1}{z}, zmm2 | B | V/V | AVX512_VBMI2 OR AVX10.1¹ | Compress up to 512 bits of packed word values from zmm2 to zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding
Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Compresses (stores) up to 64 byte values or 32 word values from the source operand (second operand) to the destination operand (first operand), based on the active elements determined by the writemask operand. Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Moves up to 512 bits of packed byte/word values from the source operand (second operand) to the destination operand (first operand). This instruction is used to store the partial contents of a vector register into a contiguous vector or a single memory location, using the active elements selected by the writemask operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set; otherwise, the upper bits are zeroed.
This instruction supports memory fault suppression.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.
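As a usage sketch (editorial, not SDM text), compress-store is the standard left-packing primitive. The example assumes AVX512_VBMI2 plus AVX512BW (or AVX10.1) and packs the non-zero bytes of one 64-byte block to the front of the output.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t pack_nonzero_bytes(uint8_t *dst, const uint8_t *src64)
{
    __m512i v = _mm512_loadu_si512((const void *)src64);
    __mmask64 k = _mm512_test_epi8_mask(v, v);     /* 1 where byte != 0     */
    _mm512_mask_compressstoreu_epi8(dst, k, v);    /* VPCOMPRESSB to memory */
    return (size_t)_mm_popcnt_u64((uint64_t)k);    /* bytes written         */
}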

Operation
VPCOMPRESSB store form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.byte[k] := SRC.byte[j]
        k := k + 1

VPCOMPRESSB reg-reg form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.byte[k] := SRC.byte[j]
        k := k + 1
IF *merging-masking*:
    *DEST[VL-1:k*8] remains unchanged*
ELSE DEST[VL-1:k*8] := 0
DEST[MAX_VL-1:VL] := 0

VPCOMPRESSW store form
(KL, VL) = (8, 128), (16, 256), (32, 512)
k := 0
FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.word[k] := SRC.word[j]
        k := k + 1

VPCOMPRESSW reg-reg form
(KL, VL) = (8, 128), (16, 256), (32, 512)
k := 0
FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.word[k] := SRC.word[j]
        k := k + 1
IF *merging-masking*:
    *DEST[VL-1:k*16] remains unchanged*
ELSE DEST[VL-1:k*16] := 0
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPCOMPRESSB __m128i _mm_mask_compress_epi8(__m128i, __mmask16, __m128i);
VPCOMPRESSB __m128i _mm_maskz_compress_epi8(__mmask16, __m128i);
VPCOMPRESSB __m256i _mm256_mask_compress_epi8(__m256i, __mmask32, __m256i);
VPCOMPRESSB __m256i _mm256_maskz_compress_epi8(__mmask32, __m256i);
VPCOMPRESSB __m512i _mm512_mask_compress_epi8(__m512i, __mmask64, __m512i);
VPCOMPRESSB __m512i _mm512_maskz_compress_epi8(__mmask64, __m512i);
VPCOMPRESSB void _mm_mask_compressstoreu_epi8(void*, __mmask16, __m128i);
VPCOMPRESSB void _mm256_mask_compressstoreu_epi8(void*, __mmask32, __m256i);
VPCOMPRESSB void _mm512_mask_compressstoreu_epi8(void*, __mmask64, __m512i);
VPCOMPRESSW __m128i _mm_mask_compress_epi16(__m128i, __mmask8, __m128i);
VPCOMPRESSW __m128i _mm_maskz_compress_epi16(__mmask8, __m128i);
VPCOMPRESSW __m256i _mm256_mask_compress_epi16(__m256i, __mmask16, __m256i);
VPCOMPRESSW __m256i _mm256_maskz_compress_epi16(__mmask16, __m256i);
VPCOMPRESSW __m512i _mm512_mask_compress_epi16(__m512i, __mmask32, __m512i);
VPCOMPRESSW __m512i _mm512_maskz_compress_epi16(__mmask32, __m512i);
VPCOMPRESSW void _mm_mask_compressstoreu_epi16(void*, __mmask8, __m128i);
VPCOMPRESSW void _mm256_mask_compressstoreu_epi16(void*, __mmask16, __m256i);
VPCOMPRESSW void _mm512_mask_compressstoreu_epi16(void*, __mmask32, __m512i);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values Into Dense Memory/Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 8B /r VPCOMPRESSD xmm1/m128 {k1}{z}, xmm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compress packed doubleword integer values from xmm2 to xmm1/m128 using control mask k1.
EVEX.256.66.0F38.W0 8B /r VPCOMPRESSD ymm1/m256 {k1}{z}, ymm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compress packed doubleword integer values from ymm2 to ymm1/m256 using control mask k1.
EVEX.512.66.0F38.W0 8B /r VPCOMPRESSD zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F OR AVX10.1¹ | Compress packed doubleword integer values from zmm2 to zmm1/m512 using control mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Compresses (stores) up to 16/8/4 doubleword integer values from the source operand (second operand) to the destination operand (first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 16 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.

Operation
VPCOMPRESSD (EVEX encoded versions) store form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR;

VPCOMPRESSD (EVEX encoded versions) reg-reg form


(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR
IF *merging-masking*
THEN *DEST[VL-1:k] remains unchanged*
ELSE DEST[VL-1:k] := 0
FI
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPCOMPRESSD __m512i _mm512_mask_compress_epi32(__m512i s, __mmask16 c, __m512i a);
VPCOMPRESSD __m512i _mm512_maskz_compress_epi32( __mmask16 c, __m512i a);
VPCOMPRESSD void _mm512_mask_compressstoreu_epi32(void * a, __mmask16 c, __m512i s);
VPCOMPRESSD __m256i _mm256_mask_compress_epi32(__m256i s, __mmask8 c, __m256i a);
VPCOMPRESSD __m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
VPCOMPRESSD void _mm256_mask_compressstoreu_epi32(void * a, __mmask8 c, __m256i s);
VPCOMPRESSD __m128i _mm_mask_compress_epi32(__m128i s, __mmask8 c, __m128i a);
VPCOMPRESSD __m128i _mm_maskz_compress_epi32( __mmask8 c, __m128i a);
VPCOMPRESSD void _mm_mask_compressstoreu_epi32(void * a, __mmask8 c, __m128i s);
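A common application (editorial illustration, not from the manual) is stream compaction: the dwords that satisfy a predicate are packed and stored, and the mask popcount advances the output pointer. The sketch assumes AVX512F (or AVX10.1) and n a multiple of 16.

#include <immintrin.h>
#include <stddef.h>

size_t keep_positive_dwords(int *out, const int *in, size_t n)
{
    int *p = out;
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512((const void *)(in + i));
        __mmask16 k = _mm512_cmpgt_epi32_mask(v, _mm512_setzero_si512());
        _mm512_mask_compressstoreu_epi32(p, k, v);  /* VPCOMPRESSD to memory */
        p += _mm_popcnt_u32((unsigned)k);
    }
    return (size_t)(p - out);
}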

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

VPCOMPRESSQ—Store Sparse Packed Quadword Integer Values Into Dense Memory/Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 8B /r VPCOMPRESSQ xmm1/m128 {k1}{z}, xmm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compress packed quadword integer values from xmm2 to xmm1/m128 using control mask k1.
EVEX.256.66.0F38.W1 8B /r VPCOMPRESSQ ymm1/m256 {k1}{z}, ymm2 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Compress packed quadword integer values from ymm2 to ymm1/m256 using control mask k1.
EVEX.512.66.0F38.W1 8B /r VPCOMPRESSQ zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F OR AVX10.1¹ | Compress packed quadword integer values from zmm2 to zmm1/m512 using control mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
Compress (store) up to 8/4/2 quadword integer values from the source operand (second operand) to the destination
operand (first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.

Operation
VPCOMPRESSQ (EVEX encoded versions) store form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR;

VPCOMPRESSQ (EVEX encoded versions) reg-reg form


(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR
IF *merging-masking*
THEN *DEST[VL-1:k] remains unchanged*
ELSE DEST[VL-1:k] := 0
FI
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPCOMPRESSQ __m512i _mm512_mask_compress_epi64(__m512i s, __mmask8 c, __m512i a);
VPCOMPRESSQ __m512i _mm512_maskz_compress_epi64( __mmask8 c, __m512i a);
VPCOMPRESSQ void _mm512_mask_compressstoreu_epi64(void * a, __mmask8 c, __m512i s);
VPCOMPRESSQ __m256i _mm256_mask_compress_epi64(__m256i s, __mmask8 c, __m256i a);
VPCOMPRESSQ __m256i _mm256_maskz_compress_epi64( __mmask8 c, __m256i a);
VPCOMPRESSQ void _mm256_mask_compressstoreu_epi64(void * a, __mmask8 c, __m256i s);
VPCOMPRESSQ __m128i _mm_mask_compress_epi64(__m128i s, __mmask8 c, __m128i a);
VPCOMPRESSQ __m128i _mm_maskz_compress_epi64( __mmask8 c, __m128i a);
VPCOMPRESSQ void _mm_mask_compressstoreu_epi64(void * a, __mmask8 c, __m128i s);
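A minimal sketch of the memory-destination (store) form, not SDM text: only the compressed elements are written, so this can append a variable number of qwords to a buffer. It assumes <immintrin.h>, AVX512F, and a GCC/Clang-style __builtin_popcount; the helper name is hypothetical.

#include <immintrin.h>

/* Append the qword elements of v selected by mask to out, returning
   the number of elements stored (the popcount of the mask). */
int append_epi64(long long *out, __mmask8 mask, __m512i v)
{
    _mm512_mask_compressstoreu_epi64(out, mask, v);
    return __builtin_popcount(mask);   /* GCC/Clang builtin (assumption) */
}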

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values Into Dense
Memory/ Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 C4 /r VPCONFLICTD xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Detect duplicate double-word values in xmm2/m128/m32bcst using writemask k1.
EVEX.256.66.0F38.W0 C4 /r VPCONFLICTD ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Detect duplicate double-word values in ymm2/m256/m32bcst using writemask k1.
EVEX.512.66.0F38.W0 C4 /r VPCONFLICTD zmm1 {k1}{z}, zmm2/m512/m32bcst | A | V/V | AVX512CD OR AVX10.1¹ | Detect duplicate double-word values in zmm2/m512/m32bcst using writemask k1.
EVEX.128.66.0F38.W1 C4 /r VPCONFLICTQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Detect duplicate quad-word values in xmm2/m128/m64bcst using writemask k1.
EVEX.256.66.0F38.W1 C4 /r VPCONFLICTQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1¹ | Detect duplicate quad-word values in ymm2/m256/m64bcst using writemask k1.
EVEX.512.66.0F38.W1 C4 /r VPCONFLICTQ zmm1 {k1}{z}, zmm2/m512/m64bcst | A | V/V | AVX512CD OR AVX10.1¹ | Detect duplicate quad-word values in zmm2/m512/m64bcst using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Test each dword/qword element of the source operand (the second operand) for equality with all other elements in
the source operand closer to the least significant element. Each element’s comparison results form a bit vector,
which is then zero extended and written to the destination according to the writemask.
EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated
using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated
using writemask k1.
EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated
using writemask k1.
EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.

Operation
VPCONFLICTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j*32
IF MaskBit(j) OR *no writemask* THEN
FOR k := 0 TO j-1
m := k*32
IF ((SRC[i+31:i] = SRC[m+31:m])) THEN
DEST[i+k] := 1
ELSE
DEST[i+k] := 0
FI
ENDFOR
DEST[i+31:i+j] := 0
ELSE
IF *merging-masking* THEN
*DEST[i+31:i] remains unchanged*
ELSE
DEST[i+31:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0

VPCONFLICTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask* THEN
FOR k := 0 TO j-1
m := k*64
IF ((SRC[i+63:i] = SRC[m+63:m])) THEN
DEST[i+k] := 1
ELSE
DEST[i+k] := 0
FI
ENDFOR
DEST[i+63:i+j] := 0
ELSE
IF *merging-masking* THEN
*DEST[i+63:i] remains unchanged*
ELSE
DEST[i+63:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

VPCONFLICTD __m512i _mm512_conflict_epi32( __m512i a);


VPCONFLICTD __m512i _mm512_mask_conflict_epi32(__m512i s, __mmask16 m, __m512i a);
VPCONFLICTD __m512i _mm512_maskz_conflict_epi32(__mmask16 m, __m512i a);
VPCONFLICTQ __m512i _mm512_conflict_epi64( __m512i a);
VPCONFLICTQ __m512i _mm512_mask_conflict_epi64(__m512i s, __mmask8 m, __m512i a);
VPCONFLICTQ __m512i _mm512_maskz_conflict_epi64(__mmask8 m, __m512i a);
VPCONFLICTD __m256i _mm256_conflict_epi32( __m256i a);
VPCONFLICTD __m256i _mm256_mask_conflict_epi32(__m256i s, __mmask8 m, __m256i a);
VPCONFLICTD __m256i _mm256_maskz_conflict_epi32(__mmask8 m, __m256i a);
VPCONFLICTQ __m256i _mm256_conflict_epi64( __m256i a);
VPCONFLICTQ __m256i _mm256_mask_conflict_epi64(__m256i s, __mmask8 m, __m256i a);
VPCONFLICTQ __m256i _mm256_maskz_conflict_epi64(__mmask8 m, __m256i a);
VPCONFLICTD __m128i _mm_conflict_epi32( __m128i a);
VPCONFLICTD __m128i _mm_mask_conflict_epi32(__m128i s, __mmask8 m, __m128i a);
VPCONFLICTD __m128i _mm_maskz_conflict_epi32(__mmask8 m, __m128i a);
VPCONFLICTQ __m128i _mm_conflict_epi64( __m128i a);
VPCONFLICTQ __m128i _mm_mask_conflict_epi64(__m128i s, __mmask8 m, __m128i a);
VPCONFLICTQ __m128i _mm_maskz_conflict_epi64(__mmask8 m, __m128i a);
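For illustration only (not SDM text): a common use of VPCONFLICTD is detecting lanes that collide with an earlier lane, for example before a scatter. The sketch below additionally uses _mm512_test_epi32_mask (VPTESTMD, documented elsewhere in this volume); it assumes <immintrin.h> and AVX512CD plus AVX512F support.

#include <immintrin.h>

/* Returns a mask with bit j set when idx.dword[j] equals some
   lower-indexed element, i.e., when lane j conflicts with an earlier lane. */
__mmask16 conflicting_lanes(__m512i idx)
{
    __m512i c = _mm512_conflict_epi32(idx);  /* per-lane conflict bit vectors */
    return _mm512_test_epi32_mask(c, c);     /* nonzero bit vector => conflict */
}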

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”

VPDPBUSD—Multiply and Add Unsigned and Signed Bytes
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 50 /r VPDPBUSD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.66.0F38.W0 50 /r VPDPBUSD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1.
EVEX.128.66.0F38.W0 50 /r VPDPBUSD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in xmm3/m128/m32bcst with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 50 /r VPDPBUSD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in ymm3/m256/m32bcst with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 50 /r VPDPBUSD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512_VNNI OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in zmm3/m512/m32bcst with corresponding unsigned bytes of zmm2, summing those products and adding them to the doubleword result in zmm1 under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.
This instruction supports memory fault suppression.



Operation
VPDPBUSD dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32

ORIGDEST := DEST
FOR i := 0 TO KL-1:

// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND

p1word := src1extend(SRC1.byte[4*i+0]) * src2extend(SRC2.byte[4*i+0])


p2word := src1extend(SRC1.byte[4*i+1]) * src2extend(SRC2.byte[4*i+1])
p3word := src1extend(SRC1.byte[4*i+2]) * src2extend(SRC2.byte[4*i+2])
p4word := src1extend(SRC1.byte[4*i+3]) * src2extend(SRC2.byte[4*i+3])
DEST.dword[i] := ORIGDEST.dword[i] + p1word + p2word + p3word + p4word

DEST[MAX_VL-1:VL] := 0

VPDPBUSD dest, src1, src2 (EVEX encoded versions)


(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST := DEST
FOR i := 0 TO KL-1:
IF k1[i] or *no writemask*:
// Byte elements of SRC1 are zero-extended to 16b and
// byte elements of SRC2 are sign extended to 16b before multiplication.
IF SRC2 is memory and EVEX.b == 1:
t := SRC2.dword[0]
ELSE:
t := SRC2.dword[i]
p1word := ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
p2word := ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
p3word := ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
p4word := ZERO_EXTEND(SRC1.byte[4*i+3]) * SIGN_EXTEND(t.byte[3])
DEST.dword[i] := ORIGDEST.dword[i] + p1word + p2word + p3word + p4word
ELSE IF *zeroing*:
DEST.dword[i] := 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] := ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPDPBUSD __m128i _mm_dpbusd_avx_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_dpbusd_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_mask_dpbusd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSD __m128i _mm_maskz_dpbusd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSD __m256i _mm256_dpbusd_avx_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_dpbusd_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_mask_dpbusd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSD __m256i _mm256_maskz_dpbusd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSD __m512i _mm512_dpbusd_epi32(__m512i, __m512i, __m512i);
VPDPBUSD __m512i _mm512_mask_dpbusd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSD __m512i _mm512_maskz_dpbusd_epi32(__mmask16, __m512i, __m512i, __m512i);
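An illustrative sketch (not SDM text) of the common u8 x s8 dot-product kernel step, assuming <immintrin.h> and an AVX512_VNNI target; the helper name is hypothetical.

#include <immintrin.h>
#include <stdint.h>

/* acc.dword[i] += sum of four u8*s8 products; calling this in a loop
   accumulates a dot product of a uint8_t and an int8_t array,
   64 bytes per step. */
__m512i dpbusd_step(__m512i acc, const uint8_t *u, const int8_t *s)
{
    __m512i a = _mm512_loadu_si512(u);   /* unsigned bytes (first source) */
    __m512i b = _mm512_loadu_si512(s);   /* signed bytes (second source)  */
    return _mm512_dpbusd_epi32(acc, a, b);
}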



SIMD Floating-Point Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPDPBUSDS—Multiply and Add Unsigned and Signed Bytes With Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 51 /r VPDPBUSDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1, with signed saturation.
VEX.256.66.0F38.W0 51 /r VPDPBUSDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1, with signed saturation.
EVEX.128.66.0F38.W0 51 /r VPDPBUSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in xmm3/m128/m32bcst with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1, with signed saturation, under writemask k1.
EVEX.256.66.0F38.W0 51 /r VPDPBUSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in ymm3/m256/m32bcst with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1, with signed saturation, under writemask k1.
EVEX.512.66.0F38.W0 51 /r VPDPBUSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512_VNNI OR AVX10.1¹ | Multiply groups of 4 pairs of signed bytes in zmm3/m512/m32bcst with corresponding unsigned bytes of zmm2, summing those products and adding them to the doubleword result in zmm1, with signed saturation, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand. If the intermediate sum overflows a 32b signed number, the result
is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.

Operation
VPDPBUSDS dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32

ORIGDEST := DEST
FOR i := 0 TO KL-1:
// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND

p1word := src1extend(SRC1.byte[4*i+0]) * src2extend(SRC2.byte[4*i+0])


p2word := src1extend(SRC1.byte[4*i+1]) * src2extend(SRC2.byte[4*i+1])
p3word := src1extend(SRC1.byte[4*i+2]) * src2extend(SRC2.byte[4*i+2])
p4word := src1extend(SRC1.byte[4*i+3]) * src2extend(SRC2.byte[4*i+3])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1word + p2word + p3word + p4word)

DEST[MAX_VL-1:VL] := 0

VPDPBUSDS dest, src1, src2 (EVEX encoded versions)


(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST := DEST
FOR i := 0 TO KL-1:
IF k1[i] or *no writemask*:
// Byte elements of SRC1 are zero-extended to 16b and
// byte elements of SRC2 are sign extended to 16b before multiplication.
IF SRC2 is memory and EVEX.b == 1:
t := SRC2.dword[0]
ELSE:
t := SRC2.dword[i]
p1word := ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
p2word := ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
p3word := ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
p4word := ZERO_EXTEND(SRC1.byte[4*i+3]) * SIGN_EXTEND(t.byte[3])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1word + p2word + p3word + p4word)
ELSE IF *zeroing*:
DEST.dword[i] := 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] := ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPDPBUSDS __m128i _mm_dpbusds_avx_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_dpbusds_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_mask_dpbusds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSDS __m128i _mm_maskz_dpbusds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSDS __m256i _mm256_dpbusds_avx_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_dpbusds_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_mask_dpbusds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_maskz_dpbusds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSDS __m512i _mm512_dpbusds_epi32(__m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_mask_dpbusds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_maskz_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);
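A brief hedged sketch (not SDM text) contrasting this instruction with VPDPBUSD: the dataflow is identical, but the final dword accumulation saturates rather than wraps, which matters when the accumulator may approach INT32_MAX. Assumes <immintrin.h> and AVX512_VNNI; the helper name is hypothetical.

#include <immintrin.h>

/* Saturating variant of the VNNI accumulate step: each dword of acc is
   updated with signed saturation instead of two's-complement wraparound. */
__m512i dpbusds_step(__m512i acc, __m512i unsigned_bytes, __m512i signed_bytes)
{
    return _mm512_dpbusds_epi32(acc, unsigned_bytes, signed_bytes);
}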

SIMD Floating-Point Exceptions
None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

VPDPWSSD—Multiply and Add Signed Word Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 52 /r VPDPWSSD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI | Multiply groups of 2 pairs of signed words in xmm3/m128 with corresponding signed words of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.66.0F38.W0 52 /r VPDPWSSD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI | Multiply groups of 2 pairs of signed words in ymm3/m256 with corresponding signed words of ymm2, summing those products and adding them to the doubleword result in ymm1.
EVEX.128.66.0F38.W0 52 /r VPDPWSSD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in xmm3/m128/m32bcst with corresponding signed words of xmm2, summing those products and adding them to the doubleword result in xmm1, under writemask k1.
EVEX.256.66.0F38.W0 52 /r VPDPWSSD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in ymm3/m256/m32bcst with corresponding signed words of ymm2, summing those products and adding them to the doubleword result in ymm1, under writemask k1.
EVEX.512.66.0F38.W0 52 /r VPDPWSSD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512_VNNI OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in zmm3/m512/m32bcst with corresponding signed words of zmm2, summing those products and adding them to the doubleword result in zmm1, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.
This instruction supports memory fault suppression.



Operation
VPDPWSSD dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
p1dword := SIGN_EXTEND(SRC1.word[2*i+0]) * SIGN_EXTEND(SRC2.word[2*i+0] )
p2dword := SIGN_EXTEND(SRC1.word[2*i+1]) * SIGN_EXTEND(SRC2.word[2*i+1] )
DEST.dword[i] := ORIGDEST.dword[i] + p1dword + p2dword
DEST[MAX_VL-1:VL] := 0

VPDPWSSD dest, src1, src2 (EVEX encoded versions)


(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST := DEST
FOR i := 0 TO KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and EVEX.b == 1:
t := SRC2.dword[0]
ELSE:
t := SRC2.dword[i]
p1dword := SIGN_EXTEND(SRC1.word[2*i]) * SIGN_EXTEND(t.word[0])
p2dword := SIGN_EXTEND(SRC1.word[2*i+1]) * SIGN_EXTEND(t.word[1])
DEST.dword[i] := ORIGDEST.dword[i] + p1dword + p2dword
ELSE IF *zeroing*:
DEST.dword[i] := 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] := ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPDPWSSD __m128i _mm_dpwssd_avx_epi32(__m128i, __m128i, __m128i);
VPDPWSSD __m128i _mm_dpwssd_epi32(__m128i, __m128i, __m128i);
VPDPWSSD __m128i _mm_mask_dpwssd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPWSSD __m128i _mm_maskz_dpwssd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPWSSD __m256i _mm256_dpwssd_avx_epi32(__m256i, __m256i, __m256i);
VPDPWSSD __m256i _mm256_dpwssd_epi32(__m256i, __m256i, __m256i);
VPDPWSSD __m256i _mm256_mask_dpwssd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPWSSD __m256i _mm256_maskz_dpwssd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPWSSD __m512i _mm512_dpwssd_epi32(__m512i, __m512i, __m512i);
VPDPWSSD __m512i _mm512_mask_dpwssd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPWSSD __m512i _mm512_maskz_dpwssd_epi32(__mmask16, __m512i, __m512i, __m512i);
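For illustration (not SDM text), a 16-bit dot-product accumulation step, assuming <immintrin.h> and AVX512_VNNI; the helper name is hypothetical.

#include <immintrin.h>
#include <stdint.h>

/* acc.dword[i] += a.word[2i]*b.word[2i] + a.word[2i+1]*b.word[2i+1];
   looping over int16_t arrays 32 words at a time accumulates their
   dot product into 16 dword lanes. */
__m512i dpwssd_step(__m512i acc, const int16_t *pa, const int16_t *pb)
{
    __m512i a = _mm512_loadu_si512(pa);
    __m512i b = _mm512_loadu_si512(pb);
    return _mm512_dpwssd_epi32(acc, a, b);
}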

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPDPWSSDS—Multiply and Add Signed Word Integers With Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 53 /r VPDPWSSDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI | Multiply groups of 2 pairs of signed words in xmm3/m128 with corresponding signed words of xmm2, summing those products and adding them to the doubleword result in xmm1, with signed saturation.
VEX.256.66.0F38.W0 53 /r VPDPWSSDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI | Multiply groups of 2 pairs of signed words in ymm3/m256 with corresponding signed words of ymm2, summing those products and adding them to the doubleword result in ymm1, with signed saturation.
EVEX.128.66.0F38.W0 53 /r VPDPWSSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in xmm3/m128/m32bcst with corresponding signed words of xmm2, summing those products and adding them to the doubleword result in xmm1, with signed saturation, under writemask k1.
EVEX.256.66.0F38.W0 53 /r VPDPWSSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512_VNNI AND AVX512VL) OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in ymm3/m256/m32bcst with corresponding signed words of ymm2, summing those products and adding them to the doubleword result in ymm1, with signed saturation, under writemask k1.
EVEX.512.66.0F38.W0 53 /r VPDPWSSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512_VNNI OR AVX10.1¹ | Multiply groups of 2 pairs of signed words in zmm3/m512/m32bcst with corresponding signed words of zmm2, summing those products and adding them to the doubleword result in zmm1, with signed saturation, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand. If the intermediate sum overflows a 32b signed number, the
result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.

Operation
VPDPWSSDS dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
p1dword := SIGN_EXTEND(SRC1.word[2*i+0]) * SIGN_EXTEND(SRC2.word[2*i+0])
p2dword := SIGN_EXTEND(SRC1.word[2*i+1]) * SIGN_EXTEND(SRC2.word[2*i+1])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
DEST[MAX_VL-1:VL] := 0

VPDPWSSDS dest, src1, src2 (EVEX encoded versions)


(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST := DEST
FOR i := 0 TO KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and EVEX.b == 1:
t := SRC2.dword[0]
ELSE:
t := SRC2.dword[i]
p1dword := SIGN_EXTEND(SRC1.word[2*i]) * SIGN_EXTEND(t.word[0])
p2dword := SIGN_EXTEND(SRC1.word[2*i+1]) * SIGN_EXTEND(t.word[1])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
ELSE IF *zeroing*:
DEST.dword[i] := 0
ELSE: // Merge masking, dest element unchanged
DEST.dword[i] := ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPDPWSSDS __m128i _mm_dpwssds_avx_epi32(__m128i, __m128i, __m128i);
VPDPWSSDS __m128i _mm_dpwssds_epi32(__m128i, __m128i, __m128i);
VPDPWSSDS __m128i _mm_mask_dpwssds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPWSSDS __m128i _mm_maskz_dpwssds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPWSSDS __m256i _mm256_dpwssds_avx_epi32(__m256i, __m256i, __m256i);
VPDPWSSDS __m256i _mm256_dpwssds_epi32(__m256i, __m256i, __m256i);
VPDPWSSDS __m256i _mm256_mask_dpwssds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPWSSDS __m256i _mm256_maskz_dpwssds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPWSSDS __m512i _mm512_dpwssds_epi32(__m512i, __m512i, __m512i);
VPDPWSSDS __m512i _mm512_mask_dpwssds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPWSSDS __m512i _mm512_maskz_dpwssds_epi32(__mmask16, __m512i, __m512i, __m512i);
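As with VPDPBUSDS, a minimal hedged sketch (not SDM text): the word-pair dataflow is the same as VPDPWSSD, with signed saturation on the final dword accumulate. Assumes <immintrin.h> and AVX512_VNNI; the helper name is hypothetical.

#include <immintrin.h>

/* Saturating word-pair multiply-accumulate; see VPDPWSSD for the
   wrapping variant. */
__m512i dpwssds_step(__m512i acc, __m512i a_words, __m512i b_words)
{
    return _mm512_dpwssds_epi32(acc, a_words, b_words);
}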

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

VPERMB—Permute Packed Bytes Elements
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 8D /r VPERMB xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1¹ | Permute bytes in xmm3/m128 using byte indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 8D /r VPERMB ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1¹ | Permute bytes in ymm3/m256 using byte indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 8D /r VPERMB zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512_VBMI OR AVX10.1¹ | Permute bytes in zmm3/m512 using byte indexes in zmm2 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Copies bytes from the second source operand (the third operand) to the destination operand (the first operand)
according to the byte indices in the first source operand (the second operand). Note that this instruction permits a
byte in the source operand to be copied to more than one location in the destination operand.
Only the low 6 (EVEX.512), 5 (EVEX.256), or 4 (EVEX.128) bits of each byte index are used to select the location of
the source byte from the second source operand.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM
register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register updated at
byte granularity by the writemask k1.

Operation
VPERMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
n := 3;
ELSE IF VL = 256:
n := 4;
ELSE IF VL = 512:
n := 5;
FI;
FOR j := 0 TO KL-1:
id := SRC1[j*8 + n : j*8] ; // location of the source byte
IF k1[j] OR *no writemask* THEN
DEST[j*8 + 7: j*8] := SRC2[id*8 +7: id*8];
ELSE IF *zeroing-masking* THEN
DEST[j*8 + 7: j*8] := 0;
ELSE
*DEST[j*8 + 7: j*8] remains unchanged*
FI



ENDFOR
DEST[MAX_VL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VPERMB __m512i _mm512_permutexvar_epi8( __m512i idx, __m512i a);
VPERMB __m512i _mm512_mask_permutexvar_epi8(__m512i s, __mmask64 k, __m512i idx, __m512i a);
VPERMB __m512i _mm512_maskz_permutexvar_epi8( __mmask64 k, __m512i idx, __m512i a);
VPERMB __m256i _mm256_permutexvar_epi8( __m256i idx, __m256i a);
VPERMB __m256i _mm256_mask_permutexvar_epi8(__m256i s, __mmask32 k, __m256i idx, __m256i a);
VPERMB __m256i _mm256_maskz_permutexvar_epi8( __mmask32 k, __m256i idx, __m256i a);
VPERMB __m128i _mm_permutexvar_epi8( __m128i idx, __m128i a);
VPERMB __m128i _mm_mask_permutexvar_epi8(__m128i s, __mmask16 k, __m128i idx, __m128i a);
VPERMB __m128i _mm_maskz_permutexvar_epi8( __mmask16 k, __m128i idx, __m128i a);
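Illustrative sketch (not SDM text): VPERMB implements an arbitrary 64-entry byte table lookup in a single instruction. Assumes <immintrin.h> and AVX512_VBMI; the function name is hypothetical.

#include <immintrin.h>

/* result.byte[j] = table.byte[idx.byte[j] & 63] for all 64 lanes;
   only the low 6 bits of each index are used at VL=512. */
__m512i lookup64(__m512i idx, __m512i table)
{
    return _mm512_permutexvar_epi8(idx, table);
}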

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”



VPERMD/VPERMW—Permute Packed Doubleword/Word Elements
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.256.66.0F38.W0 36 /r VPERMD ymm1, ymm2, ymm3/m256 | A | V/V | AVX2 | Permute doublewords in ymm3/m256 using indices in ymm2 and store the result in ymm1.
EVEX.256.66.0F38.W0 36 /r VPERMD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute doublewords in ymm3/m256/m32bcst using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 36 /r VPERMD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1¹ | Permute doublewords in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 8D /r VPERMW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Permute word integers in xmm3/m128 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 8D /r VPERMW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Permute word integers in ymm3/m256 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 8D /r VPERMW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1¹ | Permute word integers in zmm3/m512 using indexes in zmm2 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Copies doublewords (or words) from the second source operand (the third operand) to the destination operand
(the first operand) according to the indices in the first source operand (the second operand). Note that this instruc-
tion permits a doubleword (word) in the source operand to be copied to more than one location in the destination
operand.
VEX.256 encoded VPERMD: The first and second operands are YMM registers, the third operand can be a YMM
register or memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded VPERMD: The first and second operands are ZMM/YMM registers, the third operand can be a
ZMM/YMM register, a 512/256-bit memory location or a 512/256-bit vector broadcasted from a 32-bit memory
location. The elements in the destination are updated using the writemask k1.
VPERMW: first and second operands are ZMM/YMM/XMM registers, the third operand can be a ZMM/YMM/XMM
register, or a 512/256/128-bit memory location. The destination is updated using the writemask k1.
EVEX.128 encoded versions: Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.



Operation
VPERMD (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
IF VL = 256 THEN n := 2; FI;
IF VL = 512 THEN n := 3; FI;
FOR j := 0 TO KL-1
i := j * 32
id := 32*SRC1[i+n:i]
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC2[31:0];
ELSE DEST[i+31:i] := SRC2[id+31:id];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMD (VEX.256 encoded version)


DEST[31:0] := (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];
DEST[63:32] := (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];
DEST[95:64] := (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0];
DEST[127:96] := (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0];
DEST[159:128] := (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0];
DEST[191:160] := (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0];
DEST[223:192] := (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0];
DEST[255:224] := (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0];
DEST[MAXVL-1:256] := 0

VPERMW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128 THEN n := 2; FI;
IF VL = 256 THEN n := 3; FI;
IF VL = 512 THEN n := 4; FI;
FOR j := 0 TO KL-1
i := j * 16
id := 16*SRC1[i+n:i]
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC2[id+15:id]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPERMD __m512i _mm512_permutexvar_epi32( __m512i idx, __m512i a);
VPERMD __m512i _mm512_mask_permutexvar_epi32(__m512i s, __mmask16 k, __m512i idx, __m512i a);
VPERMD __m512i _mm512_maskz_permutexvar_epi32( __mmask16 k, __m512i idx, __m512i a);
VPERMD __m256i _mm256_permutexvar_epi32( __m256i idx, __m256i a);
VPERMD __m256i _mm256_mask_permutexvar_epi32(__m256i s, __mmask8 k, __m256i idx, __m256i a);
VPERMD __m256i _mm256_maskz_permutexvar_epi32( __mmask8 k, __m256i idx, __m256i a);
VPERMW __m512i _mm512_permutexvar_epi16( __m512i idx, __m512i a);
VPERMW __m512i _mm512_mask_permutexvar_epi16(__m512i s, __mmask32 k, __m512i idx, __m512i a);
VPERMW __m512i _mm512_maskz_permutexvar_epi16( __mmask32 k, __m512i idx, __m512i a);
VPERMW __m256i _mm256_permutexvar_epi16( __m256i idx, __m256i a);
VPERMW __m256i _mm256_mask_permutexvar_epi16(__m256i s, __mmask16 k, __m256i idx, __m256i a);
VPERMW __m256i _mm256_maskz_permutexvar_epi16( __mmask16 k, __m256i idx, __m256i a);
VPERMW __m128i _mm_permutexvar_epi16( __m128i idx, __m128i a);
VPERMW __m128i _mm_mask_permutexvar_epi16(__m128i s, __mmask8 k, __m128i idx, __m128i a);
VPERMW __m128i _mm_maskz_permutexvar_epi16( __mmask8 k, __m128i idx, __m128i a);
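A small hedged example (not SDM text) of a cross-lane dword permute, assuming <immintrin.h> and AVX512F; the reversal pattern is just one possible index vector.

#include <immintrin.h>

/* Reverse the 16 dwords of v: result.dword[i] = v.dword[15-i].
   _mm512_set_epi32 lists elements from index 15 down to index 0. */
__m512i reverse_epi32(__m512i v)
{
    const __m512i rev = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
    return _mm512_permutexvar_epi32(rev, v);
}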

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPERMD, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPERMW, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
If EVEX.L’L = 0 for VPERMD.



VPERMI2B—Full Permute of Bytes From Two Tables Overwriting the Index
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 75 /r VPERMI2B xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1¹ | Permute bytes in xmm3/m128 and xmm2 using byte indexes in xmm1 and store the byte results in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 75 /r VPERMI2B ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1¹ | Permute bytes in ymm3/m256 and ymm2 using byte indexes in ymm1 and store the byte results in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 75 /r VPERMI2B zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512_VBMI OR AVX10.1¹ | Permute bytes in zmm3/m512 and zmm2 using byte indexes in zmm1 and store the byte results in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Permutes byte values in the second operand (the first source operand) and the third operand (the second source
operand) using the byte indices in the first operand (the destination operand) to select byte elements from the
second or third source operands. The selected byte elements are written to the destination at byte granularity
under the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select
elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the
result. The third operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In each index
byte, the id bit for table selection is bit 6/5/4, and bits [5:0]/[4:0]/[3:0] select an element within each input table.
Note that these instructions permit a byte value in the source operands to be copied to more than one location in
the destination operand. Also, the same tables can be reused in subsequent iterations, but the index elements are
overwritten.
Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.

Operation
VPERMI2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
id := 3;
ELSE IF VL = 256:
id := 4;
ELSE IF VL = 512:
id := 5;
FI;
TMP_DEST[VL-1:0] := DEST[VL-1:0];
FOR j := 0 TO KL-1
off := 8*TMP_DEST[j*8 + id: j*8];
IF k1[j] OR *no writemask*:
DEST[j*8 + 7: j*8] := TMP_DEST[j*8+id+1] ? SRC2[off+7:off] : SRC1[off+7:off];
ELSE IF *zeroing-masking*:
DEST[j*8 + 7: j*8] := 0;
ELSE:
*DEST[j*8 + 7: j*8] remains unchanged*
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VPERMI2B __m512i _mm512_permutex2var_epi8(__m512i a, __m512i idx, __m512i b);
VPERMI2B __m512i _mm512_mask2_permutex2var_epi8(__m512i a, __m512i idx, __mmask64 k, __m512i b);
VPERMI2B __m512i _mm512_maskz_permutex2var_epi8(__mmask64 k, __m512i a, __m512i idx, __m512i b);
VPERMI2B __m256i _mm256_permutex2var_epi8(__m256i a, __m256i idx, __m256i b);
VPERMI2B __m256i _mm256_mask2_permutex2var_epi8(__m256i a, __m256i idx, __mmask32 k, __m256i b);
VPERMI2B __m256i _mm256_maskz_permutex2var_epi8(__mmask32 k, __m256i a, __m256i idx, __m256i b);
VPERMI2B __m128i _mm_permutex2var_epi8(__m128i a, __m128i idx, __m128i b);
VPERMI2B __m128i _mm_mask2_permutex2var_epi8(__m128i a, __m128i idx, __mmask16 k, __m128i b);
VPERMI2B __m128i _mm_maskz_permutex2var_epi8(__mmask16 k, __m128i a, __m128i idx, __m128i b);
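Illustrative sketch (not SDM text): with two table registers, VPERMI2B provides a 128-entry byte lookup; at VL=512, bit 6 of each index selects between the two tables. Assumes <immintrin.h> and AVX512_VBMI; the function name is hypothetical.

#include <immintrin.h>

/* result.byte[j] = (idx.byte[j] & 64) ? hi.byte[idx.byte[j] & 63]
                                       : lo.byte[idx.byte[j] & 63].
   The instruction overwrites the index register; the intrinsic simply
   returns the permuted result. */
__m512i lookup128(__m512i idx, __m512i lo, __m512i hi)
{
    return _mm512_permutex2var_epi8(lo, idx, hi);
}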

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 75 /r VPERMI2W xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Permute word integers from two tables in xmm3/m128 and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 75 /r VPERMI2W ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1¹ | Permute word integers from two tables in ymm3/m256 and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 75 /r VPERMI2W zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512BW OR AVX10.1¹ | Permute word integers from two tables in zmm3/m512 and zmm2 using indexes in zmm1 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 76 /r VPERMI2D xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute double-words from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 76 /r VPERMI2D ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute double-words from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 76 /r VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1¹ | Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 76 /r VPERMI2Q xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute quad-words from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 76 /r VPERMI2Q ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute quad-words from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 76 /r VPERMI2Q zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512F OR AVX10.1¹ | Permute quad-words from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 77 /r VPERMI2PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute single-precision floating-point values from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 77 /r VPERMI2PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute single-precision floating-point values from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 77 /r VPERMI2PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1¹ | Permute single-precision floating-point values from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

EVEX.128.66.0F38.W1 77 /r VPERMI2PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute double precision floating-point values from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 77 /r VPERMI2PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Permute double precision floating-point values from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 77 /r VPERMI2PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512F OR AVX10.1¹ | Permute double precision floating-point values from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r,w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Permutes 16-bit/32-bit/64-bit values in the second operand (the first source operand) and the third operand (the
second source operand) using indices in the first operand to select elements from the second and third operands.
The selected elements are written to the destination operand (the first operand) according to the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select
elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the
result.
D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the
low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).
Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits
[3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands. If
the id bit is 0, table_1 (the first source) is selected; otherwise the second source operand is selected.
Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0]/bit 0 select an element within each
input table.
Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit
memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] select an element within each
input table.
Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than
one location in the destination operand. Note also that in this case, the same table can be reused for example for a
second iteration, while the index elements are overwritten.
Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.

Operation
VPERMI2W (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
id := 2
FI;
IF VL = 256
id := 3
FI;
IF VL = 512
id := 4
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 16
off := 16*TMP_DEST[i+id:i]
IF k1[j] OR *no writemask*
THEN
DEST[i+15:i] := TMP_DEST[i+id+1] ? SRC2[off+15:off]
: SRC1[off+15:off]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMI2D/VPERMI2PS (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
id := 1
FI;
IF VL = 256
id := 2
FI;
IF VL = 512
id := 3
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 32
off := 32*TMP_DEST[i+id:i]
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := TMP_DEST[i+id+1] ? SRC2[31:0]
: SRC1[off+31:off]
ELSE
DEST[i+31:i] := TMP_DEST[i+id+1] ? SRC2[off+31:off]
: SRC1[off+31:off]

FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMI2Q/VPERMI2PD (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
id := 0
FI;
IF VL = 256
id := 1
FI;
IF VL = 512
id := 2
FI;
TMP_DEST:= DEST
FOR j := 0 TO KL-1
i := j * 64
off := 64*TMP_DEST[i+id:i]
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := TMP_DEST[i+id+1] ? SRC2[63:0]
: SRC1[off+63:off]
ELSE
DEST[i+63:i] := TMP_DEST[i+id+1] ? SRC2[off+63:off]
: SRC1[off+63:off]
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

VPERMI2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b);


VPERMI2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k, __m512i idx, __m512i b);
VPERMI2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx, __mmask16 k, __m512i b);
VPERMI2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a, __m512i idx, __m512i b);
VPERMI2D __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMI2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMI2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMI2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMI2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b);
VPERMI2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b);
VPERMI2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b);
VPERMI2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b);
VPERMI2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMI2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b);
VPERMI2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b);
VPERMI2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b);
VPERMI2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMI2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b);
VPERMI2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b);
VPERMI2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b);
VPERMI2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMI2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b);
VPERMI2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b);
VPERMI2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b);
VPERMI2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMI2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b);
VPERMI2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b);
VPERMI2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b);
VPERMI2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b);
VPERMI2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b);
VPERMI2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMI2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMI2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMI2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b);

VPERMI2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMI2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMI2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);
VPERMI2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMI2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);
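
As a usage sketch (not part of the SDM text; the helper name is illustrative), the two-table form implements a single lookup into a 32-entry dword table held in two registers:

#include <immintrin.h>

/* Look up 16 dwords in a 32-entry table split across two ZMM
   registers. Bit 4 of each index selects the table: indices 0..15
   read from 'lo', 16..31 from 'hi'. Compiles to VPERMI2D/VPERMT2D
   depending on register allocation. */
__m512i lookup32_dword(__m512i idx, __m512i lo, __m512i hi)
{
    return _mm512_permutex2var_epi32(lo, idx, hi);
}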

SIMD Floating-Point Exceptions


None.

Other Exceptions
VPERMI2D/Q/PS/PD: See Table 2-52, “Type E4NF Class Exception Conditions.”
VPERMI2W: See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

VPERMILPD—Permute In-Lane of Pairs of Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.128.66.0F38.W0 0D /r VPERMILPD xmm1, xmm2, xmm3/m128 | A | V/V | AVX |
  Permute double precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
VEX.256.66.0F38.W0 0D /r VPERMILPD ymm1, ymm2, ymm3/m256 | A | V/V | AVX |
  Permute double precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.
EVEX.128.66.0F38.W1 0D /r VPERMILPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values in xmm2 using control from xmm3/m128/m64bcst and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 0D /r VPERMILPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values in ymm2 using control from ymm3/m256/m64bcst and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 0D /r VPERMILPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 [1] |
  Permute double precision floating-point values in zmm2 using control from zmm3/m512/m64bcst and store the result in zmm1 using writemask k1.
VEX.128.66.0F3A.W0 05 /r ib VPERMILPD xmm1, xmm2/m128, imm8 | B | V/V | AVX |
  Permute double precision floating-point values in xmm2/m128 using controls from imm8.
VEX.256.66.0F3A.W0 05 /r ib VPERMILPD ymm1, ymm2/m256, imm8 | B | V/V | AVX |
  Permute double precision floating-point values in ymm2/m256 using controls from imm8.
EVEX.128.66.0F3A.W1 05 /r ib VPERMILPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values in xmm2/m128/m64bcst using controls from imm8 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F3A.W1 05 /r ib VPERMILPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values in ymm2/m256/m64bcst using controls from imm8 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F3A.W1 05 /r ib VPERMILPD zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | D | V/V | AVX512F OR AVX10.1 [1] |
  Permute double precision floating-point values in zmm2/m512/m64bcst using controls from imm8 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
(Variable control version)
Permute pairs of double precision floating-point values in the first source operand (second operand), each using a
1-bit control field residing in the corresponding quadword element of the second source operand (third operand).
Permuted results are stored in the destination operand (first operand).
The control bit for each pair is located at bit 1 of the corresponding quadword element (see Figure 5-24); the
other bits of each element are ignored. Each control determines which of the source elements in an input pair is
selected for the destination element. Each pair of source elements must lie in the same 128-bit region as the
destination.
EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are
written to the destination under the writemask.

[Figure 5-23. VPERMILPD Operation: with SRC1 = {X3, X2, X1, X0}, each destination element in the low 128-bit lane selects from the pair {X1, X0}, and each element in the high lane selects from {X3, X2}.]

VEX.256 encoded version: Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.

[Figure 5-24. VPERMILPD Shuffle Control: the select bit ("sel") of control field j is bit 1 of quadword element j of the control operand (bit 1, bit 65, bit 129, bit 193, ...); all other bits of each element are ignored.]
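
A common pitfall with the variable form is that the selector is bit 1 of each control quadword, not bit 0, so logical indices must be shifted left by one. A minimal sketch (helper name illustrative):

#include <immintrin.h>

/* Select element i0 (0 or 1) into the low half and i1 into the high
   half. The indices are shifted to bit 1 because bit 0 of each
   control qword is ignored by VPERMILPD. */
__m128d pick_pd(__m128d v, long long i0, long long i1)
{
    __m128i ctrl = _mm_set_epi64x(i1 << 1, i0 << 1);  /* high, low */
    return _mm_permutevar_pd(v, ctrl);
}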

Immediate control version: Permute pairs of double precision floating-point values in the first source operand
(second operand), each pair using a 1-bit control field in the imm8 byte. Each element in the destination operand
(first operand) uses a separate control bit of the imm8 byte.



VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination
operand is a YMM/XMM register. The imm8 byte provides the lower 4/2 bits as permute control fields.
EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory
location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are written to
the destination under the writemask. The imm8 byte provides the lower 8/4/2 bits as permute control fields.
Note: For the imm8 versions, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction
will #UD.
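
For example, imm8 = 0101b swaps the two doubles within each 128-bit lane; a short sketch (helper name illustrative):

#include <immintrin.h>

/* Swap each pair of doubles: imm8 bit j selects the high (1) or
   low (0) element of the pair feeding destination element j. */
__m256d swap_pairs_pd(__m256d v)
{
    return _mm256_permute_pd(v, 0x5);  /* 0b0101 */
}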

Operation
VPERMILPD (EVEX immediate versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN TMP_SRC1[i+63:i] := SRC1[63:0];
ELSE TMP_SRC1[i+63:i] := SRC1[i+63:i];
FI;
ENDFOR;
IF (imm8[0] = 0) THEN TMP_DEST[63:0] := TMP_SRC1[63:0]; FI;
IF (imm8[0] = 1) THEN TMP_DEST[63:0] := TMP_SRC1[127:64]; FI;
IF (imm8[1] = 0) THEN TMP_DEST[127:64] := TMP_SRC1[63:0]; FI;
IF (imm8[1] = 1) THEN TMP_DEST[127:64] := TMP_SRC1[127:64]; FI;
IF VL >= 256
IF (imm8[2] = 0) THEN TMP_DEST[191:128] := TMP_SRC1[191:128]; FI;
IF (imm8[2] = 1) THEN TMP_DEST[191:128] := TMP_SRC1[255:192]; FI;
IF (imm8[3] = 0) THEN TMP_DEST[255:192] := TMP_SRC1[191:128]; FI;
IF (imm8[3] = 1) THEN TMP_DEST[255:192] := TMP_SRC1[255:192]; FI;
FI;
IF VL >= 512
IF (imm8[4] = 0) THEN TMP_DEST[319:256] := TMP_SRC1[319:256]; FI;
IF (imm8[4] = 1) THEN TMP_DEST[319:256] := TMP_SRC1[383:320]; FI;
IF (imm8[5] = 0) THEN TMP_DEST[383:320] := TMP_SRC1[319:256]; FI;
IF (imm8[5] = 1) THEN TMP_DEST[383:320] := TMP_SRC1[383:320]; FI;
IF (imm8[6] = 0) THEN TMP_DEST[447:384] := TMP_SRC1[447:384]; FI;
IF (imm8[6] = 1) THEN TMP_DEST[447:384] := TMP_SRC1[511:448]; FI;
IF (imm8[7] = 0) THEN TMP_DEST[511:448] := TMP_SRC1[447:384]; FI;
IF (imm8[7] = 1) THEN TMP_DEST[511:448] := TMP_SRC1[511:448]; FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VPERMILPD (256-bit immediate version)
IF (imm8[0] = 0) THEN DEST[63:0] := SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] := SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] := SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] := SRC1[127:64]
IF (imm8[2] = 0) THEN DEST[191:128] := SRC1[191:128]
IF (imm8[2] = 1) THEN DEST[191:128] := SRC1[255:192]
IF (imm8[3] = 0) THEN DEST[255:192] := SRC1[191:128]
IF (imm8[3] = 1) THEN DEST[255:192] := SRC1[255:192]
DEST[MAXVL-1:256] := 0

VPERMILPD (128-bit immediate version)


IF (imm8[0] = 0) THEN DEST[63:0] := SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] := SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] := SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

VPERMILPD (EVEX variable versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0];
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i];
FI;
ENDFOR;

IF (TMP_SRC2[1] = 0) THEN TMP_DEST[63:0] := SRC1[63:0]; FI;
IF (TMP_SRC2[1] = 1) THEN TMP_DEST[63:0] := SRC1[127:64]; FI;
IF (TMP_SRC2[65] = 0) THEN TMP_DEST[127:64] := SRC1[63:0]; FI;
IF (TMP_SRC2[65] = 1) THEN TMP_DEST[127:64] := SRC1[127:64]; FI;
IF VL >= 256
IF (TMP_SRC2[129] = 0) THEN TMP_DEST[191:128] := SRC1[191:128]; FI;
IF (TMP_SRC2[129] = 1) THEN TMP_DEST[191:128] := SRC1[255:192]; FI;
IF (TMP_SRC2[193] = 0) THEN TMP_DEST[255:192] := SRC1[191:128]; FI;
IF (TMP_SRC2[193] = 1) THEN TMP_DEST[255:192] := SRC1[255:192]; FI;
FI;
IF VL >= 512
IF (TMP_SRC2[257] = 0) THEN TMP_DEST[319:256] := SRC1[319:256]; FI;
IF (TMP_SRC2[257] = 1) THEN TMP_DEST[319:256] := SRC1[383:320]; FI;
IF (TMP_SRC2[321] = 0) THEN TMP_DEST[383:320] := SRC1[319:256]; FI;
IF (TMP_SRC2[321] = 1) THEN TMP_DEST[383:320] := SRC1[383:320]; FI;
IF (TMP_SRC2[385] = 0) THEN TMP_DEST[447:384] := SRC1[447:384]; FI;
IF (TMP_SRC2[385] = 1) THEN TMP_DEST[447:384] := SRC1[511:448]; FI;
IF (TMP_SRC2[449] = 0) THEN TMP_DEST[511:448] := SRC1[447:384]; FI;
IF (TMP_SRC2[449] = 1) THEN TMP_DEST[511:448] := SRC1[511:448]; FI;
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE



IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMILPD (256-bit variable version)


IF (SRC2[1] = 0) THEN DEST[63:0] := SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] := SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] := SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] := SRC1[127:64]
IF (SRC2[129] = 0) THEN DEST[191:128] := SRC1[191:128]
IF (SRC2[129] = 1) THEN DEST[191:128] := SRC1[255:192]
IF (SRC2[193] = 0) THEN DEST[255:192] := SRC1[191:128]
IF (SRC2[193] = 1) THEN DEST[255:192] := SRC1[255:192]
DEST[MAXVL-1:256] := 0

VPERMILPD (128-bit variable version)


IF (SRC2[1] = 0) THEN DEST[63:0] := SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] := SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] := SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPERMILPD __m512d _mm512_permute_pd( __m512d a, int imm);
VPERMILPD __m512d _mm512_mask_permute_pd(__m512d s, __mmask8 k, __m512d a, int imm);
VPERMILPD __m512d _mm512_maskz_permute_pd( __mmask8 k, __m512d a, int imm);
VPERMILPD __m256d _mm256_mask_permute_pd(__m256d s, __mmask8 k, __m256d a, int imm);
VPERMILPD __m256d _mm256_maskz_permute_pd( __mmask8 k, __m256d a, int imm);
VPERMILPD __m128d _mm_mask_permute_pd(__m128d s, __mmask8 k, __m128d a, int imm);
VPERMILPD __m128d _mm_maskz_permute_pd( __mmask8 k, __m128d a, int imm);
VPERMILPD __m512d _mm512_permutevar_pd( __m512i i, __m512d a);
VPERMILPD __m512d _mm512_mask_permutevar_pd(__m512d s, __mmask8 k, __m512i i, __m512d a);
VPERMILPD __m512d _mm512_maskz_permutevar_pd( __mmask8 k, __m512i i, __m512d a);
VPERMILPD __m256d _mm256_mask_permutevar_pd(__m256d s, __mmask8 k, __m256i i, __m256d a);
VPERMILPD __m256d _mm256_maskz_permutevar_pd( __mmask8 k, __m256i i, __m256d a);
VPERMILPD __m128d _mm_mask_permutevar_pd(__m128d s, __mmask8 k, __m128i i, __m128d a);
VPERMILPD __m128d _mm_maskz_permutevar_pd( __mmask8 k, __m128i i, __m128d a);
VPERMILPD __m128d _mm_permute_pd (__m128d a, int control)
VPERMILPD __m256d _mm256_permute_pd (__m256d a, int control)
VPERMILPD __m128d _mm_permutevar_pd (__m128d a, __m128i control);
VPERMILPD __m256d _mm256_permutevar_pd (__m256d a, __m256i control);

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.W = 1.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If (E)VEX.vvvv != 1111B for the imm8-encoded form.



VPERMILPS—Permute In-Lane of Quadruples of Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.128.66.0F38.W0 0C /r VPERMILPS xmm1, xmm2, xmm3/m128 | A | V/V | AVX |
  Permute single precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.
VEX.128.66.0F3A.W0 04 /r ib VPERMILPS xmm1, xmm2/m128, imm8 | B | V/V | AVX |
  Permute single precision floating-point values in xmm2/m128 using controls from imm8 and store result in xmm1.
VEX.256.66.0F38.W0 0C /r VPERMILPS ymm1, ymm2, ymm3/m256 | A | V/V | AVX |
  Permute single precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.
VEX.256.66.0F3A.W0 04 /r ib VPERMILPS ymm1, ymm2/m256, imm8 | B | V/V | AVX |
  Permute single precision floating-point values in ymm2/m256 using controls from imm8 and store result in ymm1.
EVEX.128.66.0F38.W0 0C /r VPERMILPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values in xmm2 using control from xmm3/m128/m32bcst and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 0C /r VPERMILPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values in ymm2 using control from ymm3/m256/m32bcst and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 0C /r VPERMILPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F OR AVX10.1 [1] |
  Permute single-precision floating-point values in zmm2 using control from zmm3/m512/m32bcst and store the result in zmm1 using writemask k1.
EVEX.128.66.0F3A.W0 04 /r ib VPERMILPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values in xmm2/m128/m32bcst using controls from imm8 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F3A.W0 04 /r ib VPERMILPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | D | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values in ymm2/m256/m32bcst using controls from imm8 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F3A.W0 04 /r ib VPERMILPS zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | D | V/V | AVX512F OR AVX10.1 [1] |
  Permute single-precision floating-point values in zmm2/m512/m32bcst using controls from imm8 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
D Full ModRM:reg (w) ModRM:r/m (r) N/A N/A



Description
Variable control version:
Permute quadruples of single precision floating-point values in the first source operand (second operand), each
quadruplet using a 2-bit control field in the corresponding dword element of the second source operand. Permuted
results are stored in the destination operand (first operand).
The 2-bit control fields are located at the low two bits of each dword element (see Figure 5-26). Each control
determines which of the source elements in an input quadruple is selected for the destination element. Each
quadruple of source elements must lie in the same 128-bit region as the destination.
EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are
written to the destination under the writemask.

[Figure 5-25. VPERMILPS Operation: with SRC1 = {X7, X6, X5, X4, X3, X2, X1, X0}, each destination element in the low 128-bit lane selects from {X3..X0}, and each element in the high lane selects from {X7..X4}.]

[Figure 5-26. VPERMILPS Shuffle Control: control field j occupies bits [1:0] of dword element j of the control operand (bits 1:0, 33:32, 65:64, ...); the upper 30 bits of each element are ignored.]

(Immediate control version)


Permute quadruples of single precision floating-point values in the first source operand (second operand), each
quadruplet using a 2-bit control field in the imm8 byte. Each 128-bit lane in the destination operand (first
operand) uses the four control fields of the same imm8 byte.
VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination
operand is a YMM/XMM register.
EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory
location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are written to
the destination under the writemask.
Note: For the imm8 version, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction
will #UD.
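
For example, imm8 = 1Bh (control fields 3, 2, 1, 0 holding the values 0, 1, 2, 3) reverses the four floats within each 128-bit lane; a short sketch (helper name illustrative):

#include <immintrin.h>

/* Reverse the floats inside each 128-bit lane: destination element j
   takes the source element named by imm8[2j+1:2j]; 0x1B = 00 01 10 11b. */
__m256 reverse_in_lanes_ps(__m256 v)
{
    return _mm256_permute_ps(v, 0x1B);
}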



Operation
Select4(SRC, control) {
CASE (control[1:0]) OF
0: TMP := SRC[31:0];
1: TMP := SRC[63:32];
2: TMP := SRC[95:64];
3: TMP := SRC[127:96];
ESAC;
RETURN TMP
}

VPERMILPS (EVEX immediate versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN TMP_SRC1[i+31:i] := SRC1[31:0];
ELSE TMP_SRC1[i+31:i] := SRC1[i+31:i];
FI;
ENDFOR;

TMP_DEST[31:0] := Select4(TMP_SRC1[127:0], imm8[1:0]);
TMP_DEST[63:32] := Select4(TMP_SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] := Select4(TMP_SRC1[127:0], imm8[5:4]);
TMP_DEST[127:96] := Select4(TMP_SRC1[127:0], imm8[7:6]);
IF VL >= 256
TMP_DEST[159:128] := Select4(TMP_SRC1[255:128], imm8[1:0]);
TMP_DEST[191:160] := Select4(TMP_SRC1[255:128], imm8[3:2]);
TMP_DEST[223:192] := Select4(TMP_SRC1[255:128], imm8[5:4]);
TMP_DEST[255:224] := Select4(TMP_SRC1[255:128], imm8[7:6]);
FI;
IF VL >= 512
TMP_DEST[287:256] := Select4(TMP_SRC1[383:256], imm8[1:0]);
TMP_DEST[319:288] := Select4(TMP_SRC1[383:256], imm8[3:2]);
TMP_DEST[351:320] := Select4(TMP_SRC1[383:256], imm8[5:4]);
TMP_DEST[383:352] := Select4(TMP_SRC1[383:256], imm8[7:6]);
TMP_DEST[415:384] := Select4(TMP_SRC1[511:384], imm8[1:0]);
TMP_DEST[447:416] := Select4(TMP_SRC1[511:384], imm8[3:2]);
TMP_DEST[479:448] := Select4(TMP_SRC1[511:384], imm8[5:4]);
TMP_DEST[511:480] := Select4(TMP_SRC1[511:384], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking*
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VPERMILPS (256-bit immediate version)
DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] := Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] := Select4(SRC1[127:0], imm8[7:6]);
DEST[159:128] := Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] := Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] := Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] := Select4(SRC1[255:128], imm8[7:6]);

VPERMILPS (128-bit immediate version)


DEST[31:0] := Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] := Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] := Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] := Select4(SRC1[127:0], imm8[7:6]);
DEST[MAXVL-1:128] := 0

VPERMILPS (EVEX variable versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0];
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i];
FI;
ENDFOR;
TMP_DEST[31:0] := Select4(SRC1[127:0], TMP_SRC2[1:0]);
TMP_DEST[63:32] := Select4(SRC1[127:0], TMP_SRC2[33:32]);
TMP_DEST[95:64] := Select4(SRC1[127:0], TMP_SRC2[65:64]);
TMP_DEST[127:96] := Select4(SRC1[127:0], TMP_SRC2[97:96]);
IF VL >= 256
TMP_DEST[159:128] := Select4(SRC1[255:128], TMP_SRC2[129:128]);
TMP_DEST[191:160] := Select4(SRC1[255:128], TMP_SRC2[161:160]);
TMP_DEST[223:192] := Select4(SRC1[255:128], TMP_SRC2[193:192]);
TMP_DEST[255:224] := Select4(SRC1[255:128], TMP_SRC2[225:224]);
FI;
IF VL >= 512
TMP_DEST[287:256] := Select4(SRC1[383:256], TMP_SRC2[257:256]);
TMP_DEST[319:288] := Select4(SRC1[383:256], TMP_SRC2[289:288]);
TMP_DEST[351:320] := Select4(SRC1[383:256], TMP_SRC2[321:320]);
TMP_DEST[383:352] := Select4(SRC1[383:256], TMP_SRC2[353:352]);
TMP_DEST[415:384] := Select4(SRC1[511:384], TMP_SRC2[385:384]);
TMP_DEST[447:416] := Select4(SRC1[511:384], TMP_SRC2[417:416]);
TMP_DEST[479:448] := Select4(SRC1[511:384], TMP_SRC2[449:448]);
TMP_DEST[511:480] := Select4(SRC1[511:384], TMP_SRC2[481:480]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking*
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ;zeroing-masking



FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMILPS (256-bit variable version)


DEST[31:0] := Select4(SRC1[127:0], SRC2[1:0]);
DEST[63:32] := Select4(SRC1[127:0], SRC2[33:32]);
DEST[95:64] := Select4(SRC1[127:0], SRC2[65:64]);
DEST[127:96] := Select4(SRC1[127:0], SRC2[97:96]);
DEST[159:128] := Select4(SRC1[255:128], SRC2[129:128]);
DEST[191:160] := Select4(SRC1[255:128], SRC2[161:160]);
DEST[223:192] := Select4(SRC1[255:128], SRC2[193:192]);
DEST[255:224] := Select4(SRC1[255:128], SRC2[225:224]);
DEST[MAXVL-1:256] := 0

VPERMILPS (128-bit variable version)


DEST[31:0] := Select4(SRC1[127:0], SRC2[1:0]);
DEST[63:32] := Select4(SRC1[127:0], SRC2[33:32]);
DEST[95:64] := Select4(SRC1[127:0], SRC2[65:64]);
DEST[127:96] := Select4(SRC1[127:0], SRC2[97:96]);
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPERMILPS __m512 _mm512_permute_ps( __m512 a, int imm);
VPERMILPS __m512 _mm512_mask_permute_ps(__m512 s, __mmask16 k, __m512 a, int imm);
VPERMILPS __m512 _mm512_maskz_permute_ps( __mmask16 k, __m512 a, int imm);
VPERMILPS __m256 _mm256_mask_permute_ps(__m256 s, __mmask8 k, __m256 a, int imm);
VPERMILPS __m256 _mm256_maskz_permute_ps( __mmask8 k, __m256 a, int imm);
VPERMILPS __m128 _mm_mask_permute_ps(__m128 s, __mmask8 k, __m128 a, int imm);
VPERMILPS __m128 _mm_maskz_permute_ps( __mmask8 k, __m128 a, int imm);
VPERMILPS __m512 _mm512_permutevar_ps( __m512i i, __m512 a);
VPERMILPS __m512 _mm512_mask_permutevar_ps(__m512 s, __mmask16 k, __m512i i, __m512 a);
VPERMILPS __m512 _mm512_maskz_permutevar_ps( __mmask16 k, __m512i i, __m512 a);
VPERMILPS __m256 _mm256_mask_permutevar_ps(__m256 s, __mmask8 k, __m256i i, __m256 a);
VPERMILPS __m256 _mm256_maskz_permutevar_ps( __mmask8 k, __m256i i, __m256 a);
VPERMILPS __m128 _mm_mask_permutevar_ps(__m128 s, __mmask8 k, __m128i i, __m128 a);
VPERMILPS __m128 _mm_maskz_permutevar_ps( __mmask8 k, __m128i i, __m128 a);
VPERMILPS __m128 _mm_permute_ps (__m128 a, int control);
VPERMILPS __m256 _mm256_permute_ps (__m256 a, int control);
VPERMILPS __m128 _mm_permutevar_ps (__m128 a, __m128i control);
VPERMILPS __m256 _mm256_permutevar_ps (__m256 a, __m256i control);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.W = 1.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If (E)VEX.vvvv != 1111B for the imm8-encoded form.



VPERMPD—Permute Double Precision Floating-Point Elements
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.256.66.0F3A.W1 01 /r ib VPERMPD ymm1, ymm2/m256, imm8 | A | V/V | AVX2 |
  Permute double precision floating-point elements in ymm2/m256 using indices in imm8 and store the result in ymm1.
EVEX.256.66.0F3A.W1 01 /r ib VPERMPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point elements in ymm2/m256/m64bcst using indexes in imm8 and store the result in ymm1 subject to writemask k1.
EVEX.512.66.0F3A.W1 01 /r ib VPERMPD zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute double precision floating-point elements in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1 subject to writemask k1.
EVEX.256.66.0F38.W1 16 /r VPERMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point elements in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W1 16 /r VPERMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 [1] |
  Permute double precision floating-point elements in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
The imm8 version: Copies quadword elements of double precision floating-point values from the source operand
(the second operand) to the destination operand (the first operand) according to the indices specified by the
immediate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the
source operand.
VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the
corresponding destination register are zeroed.
In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the
imm8 bits are reused as control bits for the upper 256-bit half when the control comes from the immediate. The
source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit
memory location.
The imm8 versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.
The vector control version: Copies quadword elements of double precision floating-point values from the second
source operand (the third operand) to the destination operand (the first operand) according to the indices in the
first source operand (the second operand). The low 2 bits (VL = 256) or 3 bits (VL = 512) of each 64-bit element
in the index operand select which quadword in the second source operand to copy. The first and second operands
are ZMM registers; the third operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 64-bit memory location. The elements in the destination are updated using the writemask k1.



Note that this instruction permits a qword in the source operand to be copied to multiple locations in the
destination operand.
If VPERMPD is encoded with VEX.L = 0, an attempt to execute the instruction will cause a #UD exception.
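
Unlike VPERMILPD, the permute can cross the 128-bit lane boundary. A sketch using the corresponding AVX2 intrinsic (helper name illustrative):

#include <immintrin.h>

/* Reverse all four doubles of a YMM register: indices 3,2,1,0 are
   packed into imm8 as 00 01 10 11b = 0x1B. */
__m256d reverse4_pd(__m256d v)
{
    return _mm256_permute4x64_pd(v, 0x1B);
}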

Operation
VPERMPD (EVEX - imm8 control forms)
(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN TMP_SRC[i+63:i] := SRC[63:0];
ELSE TMP_SRC[i+63:i] := SRC[i+63:i];
FI;
ENDFOR;

TMP_DEST[63:0] := (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0];
IF VL >= 512
TMP_DEST[319:256] := (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VPERMPD (EVEX - vector control forms)
(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0];
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i];
FI;
ENDFOR;

IF VL = 256
TMP_DEST[63:0] := (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0];
FI;
IF VL = 512
TMP_DEST[63:0] := (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0];
TMP_DEST[319:256] := (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMPD (VEX.256 encoded version)


DEST[63:0] := (SRC[255:0] >> (IMM8[1:0] * 64))[63:0];
DEST[127:64] := (SRC[255:0] >> (IMM8[3:2] * 64))[63:0];
DEST[191:128] := (SRC[255:0] >> (IMM8[5:4] * 64))[63:0];
DEST[255:192] := (SRC[255:0] >> (IMM8[7:6] * 64))[63:0];
DEST[MAXVL-1:256] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPERMPD __m512d _mm512_permutex_pd( __m512d a, int imm);
VPERMPD __m512d _mm512_mask_permutex_pd(__m512d s, __mmask8 k, __m512d a, int imm);
VPERMPD __m512d _mm512_maskz_permutex_pd( __mmask8 k, __m512d a, int imm);
VPERMPD __m512d _mm512_permutexvar_pd( __m512i i, __m512d a);
VPERMPD __m512d _mm512_mask_permutexvar_pd(__m512d s, __mmask8 k, __m512i i, __m512d a);
VPERMPD __m512d _mm512_maskz_permutexvar_pd( __mmask8 k, __m512i i, __m512d a);
VPERMPD __m256d _mm256_permutex_pd( __m256d a, int imm);
VPERMPD __m256d _mm256_mask_permutex_pd(__m256d s, __mmask8 k, __m256d a, int imm);
VPERMPD __m256d _mm256_maskz_permutex_pd( __mmask8 k, __m256d a, int imm);
VPERMPD __m256d _mm256_permutexvar_pd( __m256i i, __m256d a);
VPERMPD __m256d _mm256_mask_permutexvar_pd(__m256d s, __mmask8 k, __m256i i, __m256d a);
VPERMPD __m256d _mm256_maskz_permutexvar_pd( __mmask8 k, __m256i i, __m256d a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions”; additionally:
#UD If VEX.L = 0.
If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions”; additionally:
#UD If encoded with EVEX.128.
If EVEX.vvvv != 1111B for the imm8-encoded form.



VPERMPS—Permute Single Precision Floating-Point Elements
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.256.66.0F38.W0 16 /r VPERMPS ymm1, ymm2, ymm3/m256 | A | V/V | AVX2 |
  Permute single precision floating-point elements in ymm3/m256 using indices in ymm2 and store the result in ymm1.
EVEX.256.66.0F38.W0 16 /r VPERMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point elements in ymm3/m256/m32bcst using indexes in ymm2 and store the result in ymm1 subject to write mask k1.
EVEX.512.66.0F38.W0 16 /r VPERMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute single-precision floating-point values in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 subject to write mask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Copies doubleword elements of single precision floating-point values from the second source operand (the third
operand) to the destination operand (the first operand) according to the indices in the first source operand (the
second operand). Note that this instruction permits a doubleword in the source operand to be copied to more than
one location in the destination operand.
VEX.256 versions: The first and second operands are YMM registers; the third operand can be a YMM register or a
memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded version: The first and second operands are ZMM registers; the third operand can be a ZMM register,
a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The elements in the
destination are updated using the writemask k1.
If VPERMPS is encoded with VEX.L = 0, an attempt to execute the instruction will cause a #UD exception.
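
A sketch of the VEX.256 form using the AVX2 intrinsic name _mm256_permutevar8x32_ps (helper name illustrative):

#include <immintrin.h>

/* Reverse eight floats across the full YMM width; the permute is
   driven by a dword index vector and crosses the lane boundary. */
__m256 reverse8_ps(__m256 v)
{
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_ps(v, idx);
}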

Operation
VPERMPS (EVEX forms)
(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0];
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i];
FI;
ENDFOR;

IF VL = 256
TMP_DEST[31:0] := (TMP_SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];
TMP_DEST[63:32] := (TMP_SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];



TMP_DEST[95:64] := (TMP_SRC2[255:0] >> (SRC1[66:64] * 32))[31:0];
TMP_DEST[127:96] := (TMP_SRC2[255:0] >> (SRC1[98:96] * 32))[31:0];
TMP_DEST[159:128] := (TMP_SRC2[255:0] >> (SRC1[130:128] * 32))[31:0];
TMP_DEST[191:160] := (TMP_SRC2[255:0] >> (SRC1[162:160] * 32))[31:0];
TMP_DEST[223:192] := (TMP_SRC2[255:0] >> (SRC1[194:192] * 32))[31:0];
TMP_DEST[255:224] := (TMP_SRC2[255:0] >> (SRC1[226:224] * 32))[31:0];
FI;
IF VL = 512
TMP_DEST[31:0] := (TMP_SRC2[511:0] >> (SRC1[3:0] * 32))[31:0];
TMP_DEST[63:32] := (TMP_SRC2[511:0] >> (SRC1[35:32] * 32))[31:0];
TMP_DEST[95:64] := (TMP_SRC2[511:0] >> (SRC1[67:64] * 32))[31:0];
TMP_DEST[127:96] := (TMP_SRC2[511:0] >> (SRC1[99:96] * 32))[31:0];
TMP_DEST[159:128] := (TMP_SRC2[511:0] >> (SRC1[131:128] * 32))[31:0];
TMP_DEST[191:160] := (TMP_SRC2[511:0] >> (SRC1[163:160] * 32))[31:0];
TMP_DEST[223:192] := (TMP_SRC2[511:0] >> (SRC1[195:192] * 32))[31:0];
TMP_DEST[255:224] := (TMP_SRC2[511:0] >> (SRC1[227:224] * 32))[31:0];
TMP_DEST[287:256] := (TMP_SRC2[511:0] >> (SRC1[259:256] * 32))[31:0];
TMP_DEST[319:288] := (TMP_SRC2[511:0] >> (SRC1[291:288] * 32))[31:0];
TMP_DEST[351:320] := (TMP_SRC2[511:0] >> (SRC1[323:320] * 32))[31:0];
TMP_DEST[383:352] := (TMP_SRC2[511:0] >> (SRC1[355:352] * 32))[31:0];
TMP_DEST[415:384] := (TMP_SRC2[511:0] >> (SRC1[387:384] * 32))[31:0];
TMP_DEST[447:416] := (TMP_SRC2[511:0] >> (SRC1[419:416] * 32))[31:0];
TMP_DEST[479:448] := (TMP_SRC2[511:0] >> (SRC1[451:448] * 32))[31:0];
TMP_DEST[511:480] := (TMP_SRC2[511:0] >> (SRC1[483:480] * 32))[31:0];
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMPS (VEX.256 encoded version)


DEST[31:0] := (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];
DEST[63:32] := (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];
DEST[95:64] := (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0];
DEST[127:96] := (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0];
DEST[159:128] := (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0];
DEST[191:160] := (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0];
DEST[223:192] := (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0];
DEST[255:224] := (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0];
DEST[MAXVL-1:256] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPERMPS __m512 _mm512_permutexvar_ps(__m512i i, __m512 a);
VPERMPS __m512 _mm512_mask_permutexvar_ps(__m512 s, __mmask16 k, __m512i i, __m512 a);
VPERMPS __m512 _mm512_maskz_permutexvar_ps( __mmask16 k, __m512i i, __m512 a);
VPERMPS __m256 _mm256_permutexvar_ps(__m256i i, __m256 a);
VPERMPS __m256 _mm256_mask_permutexvar_ps(__m256 s, __mmask8 k, __m256i i, __m256 a);
VPERMPS __m256 _mm256_maskz_permutexvar_ps( __mmask8 k, __m256i i, __m256 a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”



VPERMQ—Qwords Element Permutation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.256.66.0F3A.W1 00 /r ib VPERMQ ymm1, ymm2/m256, imm8 | A | V/V | AVX2 |
  Permute qwords in ymm2/m256 using indices in imm8 and store the result in ymm1.
EVEX.256.66.0F3A.W1 00 /r ib VPERMQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute qwords in ymm2/m256/m64bcst using indexes in imm8 and store the result in ymm1.
EVEX.512.66.0F3A.W1 00 /r ib VPERMQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute qwords in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1.
EVEX.256.66.0F38.W1 36 /r VPERMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute qwords in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1.
EVEX.512.66.0F38.W1 36 /r VPERMQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F OR AVX10.1 [1] |
  Permute qwords in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) ModRM:r/m (r) imm8 N/A
B Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
The imm8 version: Copies quadwords from the source operand (the second operand) to the destination operand
(the first operand) according to the indices specified by the immediate operand (the third operand). Each two-bit
value in the immediate byte selects a qword element in the source operand.
VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the
corresponding destination register are zeroed.
In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the
imm8 bits are reused as control bits for the upper 256-bit half when the control comes from the immediate. The
source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit
memory location.
Immediate control versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction
will #UD.
The vector control version: Copies quadwords from the second source operand (the third operand) to the
destination operand (the first operand) according to the indices in the first source operand (the second operand).
The low 2 bits (VL = 256) or 3 bits (VL = 512) of each 64-bit element in the index operand select which quadword
in the second source operand to copy. The first and second operands are ZMM registers; the third operand can be
a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The
elements in the destination are updated using the writemask k1.
Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination
operand.



If VPERMQ is encoded with VEX.L = 0 or EVEX.128, an attempt to execute the instruction will cause a #UD
exception.
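
For example, imm8 = 55h (index 1 in every 2-bit field) broadcasts qword 1 to all four positions; a sketch using the AVX2 intrinsic (helper name illustrative):

#include <immintrin.h>

/* Broadcast qword element 1 of 'v' to all four qword positions. */
__m256i bcast_q1(__m256i v)
{
    return _mm256_permute4x64_epi64(v, 0x55);  /* 0b01010101 */
}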

Operation
VPERMQ (EVEX - imm8 control forms)
(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN TMP_SRC[i+63:i] := SRC[63:0];
ELSE TMP_SRC[i+63:i] := SRC[i+63:i];
FI;
ENDFOR;
TMP_DEST[63:0] := (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0];
IF VL >= 512
TMP_DEST[319:256] := (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMQ (EVEX - vector control forms)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0];
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i];
FI;
ENDFOR;
IF VL = 256
TMP_DEST[63:0] := (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0];
FI;
IF VL = 512
TMP_DEST[63:0] := (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0];



TMP_DEST[127:64] := (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0];
TMP_DEST[319:256] := (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMQ (VEX.256 encoded version)


DEST[63:0] := (SRC[255:0] >> (IMM8[1:0] * 64))[63:0];
DEST[127:64] := (SRC[255:0] >> (IMM8[3:2] * 64))[63:0];
DEST[191:128] := (SRC[255:0] >> (IMM8[5:4] * 64))[63:0];
DEST[255:192] := (SRC[255:0] >> (IMM8[7:6] * 64))[63:0];
DEST[MAXVL-1:256] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPERMQ __m512i _mm512_permutex_epi64( __m512i a, int imm);
VPERMQ __m512i _mm512_mask_permutex_epi64(__m512i s, __mmask8 k, __m512i a, int imm);
VPERMQ __m512i _mm512_maskz_permutex_epi64( __mmask8 k, __m512i a, int imm);
VPERMQ __m512i _mm512_permutexvar_epi64( __m512i a, __m512i b);
VPERMQ __m512i _mm512_mask_permutexvar_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPERMQ __m512i _mm512_maskz_permutexvar_epi64( __mmask8 k, __m512i a, __m512i b);
VPERMQ __m256i _mm256_permutex_epi64( __m256i a, int imm);
VPERMQ __m256i _mm256_mask_permutex_epi64(__m256i s, __mmask8 k, __m256i a, int imm);
VPERMQ __m256i _mm256_maskz_permutex_epi64( __mmask8 k, __m256i a, int imm);
VPERMQ __m256i _mm256_permutexvar_epi64( __m256i a, __m256i b);
VPERMQ __m256i _mm256_mask_permutexvar_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPERMQ __m256i _mm256_maskz_permutexvar_epi64( __mmask8 k, __m256i a, __m256i b);

SIMD Floating-Point Exceptions


None.



Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If encoded with EVEX.128.
If EVEX.vvvv != 1111B for the imm8-encoded form.



VPERMT2B—Full Permute of Bytes From Two Tables Overwriting a Table
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F38.W0 7D /r VPERMT2B xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1 [1] |
  Permute bytes in xmm3/m128 and xmm1 using byte indexes in xmm2 and store the byte results in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 7D /r VPERMT2B ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | (AVX512VL AND AVX512_VBMI) OR AVX10.1 [1] |
  Permute bytes in ymm3/m256 and ymm1 using byte indexes in ymm2 and store the byte results in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 7D /r VPERMT2B zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512_VBMI OR AVX10.1 [1] |
  Permute bytes in zmm3/m512 and zmm1 using byte indexes in zmm2 and store the byte results in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Permutes byte values from two tables, comprising the first operand (also the destination operand) and the third
operand (the second source operand). The second operand (the first source operand) provides byte indices to
select byte results from the two tables. The selected byte elements are written to the destination at byte
granularity under the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select
elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the
result. The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In
each index byte, the id bit for table selection is bit 6/5/4 (for VL = 512/256/128), and bits [5:0]/[4:0]/[3:0] select
an element within each input table.
Note that these instructions permit a byte value in the source operands to be copied to more than one location in
the destination operand. Also, the second table and the indices can be reused in subsequent iterations, but the first
table is overwritten.
Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.
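
As a usage sketch (requires AVX512_VBMI; the helper name is illustrative), the instruction implements a 128-entry byte table lookup in a single operation:

#include <immintrin.h>

/* Look up 64 bytes in a 128-entry table held in two ZMM registers.
   At VL = 512, bit 6 of each index byte is the table-select id bit:
   indices 0..63 read from 't0', 64..127 from 't1'. */
__m512i tbl128_byte(__m512i idx, __m512i t0, __m512i t1)
{
    return _mm512_permutex2var_epi8(t0, idx, t1);
}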

Operation
VPERMT2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
id := 3;
ELSE IF VL = 256:
id := 4;
ELSE IF VL = 512:
id := 5;
FI;
TMP_DEST[VL-1:0] := DEST[VL-1:0];
FOR j := 0 TO KL-1
off := 8 * SRC1[j*8+id : j*8];
IF k1[j] OR *no writemask*:
DEST[j*8+7 : j*8] := SRC1[j*8+id+1] ? SRC2[off+7:off] : TMP_DEST[off+7:off];
ELSE IF *zeroing-masking*:
DEST[j*8+7 : j*8] := 0;
ELSE ; merging-masking
*DEST[j*8+7 : j*8] remains unchanged*
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VPERMT2B __m512i _mm512_permutex2var_epi8(__m512i a, __m512i idx, __m512i b);
VPERMT2B __m512i _mm512_mask_permutex2var_epi8(__m512i a, __mmask64 k, __m512i idx, __m512i b);
VPERMT2B __m512i _mm512_maskz_permutex2var_epi8(__mmask64 k, __m512i a, __m512i idx, __m512i b);
VPERMT2B __m256i _mm256_permutex2var_epi8(__m256i a, __m256i idx, __m256i b);
VPERMT2B __m256i _mm256_mask_permutex2var_epi8(__m256i a, __mmask32 k, __m256i idx, __m256i b);
VPERMT2B __m256i _mm256_maskz_permutex2var_epi8(__mmask32 k, __m256i a, __m256i idx, __m256i b);
VPERMT2B __m128i _mm_permutex2var_epi8(__m128i a, __m128i idx, __m128i b);
VPERMT2B __m128i _mm_mask_permutex2var_epi8(__m128i a, __mmask16 k, __m128i idx, __m128i b);
VPERMT2B __m128i _mm_maskz_permutex2var_epi8(__mmask16 k, __m128i a, __m128i idx, __m128i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

VPERMT2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting One Table
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F38.W1 7D /r VPERMT2W xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] |
  Permute word integers from two tables in xmm3/m128 and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 7D /r VPERMT2W ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 [1] |
  Permute word integers from two tables in ymm3/m256 and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 7D /r VPERMT2W zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512BW OR AVX10.1 [1] |
  Permute word integers from two tables in zmm3/m512 and zmm1 using indexes in zmm2 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 7E /r VPERMT2D xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double-words from two tables in xmm3/m128/m32bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 7E /r VPERMT2D ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double-words from two tables in ymm3/m256/m32bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 7E /r VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 7E /r VPERMT2Q xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute quad-words from two tables in xmm3/m128/m64bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 7E /r VPERMT2Q ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute quad-words from two tables in ymm3/m256/m64bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 7E /r VPERMT2Q zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute quad-words from two tables in zmm3/m512/m64bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 7F /r VPERMT2PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values from two tables in xmm3/m128/m32bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 7F /r VPERMT2PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute single-precision floating-point values from two tables in ymm3/m256/m32bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 7F /r VPERMT2PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute single-precision floating-point values from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W1 7F /r VPERMT2PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values from two tables in xmm3/m128/m64bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 7F /r VPERMT2PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1 [1] |
  Permute double precision floating-point values from two tables in ymm3/m256/m64bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 7F /r VPERMT2PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512F OR AVX10.1 [1] |
  Permute double precision floating-point values from two tables in zmm3/m512/m64bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r,w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Permutes 16-bit/32-bit/64-bit values in the first operand and the third operand (the second source operand) using
indices in the second operand (the first source operand) to select elements from the first and third operands. The
selected elements are written to the destination operand (the first operand) according to the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select
elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the
result.
D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the
low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).
Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits
[3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands. If
the id bit is 0, table_1 (the first source) is selected; otherwise, table_2 (the second source operand) is selected.
Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0]/bit 0 select an element within each
input table.
Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit
memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] select an element within each
input table.
Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than
one location in the destination operand. Note also that, because the first operand is overwritten, the same index
can be reused (for example, in a second iteration) while the table elements being permuted are overwritten.
Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.
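
As an informal illustration (not part of the architectural specification), the following minimal C sketch uses the VPERMT2D intrinsic listed later in this section; the table values, the index pattern, and an AVX-512F-capable build environment are assumptions of the example:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* table_1 holds 0..15; table_2 holds 100..115 (example values). */
    __m512i t1 = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i t2 = _mm512_setr_epi32(100,101,102,103,104,105,106,107,
                                   108,109,110,111,112,113,114,115);
    /* For dwords at VL=512, bits [3:0] select the element and bit 4 is the
       id bit: indices 0..15 pick from t1, indices 16..31 pick from t2. */
    __m512i idx = _mm512_setr_epi32(0,16,1,17,2,18,3,19,
                                    4,20,5,21,6,22,7,23);
    /* Corresponds to the VPERMT2D form, where the first table is also
       the destination. */
    __m512i r = _mm512_permutex2var_epi32(t1, idx, t2);
    int out[16];
    _mm512_storeu_si512(out, r);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    printf("\n"); /* 0 100 1 101 2 102 3 103 4 104 5 105 6 106 7 107 */
    return 0;
}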

Operation
VPERMT2W (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
id := 2
FI;
IF VL = 256
id := 3
FI;
IF VL = 512
id := 4
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 16
off := 16*SRC1[i+id:i]
IF k1[j] OR *no writemask*
THEN
DEST[i+15:i] := SRC1[i+id+1] ? SRC2[off+15:off]
: TMP_DEST[off+15:off]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMT2D/VPERMT2PS (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
id := 1
FI;
IF VL = 256
id := 2
FI;
IF VL = 512
id := 3
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 32
off := 32*SRC1[i+id:i]
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+id+1] ? SRC2[31:0]
: TMP_DEST[off+31:off]
ELSE
DEST[i+31:i] := SRC1[i+id+1] ? SRC2[off+31:off]
: TMP_DEST[off+31:off]

FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPERMT2Q/VPERMT2PD (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
id := 0
FI;
IF VL = 256
id := 1
FI;
IF VL = 512
id := 2
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 64
off := 64*SRC1[i+id:i]
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+id+1] ? SRC2[63:0]
: TMP_DEST[off+63:off]
ELSE
DEST[i+63:i] := SRC1[i+id+1] ? SRC2[off+63:off]
: TMP_DEST[off+63:off]
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

VPERMT2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b);


VPERMT2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k, __m512i idx, __m512i b);
VPERMT2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx, __mmask16 k, __m512i b);
VPERMT2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a, __m512i idx, __m512i b);
VPERMT2D __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b);
VPERMT2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k, __m256i idx, __m256i b);

VPERMT2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b);
VPERMT2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b);
VPERMT2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b);
VPERMT2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b);
VPERMT2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b);
VPERMT2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b);
VPERMT2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b);
VPERMT2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b);
VPERMT2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b);
VPERMT2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b);
VPERMT2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b);
VPERMT2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b);
VPERMT2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b);
VPERMT2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b);
VPERMT2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMT2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMT2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);

VPERMT2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMT2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
VPERMT2D/Q/PS/PD: See Table 2-52, “Type E4NF Class Exception Conditions.”
VPERMT2W: See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”

VPEXPANDB/VPEXPANDW—Expand Byte/Word Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 62 /r VPEXPANDB xmm1{k1}{z}, m128 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 128 bits of packed byte values from m128 to xmm1 with writemask k1.
EVEX.128.66.0F38.W0 62 /r VPEXPANDB xmm1{k1}{z}, xmm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 128 bits of packed byte values from xmm2 to xmm1 with writemask k1.
EVEX.256.66.0F38.W0 62 /r VPEXPANDB ymm1{k1}{z}, m256 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 256 bits of packed byte values from m256 to ymm1 with writemask k1.
EVEX.256.66.0F38.W0 62 /r VPEXPANDB ymm1{k1}{z}, ymm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 256 bits of packed byte values from ymm2 to ymm1 with writemask k1.
EVEX.512.66.0F38.W0 62 /r VPEXPANDB zmm1{k1}{z}, m512 | A | V/V | AVX512_VBMI2 OR AVX10.1 (1) | Expands up to 512 bits of packed byte values from m512 to zmm1 with writemask k1.
EVEX.512.66.0F38.W0 62 /r VPEXPANDB zmm1{k1}{z}, zmm2 | B | V/V | AVX512_VBMI2 OR AVX10.1 (1) | Expands up to 512 bits of packed byte values from zmm2 to zmm1 with writemask k1.
EVEX.128.66.0F38.W1 62 /r VPEXPANDW xmm1{k1}{z}, m128 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 128 bits of packed word values from m128 to xmm1 with writemask k1.
EVEX.128.66.0F38.W1 62 /r VPEXPANDW xmm1{k1}{z}, xmm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 128 bits of packed word values from xmm2 to xmm1 with writemask k1.
EVEX.256.66.0F38.W1 62 /r VPEXPANDW ymm1{k1}{z}, m256 | A | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 256 bits of packed word values from m256 to ymm1 with writemask k1.
EVEX.256.66.0F38.W1 62 /r VPEXPANDW ymm1{k1}{z}, ymm2 | B | V/V | (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (1) | Expands up to 256 bits of packed word values from ymm2 to ymm1 with writemask k1.
EVEX.512.66.0F38.W1 62 /r VPEXPANDW zmm1{k1}{z}, m512 | A | V/V | AVX512_VBMI2 OR AVX10.1 (1) | Expands up to 512 bits of packed word values from m512 to zmm1 with writemask k1.
EVEX.512.66.0F38.W1 62 /r VPEXPANDW zmm1{k1}{z}, zmm2 | B | V/V | AVX512_VBMI2 OR AVX10.1 (1) | Expands up to 512 bits of packed word values from zmm2 to zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) ModRM:r/m (r) N/A N/A



Description
Expands (loads) up to 64 byte integer values or 32 word integer values from the source operand (the second
operand) to the destination operand (the first operand), based on the active elements determined by the
writemask operand.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Moves 128, 256, or 512 bits of packed byte or word integer values from the source operand to the destination
operand. This instruction is used to load from an int8/int16 vector register or memory location while inserting the
data into sparse elements of the destination vector register, using the active elements indicated by the operand
writemask.
This instruction supports memory fault suppression.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
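
For illustration only, a minimal C sketch of the expand-load form, assuming AVX512_VBMI2 and AVX512VL support; the buffer contents and mask value are example assumptions:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 16 contiguous bytes in memory (example data). */
    uint8_t src[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    __m128i fill = _mm_set1_epi8(0);  /* merge source for inactive lanes */
    __mmask16 k = 0x00A5;             /* active lanes: 0, 2, 5, 7 */
    /* Expand-load: consecutive bytes from src are placed into the active
       destination lanes, low to high; inactive lanes keep 'fill'. */
    __m128i r = _mm_mask_expandloadu_epi8(fill, k, src);
    uint8_t out[16];
    _mm_storeu_si128((__m128i*)out, r);
    for (int i = 0; i < 16; i++) printf("%u ", out[i]);
    printf("\n"); /* lanes 0,2,5,7 receive 1,2,3,4; all other lanes are 0 */
    return 0;
}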

Operation
VPEXPANDB
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[j] := SRC.byte[k];
k := k + 1
ELSE:
IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.byte[j] := 0
DEST[MAX_VL-1:VL] := 0

VPEXPANDW
(KL, VL) = (8,128), (16,256), (32, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.word[j] := SRC.word[k];
k := k + 1
ELSE:
IF *merging-masking*:
*DEST.word[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.word[j] := 0
DEST[MAX_VL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPEXPANDB __m128i _mm_mask_expand_epi8(__m128i, __mmask16, __m128i);
VPEXPANDB __m128i _mm_maskz_expand_epi8(__mmask16, __m128i);
VPEXPANDB __m128i _mm_mask_expandloadu_epi8(__m128i, __mmask16, const void*);
VPEXPANDB __m128i _mm_maskz_expandloadu_epi8(__mmask16, const void*);
VPEXPANDB __m256i _mm256_mask_expand_epi8(__m256i, __mmask32, __m256i);
VPEXPANDB __m256i _mm256_maskz_expand_epi8(__mmask32, __m256i);
VPEXPANDB __m256i _mm256_mask_expandloadu_epi8(__m256i, __mmask32, const void*);
VPEXPANDB __m256i _mm256_maskz_expandloadu_epi8(__mmask32, const void*);
VPEXPANDB __m512i _mm512_mask_expand_epi8(__m512i, __mmask64, __m512i);
VPEXPANDB __m512i _mm512_maskz_expand_epi8(__mmask64, __m512i);
VPEXPANDB __m512i _mm512_mask_expandloadu_epi8(__m512i, __mmask64, const void*);
VPEXPANDB __m512i _mm512_maskz_expandloadu_epi8(__mmask64, const void*);
VPEXPANDW __m128i _mm_mask_expand_epi16(__m128i, __mmask8, __m128i);
VPEXPANDW __m128i _mm_maskz_expand_epi16(__mmask8, __m128i);
VPEXPANDW __m128i _mm_mask_expandloadu_epi16(__m128i, __mmask8, const void*);
VPEXPANDW __m128i _mm_maskz_expandloadu_epi16(__mmask8, const void *);
VPEXPANDW __m256i _mm256_mask_expand_epi16(__m256i, __mmask16, __m256i);
VPEXPANDW __m256i _mm256_maskz_expand_epi16(__mmask16, __m256i);
VPEXPANDW __m256i _mm256_mask_expandloadu_epi16(__m256i, __mmask16, const void*);
VPEXPANDW __m256i _mm256_maskz_expandloadu_epi16(__mmask16, const void*);
VPEXPANDW __m512i _mm512_mask_expand_epi16(__m512i, __mmask32, __m512i);
VPEXPANDW __m512i _mm512_maskz_expand_epi16(__mmask32, __m512i);
VPEXPANDW __m512i _mm512_mask_expandloadu_epi16(__m512i, __mmask32, const void*);
VPEXPANDW __m512i _mm512_maskz_expandloadu_epi16(__mmask32, const void*);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VPEXPANDD—Load Sparse Packed Doubleword Integer Values From Dense Memory/Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 89 /r VPEXPANDD xmm1 {k1}{z}, xmm2/m128 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Expand packed double-word integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W0 89 /r VPEXPANDD ymm1 {k1}{z}, ymm2/m256 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Expand packed double-word integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W0 89 /r VPEXPANDD zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F OR AVX10.1 (1) | Expand packed double-word integer values from zmm2/m512 to zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Expand (load) up to 16 contiguous doubleword integer values of the input vector in the source operand (the second
operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The
destination operand is a ZMM register; the source operand can be a ZMM register or a memory location.
The input vector starts from the lowest element in the source operand. The opmask register k1 selects the
destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
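
As a hedged usage sketch (assuming AVX512F support; lane positions and values are illustrative), the register form scatters a packed prefix of elements into selected destination lanes:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Place 4 computed values into lanes 1, 4, 9, 12 of a zeroed vector. */
    __m512i packed = _mm512_setr_epi32(10,20,30,40, 0,0,0,0,
                                       0,0,0,0, 0,0,0,0);
    __mmask16 k = (1u<<1) | (1u<<4) | (1u<<9) | (1u<<12);
    /* Zeroing-masking: inactive lanes become 0. */
    __m512i r = _mm512_maskz_expand_epi32(k, packed);
    int out[16];
    _mm512_storeu_si512(out, r);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    printf("\n"); /* 0 10 0 0 20 0 0 0 0 30 0 0 40 0 0 0 */
    return 0;
}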

Operation
VPEXPANDD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := SRC[k+31:k];
k := k + 32
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPEXPANDD __m512i _mm512_mask_expandloadu_epi32(__m512i s, __mmask16 k, void * a);
VPEXPANDD __m512i _mm512_maskz_expandloadu_epi32( __mmask16 k, void * a);
VPEXPANDD __m512i _mm512_mask_expand_epi32(__m512i s, __mmask16 k, __m512i a);
VPEXPANDD __m512i _mm512_maskz_expand_epi32( __mmask16 k, __m512i a);
VPEXPANDD __m256i _mm256_mask_expandloadu_epi32(__m256i s, __mmask8 k, void * a);
VPEXPANDD __m256i _mm256_maskz_expandloadu_epi32( __mmask8 k, void * a);
VPEXPANDD __m256i _mm256_mask_expand_epi32(__m256i s, __mmask8 k, __m256i a);
VPEXPANDD __m256i _mm256_maskz_expand_epi32( __mmask8 k, __m256i a);
VPEXPANDD __m128i _mm_mask_expandloadu_epi32(__m128i s, __mmask8 k, void * a);
VPEXPANDD __m128i _mm_maskz_expandloadu_epi32( __mmask8 k, void * a);
VPEXPANDD __m128i _mm_mask_expand_epi32(__m128i s, __mmask8 k, __m128i a);
VPEXPANDD __m128i _mm_maskz_expand_epi32( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPEXPANDQ—Load Sparse Packed Quadword Integer Values From Dense Memory/Register
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 89 /r VPEXPANDQ xmm1 {k1}{z}, xmm2/m128 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Expand packed quad-word integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W1 89 /r VPEXPANDQ ymm1 {k1}{z}, ymm2/m256 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Expand packed quad-word integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W1 89 /r VPEXPANDQ zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F OR AVX10.1 (1) | Expand packed quad-word integer values from zmm2/m512 to zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Expand (load) up to 8 quadword integer values from the source operand (the second operand) to sparse elements
in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM
register; the source operand can be a ZMM register or a memory location.
The input vector starts from the lowest element in the source operand. The opmask register k1 selects the
destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
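
For illustration, a small C sketch (assuming AVX512F support; the mask and data are example assumptions) pairing the compress operation (_mm512_maskz_compress_epi64) with expand, so that selected qword elements survive a round trip back to their original lanes:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i v = _mm512_setr_epi64(1,2,3,4,5,6,7,8);
    __mmask8 k = 0xB2;                        /* lanes 1, 4, 5, 7 selected */
    /* Compress the selected lanes into contiguous low elements... */
    __m512i packed = _mm512_maskz_compress_epi64(k, v);
    /* ...then expand them back into the same sparse lane positions. */
    __m512i restored = _mm512_maskz_expand_epi64(k, packed);
    long long out[8];
    _mm512_storeu_si512(out, restored);
    for (int i = 0; i < 8; i++) printf("%lld ", out[i]);
    printf("\n"); /* 0 2 0 0 5 6 0 8 */
    return 0;
}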

Operation
VPEXPANDQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+63:i] := SRC[k+63:k];
k := k + 64
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;

ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPEXPANDQ __m512i _mm512_mask_expandloadu_epi64(__m512i s, __mmask8 k, void * a);
VPEXPANDQ __m512i _mm512_maskz_expandloadu_epi64( __mmask8 k, void * a);
VPEXPANDQ __m512i _mm512_mask_expand_epi64(__m512i s, __mmask8 k, __m512i a);
VPEXPANDQ __m512i _mm512_maskz_expand_epi64( __mmask8 k, __m512i a);
VPEXPANDQ __m256i _mm256_mask_expandloadu_epi64(__m256i s, __mmask8 k, void * a);
VPEXPANDQ __m256i _mm256_maskz_expandloadu_epi64( __mmask8 k, void * a);
VPEXPANDQ __m256i _mm256_mask_expand_epi64(__m256i s, __mmask8 k, __m256i a);
VPEXPANDQ __m256i _mm256_maskz_expand_epi64( __mmask8 k, __m256i a);
VPEXPANDQ __m128i _mm_mask_expandloadu_epi64(__m128i s, __mmask8 k, void * a);
VPEXPANDQ __m128i _mm_maskz_expandloadu_epi64( __mmask8 k, void * a);
VPEXPANDQ __m128i _mm_mask_expand_epi64(__m128i s, __mmask8 k, __m128i a);
VPEXPANDQ __m128i _mm_maskz_expand_epi64( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPGATHERDD/VPGATHERDQ—Gather Packed Dword, Packed Qword With Signed Dword Indices
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 90 /vsib VPGATHERDD xmm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W0 90 /vsib VPGATHERDD ymm1 {k1}, vm32y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W0 90 /vsib VPGATHERDD zmm1 {k1}, vm32z | A | V/V | AVX512F OR AVX10.1 (1) | Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.128.66.0F38.W1 90 /vsib VPGATHERDQ xmm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W1 90 /vsib VPGATHERDQ ymm1 {k1}, vm32x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W1 90 /vsib VPGATHERDQ zmm1 {k1}, vm32y | A | V/V | AVX512F OR AVX10.1 (1) | Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | BaseReg (R): VSIB:base, VectorReg (R): VSIB:index | N/A | N/A

Description
A set of 16 or 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector
VINDEX with scale SCALE are gathered. The result is written into vector zmm1. The elements are specified via the
VSIB (i.e., the index register is a zmm, holding packed indices). Elements will only be loaded if their corresponding
mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination register (zmm1)
is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an
instruction breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.

• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• These instructions do not accept zeroing-masking since the 0 values in k1 are used to determine completion.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
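
As an informal example of the merging-masking behavior described above (the table contents, index pattern, mask value, and build flags are assumptions, not part of the specification):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t table[32];
    for (int i = 0; i < 32; i++) table[i] = 100 + i;
    __m512i idx = _mm512_setr_epi32(0,2,4,6,8,10,12,14,
                                    16,18,20,22,24,26,28,30);
    __m512i fallback = _mm512_set1_epi32(-1);
    __mmask16 k = 0x00FF;  /* gather only the low 8 lanes */
    /* Scale is sizeof(int32_t) = 4; lanes with a 0 mask bit keep the
       corresponding element of 'fallback' (merging-masking only). */
    __m512i g = _mm512_mask_i32gather_epi32(fallback, k, idx, table, 4);
    int32_t out[16];
    _mm512_storeu_si512(out, g);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    printf("\n"); /* 100 102 ... 114, then eight -1 values */
    return 0;
}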

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VPGATHERDD (EVEX encoded version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j]
THEN DEST[i+31:i] := MEM[BASE_ADDR +
SignExtend(VINDEX[i+31:i]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+31:i] := remains unchanged* ; Only merging masking is allowed
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL] := 0

VPGATHERDQ (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j]
THEN DEST[i+63:i] :=
MEM[BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+63:i] := remains unchanged* ; Only merging masking is allowed
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPGATHERDD __m512i _mm512_i32gather_epi32( __m512i vdx, void * base, int scale);
VPGATHERDD __m512i _mm512_mask_i32gather_epi32(__m512i s, __mmask16 k, __m512i vdx, void * base, int scale);
VPGATHERDD __m256i _mm256_mmask_i32gather_epi32(__m256i s, __mmask8 k, __m256i vdx, void * base, int scale);
VPGATHERDD __m128i _mm_mmask_i32gather_epi32(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale);
VPGATHERDQ __m512i _mm512_i32logather_epi64( __m256i vdx, void * base, int scale);
VPGATHERDQ __m512i _mm512_mask_i32logather_epi64(__m512i s, __mmask8 k, __m256i vdx, void * base, int scale);
VPGATHERDQ __m256i _mm256_mmask_i32logather_epi64(__m256i s, __mmask8 k, __m128i vdx, void * base, int scale);
VPGATHERDQ __m128i _mm_mmask_i32gather_epi64(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VPGATHERQD/VPGATHERQQ—Gather Packed Dword, Packed Qword with Signed Qword Indices
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 91 /vsib VPGATHERQD xmm1 {k1}, vm64x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W0 91 /vsib VPGATHERQD xmm1 {k1}, vm64y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W0 91 /vsib VPGATHERQD ymm1 {k1}, vm64z | A | V/V | AVX512F OR AVX10.1 (1) | Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.128.66.0F38.W1 91 /vsib VPGATHERQQ xmm1 {k1}, vm64x | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W1 91 /vsib VPGATHERQQ ymm1 {k1}, vm64y | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (1) | Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W1 91 /vsib VPGATHERQQ zmm1 {k1}, vm64z | A | V/V | AVX512F OR AVX10.1 (1) | Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | BaseReg (R): VSIB:base, VectorReg (R): VSIB:index | N/A | N/A

Description
A set of 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector
VINDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via
the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be loaded if their
corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination
register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an
exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an
instruction breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:

• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• These instructions do not accept zeroing-masking since the 0 values in k1 are used to determine completion.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
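
For illustration, a minimal C sketch of an unmasked qword-index gather (the table and indices are illustrative; assumes AVX512F support):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t table[16];
    for (int i = 0; i < 16; i++) table[i] = 1000 + i;
    /* Qword indices can address a full 64-bit index space; these are small. */
    __m512i idx = _mm512_setr_epi64(15, 3, 7, 1, 0, 9, 11, 5);
    __m512i g = _mm512_i64gather_epi64(idx, table, 8); /* scale = sizeof(int64_t) */
    int64_t out[8];
    _mm512_storeu_si512(out, g);
    for (int i = 0; i < 8; i++) printf("%lld ", (long long)out[i]);
    printf("\n"); /* 1015 1003 1007 1001 1000 1009 1011 1005 */
    return 0;
}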

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VPGATHERQD (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j]
THEN DEST[i+31:i] := MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+31:i] := remains unchanged* ; Only merging masking is allowed
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL/2] := 0

VPGATHERQQ (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j]
THEN DEST[i+63:i] :=
MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+63:i] := remains unchanged* ; Only merging masking is allowed
FI;

ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPGATHERQD __m256i _mm512_i64gather_epi32(__m512i vdx, void * base, int scale);
VPGATHERQD __m256i _mm512_mask_i64gather_epi32lo(__m256i s, __mmask8 k, __m512i vdx, void * base, int scale);
VPGATHERQD __m128i _mm256_mask_i64gather_epi32lo(__m128i s, __mmask8 k, __m256i vdx, void * base, int scale);
VPGATHERQD __m128i _mm_mask_i64gather_epi32(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale);
VPGATHERQQ __m512i _mm512_i64gather_epi64( __m512i vdx, void * base, int scale);
VPGATHERQQ __m512i _mm512_mask_i64gather_epi64(__m512i s, __mmask8 k, __m512i vdx, void * base, int scale);
VPGATHERQQ __m256i _mm256_mask_i64gather_epi64(__m256i s, __mmask8 k, __m256i vdx, void * base, int scale);
VPGATHERQQ __m128i _mm_mask_i64gather_epi64(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale);

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 44 /r VPLZCNTD xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1 (1) | Count the number of leading zero bits in each dword element of xmm2/m128/m32bcst using writemask k1.
EVEX.256.66.0F38.W0 44 /r VPLZCNTD ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1 (1) | Count the number of leading zero bits in each dword element of ymm2/m256/m32bcst using writemask k1.
EVEX.512.66.0F38.W0 44 /r VPLZCNTD zmm1 {k1}{z}, zmm2/m512/m32bcst | A | V/V | AVX512CD OR AVX10.1 (1) | Count the number of leading zero bits in each dword element of zmm2/m512/m32bcst using writemask k1.
EVEX.128.66.0F38.W1 44 /r VPLZCNTQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1 (1) | Count the number of leading zero bits in each qword element of xmm2/m128/m64bcst using writemask k1.
EVEX.256.66.0F38.W1 44 /r VPLZCNTQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512CD) OR AVX10.1 (1) | Count the number of leading zero bits in each qword element of ymm2/m256/m64bcst using writemask k1.
EVEX.512.66.0F38.W1 44 /r VPLZCNTQ zmm1 {k1}{z}, zmm2/m512/m64bcst | A | V/V | AVX512CD OR AVX10.1 (1) | Count the number of leading zero bits in each qword element of zmm2/m512/m64bcst using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Counts the number of leading most significant zero bits in each dword or qword element of the source operand (the
second operand) and stores the results in the destination register (the first operand) according to the writemask.
If an element is zero, the result for that element is the operand size of the element.
EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated
using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated
using writemask k1.
EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated
using writemask k1.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
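
As an informal application sketch (assuming AVX512CD support; input values are illustrative), the leading-zero count yields floor(log2(x)) for nonzero lanes via 31 - lzcnt(x):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i v  = _mm512_setr_epi32(1,2,3,4, 15,16,17,255,
                                   256,1024,65535,65536,
                                   1<<20, 1<<30, 0x7FFFFFFF, (int)0x80000000);
    __m512i lz = _mm512_lzcnt_epi32(v);
    /* floor(log2(x)) = 31 - lzcnt(x) for x != 0 (lzcnt returns 32 for x == 0). */
    __m512i fl = _mm512_sub_epi32(_mm512_set1_epi32(31), lz);
    int out[16];
    _mm512_storeu_si512(out, fl);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    printf("\n"); /* 0 1 1 2 3 4 4 7 8 10 15 16 20 30 30 31 */
    return 0;
}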

Operation
VPLZCNTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j*32
IF MaskBit(j) OR *no writemask*
THEN
temp := 32
DEST[i+31:i] := 0
WHILE (temp > 0) AND (SRC[i+temp-1] = 0)
DO
temp := temp - 1
DEST[i+31:i] := DEST[i+31:i] + 1
OD
ELSE
IF *merging-masking*
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0

VPLZCNTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask*
THEN
temp := 64
DEST[i+63:i] := 0
WHILE (temp > 0) AND (SRC[i+temp-1] = 0)
DO
temp := temp - 1
DEST[i+63:i] := DEST[i+63:i] + 1
OD
ELSE
IF *merging-masking*
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

VPLZCNTD __m512i _mm512_lzcnt_epi32(__m512i a);


VPLZCNTD __m512i _mm512_mask_lzcnt_epi32(__m512i s, __mmask16 m, __m512i a);
VPLZCNTD __m512i _mm512_maskz_lzcnt_epi32( __mmask16 m, __m512i a);
VPLZCNTQ __m512i _mm512_lzcnt_epi64(__m512i a);
VPLZCNTQ __m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 m, __m512i a);
VPLZCNTQ __m512i _mm512_maskz_lzcnt_epi64(__mmask8 m, __m512i a);
VPLZCNTD __m256i _mm256_lzcnt_epi32(__m256i a);
VPLZCNTD __m256i _mm256_mask_lzcnt_epi32(__m256i s, __mmask8 m, __m256i a);
VPLZCNTD __m256i _mm256_maskz_lzcnt_epi32( __mmask8 m, __m256i a);
VPLZCNTQ __m256i _mm256_lzcnt_epi64(__m256i a);
VPLZCNTQ __m256i _mm256_mask_lzcnt_epi64(__m256i s, __mmask8 m, __m256i a);
VPLZCNTQ __m256i _mm256_maskz_lzcnt_epi64(__mmask8 m, __m256i a);
VPLZCNTD __m128i _mm_lzcnt_epi32(__m128i a);
VPLZCNTD __m128i _mm_mask_lzcnt_epi32(__m128i s, __mmask8 m, __m128i a);
VPLZCNTD __m128i _mm_maskz_lzcnt_epi32( __mmask8 m, __m128i a);
VPLZCNTQ __m128i _mm_lzcnt_epi64(__m128i a);
VPLZCNTQ __m128i _mm_mask_lzcnt_epi64(__m128i s, __mmask8 m, __m128i a);
VPLZCNTQ __m128i _mm_maskz_lzcnt_epi64(__mmask8 m, __m128i a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”

VPMADD52HUQ—Packed Multiply of Unsigned 52-Bit Unsigned Integers and Add High 52-Bit
Products to 64-Bit Accumulators
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W1 B5 /r VPMADD52HUQ xmm1, xmm2, xmm3/m128 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the high 52 bits of the 104-bit product to the qword unsigned integers in xmm1.
VEX.256.66.0F38.W1 B5 /r VPMADD52HUQ ymm1, ymm2, ymm3/m256 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the high 52 bits of the 104-bit product to the qword unsigned integers in ymm1.
EVEX.128.66.0F38.W1 B5 /r VPMADD52HUQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512_IFMA AND AVX512VL) OR AVX10.1 (1) | Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the high 52 bits of the 104-bit product to the qword unsigned integers in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 B5 /r VPMADD52HUQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512_IFMA AND AVX512VL) OR AVX10.1 (1) | Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the high 52 bits of the 104-bit product to the qword unsigned integers in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 B5 /r VPMADD52HUQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512_IFMA OR AVX10.1 (1) | Multiply unsigned 52-bit integers in zmm2 and zmm3/m512 and add the high 52 bits of the 104-bit product to the qword unsigned integers in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second
operand) with the packed unsigned 52-bit integers in the corresponding elements of the second source operand
(the third operand) to form packed 104-bit intermediate results. The high 52-bit unsigned integer of each 104-bit
product is added to the corresponding qword unsigned integer of the destination operand (the first operand)
under the writemask k1.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM
register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory
location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1 at 64-bit
granularity.
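
For illustration, a hedged C sketch (assuming AVX512_IFMA support and a compiler providing unsigned __int128, e.g., GCC or Clang; operand values are arbitrary examples) that checks one lane of the high-half accumulation against scalar arithmetic:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t a   = 0xABCDEF012345ULL & ((1ULL<<52)-1); /* 52-bit operands */
    uint64_t b   = 0x9F8E7D6C5B4AULL & ((1ULL<<52)-1);
    uint64_t acc = 7;
    __m512i va = _mm512_set1_epi64((long long)a);
    __m512i vb = _mm512_set1_epi64((long long)b);
    __m512i vc = _mm512_set1_epi64((long long)acc);
    /* First argument is the accumulator: result = acc + high52(a*b). */
    __m512i hi = _mm512_madd52hi_epu64(vc, va, vb);
    uint64_t out[8];
    _mm512_storeu_si512(out, hi);
    unsigned __int128 p = (unsigned __int128)a * b;    /* scalar reference */
    uint64_t expect = acc + (uint64_t)((p >> 52) & ((1ULL<<52)-1));
    printf("%llx %llx\n", (unsigned long long)out[0],
                          (unsigned long long)expect); /* should match */
    return 0;
}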

Operation
VPMADD52HUQ srcdest, src1, src2 (VEX version)
VL = (128, 256)
KL = VL/64

FOR i in 0 .. KL-1:
temp128 := ZeroExtend64(src1.qword[i][51:0]) * ZeroExtend64(src2.qword[i][51:0])
srcdest.qword[i] := srcdest.qword[i] + ZeroExtend64(temp128[103:52])
srcdest[MAXVL-1:VL] := 0

VPMADD52HUQ (EVEX encoded)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64;
IF k1[j] OR *no writemask* THEN
IF src2 is Memory AND EVEX.b=1 THEN
tsrc2[63:0] := ZeroExtend64(src2[51:0]);
ELSE
tsrc2[63:0] := ZeroExtend64(src2[i+51:i]);
FI;
Temp128[127:0] := ZeroExtend64(src1[i+51:i]) * tsrc2[63:0];
Temp2[63:0] := DEST[i+63:i] + ZeroExtend64(temp128[103:52]) ;
DEST[i+63:i] := Temp2[63:0];
ELSE
IF *zeroing-masking* THEN
DEST[i+63:i] := 0;
ELSE *merge-masking*
DEST[i+63:i] is unchanged;
FI;
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPMADD52HUQ __m128i _mm_madd52hi_avx_epu64 (__m128i __X, __m128i __Y, __m128i __Z);
VPMADD52HUQ __m128i _mm_maskz_madd52hi_epu64( __mmask8 k, __m128i a, __m128i b, __m128i c);
VPMADD52HUQ __m128i _mm_madd52hi_epu64 (__m128i __X, __m128i __Y, __m128i __Z);
VPMADD52HUQ __m128i _mm_mask_madd52hi_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b, __m128i c);
VPMADD52HUQ __m256i _mm256_madd52hi_avx_epu64 (__m256i __X, __m256i __Y, __m256i __Z);
VPMADD52HUQ __m256i _mm256_madd52hi_epu64( __m256i a, __m256i b, __m256i c);
VPMADD52HUQ __m256i _mm256_mask_madd52hi_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b, __m256i c);
VPMADD52HUQ __m256i _mm256_maskz_madd52hi_epu64( __mmask8 k, __m256i a, __m256i b, __m256i c);
VPMADD52HUQ __m512i _mm512_madd52hi_epu64( __m512i a, __m512i b, __m512i c);
VPMADD52HUQ __m512i _mm512_mask_madd52hi_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b, __m512i c);
VPMADD52HUQ __m512i _mm512_maskz_madd52hi_epu64( __mmask8 k, __m512i a, __m512i b, __m512i c);

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”

VPMADD52LUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products
to Qword Accumulators
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W1 B4 /r VPMADD52LUQ xmm1, xmm2, xmm3/m128 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the low 52 bits of the 104-bit product to the qword unsigned integers in xmm1.
VEX.256.66.0F38.W1 B4 /r VPMADD52LUQ ymm1, ymm2, ymm3/m256 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the low 52 bits of the 104-bit product to the qword unsigned integers in ymm1.
EVEX.128.66.0F38.W1 B4 /r VPMADD52LUQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512_IFMA AND AVX512VL) OR AVX10.1 (1) | Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the low 52 bits of the 104-bit product to the qword unsigned integers in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 B4 /r VPMADD52LUQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512_IFMA AND AVX512VL) OR AVX10.1 (1) | Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the low 52 bits of the 104-bit product to the qword unsigned integers in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 B4 /r VPMADD52LUQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | B | V/V | AVX512_IFMA OR AVX10.1 (1) | Multiply unsigned 52-bit integers in zmm2 and zmm3/m512 and add the low 52 bits of the 104-bit product to the qword unsigned integers in zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m(r) N/A

Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second
operand) with the packed unsigned 52-bit integers in the corresponding elements of the second source operand
(the third operand) to form packed 104-bit intermediate results. The low 52-bit unsigned integer of each 104-bit
product is added to the corresponding qword unsigned integer of the destination operand (the first operand)
under the writemask k1.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM
register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory
location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1 at 64-bit
granularity.
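
As an informal sketch of the common usage pattern (assuming AVX512_IFMA support; the limb values are arbitrary examples), the low- and high-half forms are typically paired to accumulate full 104-bit products of 52-bit limbs into two 64-bit accumulators:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#define MASK52 ((1ULL << 52) - 1)

int main(void) {
    /* Example 52-bit limbs a, b and two 64-bit accumulators lo/hi. */
    __m512i a  = _mm512_set1_epi64(0x000F123456789ABCULL & MASK52);
    __m512i b  = _mm512_set1_epi64(0x0003CBA987654321ULL & MASK52);
    __m512i lo = _mm512_setzero_si512();  /* accumulates low 52-bit partials */
    __m512i hi = _mm512_setzero_si512();  /* accumulates high 52-bit partials */
    lo = _mm512_madd52lo_epu64(lo, a, b); /* lo += (a*b) mod 2^52 */
    hi = _mm512_madd52hi_epu64(hi, a, b); /* hi += (a*b) >> 52 (52 bits) */
    /* With 52-bit partials in 64-bit lanes, several such accumulations fit
       before the accumulators can overflow; carry propagation between limbs
       is then performed once at the end (not shown here). */
    uint64_t l[8], h[8];
    _mm512_storeu_si512(l, lo);
    _mm512_storeu_si512(h, hi);
    printf("lo=%llx hi=%llx\n", (unsigned long long)l[0],
                                (unsigned long long)h[0]);
    return 0;
}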

Operation
VPMADD52LUQ srcdest, src1, src2 (VEX version)
VL = (128, 256)
KL = VL/64

FOR i in 0 .. KL-1:
temp128 := ZeroExtend64(src1.qword[i][51:0]) * ZeroExtend64(src2.qword[i][51:0])
srcdest.qword[i] := srcdest.qword[i] + ZeroExtend64(temp128[51:0])
srcdest[MAXVL-1:VL] := 0

VPMADD52LUQ (EVEX encoded)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64;
IF k1[j] OR *no writemask* THEN
IF src2 is Memory AND EVEX.b=1 THEN
tsrc2[63:0] := ZeroExtend64(src2[51:0]);
ELSE
tsrc2[63:0] := ZeroExtend64(src2[i+51:i]);
FI;
Temp128[127:0] := ZeroExtend64(src1[i+51:i]) * tsrc2[63:0];
Temp2[63:0] := DEST[i+63:i] + ZeroExtend64(temp128[51:0]) ;
DEST[i+63:i] := Temp2[63:0];
ELSE
IF *zeroing-masking* THEN
DEST[i+63:i] := 0;
ELSE *merge-masking*
DEST[i+63:i] is unchanged;
FI;
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VPMADD52LUQ __m128i _mm_madd52lo_avx_epu64 (__m128i __X, __m128i __Y, __m128i __Z);
VPMADD52LUQ __m128i _mm_madd52lo_epu64( __m128i a, __m128i b, __m128i c);
VPMADD52LUQ __m128i _mm_mask_madd52lo_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b, __m128i c);
VPMADD52LUQ __m128i _mm_maskz_madd52lo_epu64( __mmask8 k, __m128i a, __m128i b, __m128i c);
VPMADD52LUQ __m256i _mm256_madd52lo_avx_epu64 (__m256i __X, __m256i __Y, __m256i __Z);
VPMADD52LUQ __m256i _mm256_madd52lo_epu64( __m256i a, __m256i b, __m256i c);
VPMADD52LUQ __m256i _mm256_mask_madd52lo_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b, __m256i c);
VPMADD52LUQ __m256i _mm256_maskz_madd52lo_epu64( __mmask8 k, __m256i a, __m256i b, __m256i c);
VPMADD52LUQ __m512i _mm512_madd52lo_epu64( __m512i a, __m512i b, __m512i c);
VPMADD52LUQ __m512i _mm512_mask_madd52lo_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b, __m512i c);
VPMADD52LUQ __m512i _mm512_maskz_madd52lo_epu64( __mmask8 k, __m512i a, __m512i b, __m512i c);

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”

VPMOVB2M/VPMOVW2M/VPMOVD2M/VPMOVQ2M—Convert a Vector Register to a Mask
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W0 29 /r VPMOVB2M k1, xmm1 | RM | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in XMM1.
EVEX.256.F3.0F38.W0 29 /r VPMOVB2M k1, ymm1 | RM | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in YMM1.
EVEX.512.F3.0F38.W0 29 /r VPMOVB2M k1, zmm1 | RM | V/V | AVX512BW OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in ZMM1.
EVEX.128.F3.0F38.W1 29 /r VPMOVW2M k1, xmm1 | RM | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in XMM1.
EVEX.256.F3.0F38.W1 29 /r VPMOVW2M k1, ymm1 | RM | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in YMM1.
EVEX.512.F3.0F38.W1 29 /r VPMOVW2M k1, zmm1 | RM | V/V | AVX512BW OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in ZMM1.
EVEX.128.F3.0F38.W0 39 /r VPMOVD2M k1, xmm1 | RM | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in XMM1.
EVEX.256.F3.0F38.W0 39 /r VPMOVD2M k1, ymm1 | RM | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in YMM1.
EVEX.512.F3.0F38.W0 39 /r VPMOVD2M k1, zmm1 | RM | V/V | AVX512DQ OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in ZMM1.
EVEX.128.F3.0F38.W1 39 /r VPMOVQ2M k1, xmm1 | RM | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in XMM1.
EVEX.256.F3.0F38.W1 39 /r VPMOVQ2M k1, ymm1 | RM | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in YMM1.
EVEX.512.F3.0F38.W1 39 /r VPMOVQ2M k1, zmm1 | RM | V/V | AVX512DQ OR AVX10.1 (1) | Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in ZMM1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
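A minimal run-time check of this leaf, sketched in C (assumptions for illustration: GCC/Clang's <cpuid.h> helper is available, and the EBX bit layout shown follows the Intel AVX10 architecture specification; a complete check would also confirm AVX10 enumeration via CPUID.(EAX=07H,ECX=01H) and OS state via XGETBV):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* Leaf 24H, sub-leaf 0: Intel AVX10 Converged Vector ISA enumeration.
           __get_cpuid_count() returns 0 if the leaf is not supported. */
        if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 24H not enumerated");
            return 0;
        }
        unsigned int version = ebx & 0xFF; /* assumed: EBX[7:0] = AVX10 version */
        /* assumed per the AVX10 spec: EBX[18] = 512-bit vector width supported */
        printf("AVX10 version %u, 512-bit vectors: %s\n",
               version, (ebx & (1u << 18)) ? "yes" : "no");
        return 0;
    }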

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a vector register to a mask register. Each bit in the destination mask register is set to 1 or 0 depending on
the value of the most significant bit of the corresponding element in the source register.

The source operand is a ZMM/YMM/XMM register. The destination operand is a mask register.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVB2M (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF SRC[i+7]
THEN DEST[j] := 1
ELSE DEST[j] := 0
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPMOVW2M (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF SRC[i+15]
THEN DEST[j] := 1
ELSE DEST[j] := 0
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPMOVD2M (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF SRC[i+31]
THEN DEST[j] := 1
ELSE DEST[j] := 0
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPMOVQ2M (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF SRC[i+63]
THEN DEST[j] := 1
ELSE DEST[j] := 0
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVB2M __mmask64 _mm512_movepi8_mask( __m512i );
VPMOVD2M __mmask16 _mm512_movepi32_mask( __m512i );
VPMOVQ2M __mmask8 _mm512_movepi64_mask( __m512i );
VPMOVW2M __mmask32 _mm512_movepi16_mask( __m512i );
VPMOVB2M __mmask32 _mm256_movepi8_mask( __m256i );
VPMOVD2M __mmask8 _mm256_movepi32_mask( __m256i );
VPMOVQ2M __mmask8 _mm256_movepi64_mask( __m256i );
VPMOVW2M __mmask16 _mm256_movepi16_mask( __m256i );
VPMOVB2M __mmask16 _mm_movepi8_mask( __m128i );
VPMOVD2M __mmask8 _mm_movepi32_mask( __m128i );
VPMOVQ2M __mmask8 _mm_movepi64_mask( __m128i );
VPMOVW2M __mmask8 _mm_movepi16_mask( __m128i );
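As a usage sketch (illustrative only; assumes a compiler with AVX512BW and AVX512VL enabled, e.g., -mavx512bw -mavx512vl), extracting the byte sign bits into a mask, the AVX-512 counterpart of the classic _mm256_movemask_epi8():

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256i v = _mm256_set1_epi8(-1);       /* all bytes negative */
        __mmask32 m = _mm256_movepi8_mask(v);   /* VPMOVB2M k, ymm */
        printf("mask = 0x%08x\n", (unsigned)m); /* prints 0xffffffff */
        return 0;
    }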

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 31 /r VPMOVDB xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed double-word integers from xmm2 into 4 packed byte integers in xmm1/m32 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 21 /r VPMOVSDB xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed signed double-word integers from xmm2 into 4 packed signed byte integers in xmm1/m32 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed unsigned double-word integers from xmm2 into 4 packed unsigned byte integers in xmm1/m32 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 31 /r VPMOVDB xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed double-word integers from ymm2 into 8 packed byte integers in xmm1/m64 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 21 /r VPMOVSDB xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed signed double-word integers from ymm2 into 8 packed signed byte integers in xmm1/m64 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned double-word integers from ymm2 into 8 packed unsigned byte integers in xmm1/m64 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 31 /r VPMOVDB xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed double-word integers from zmm2 into 16 packed byte integers in xmm1/m128 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 21 /r VPMOVSDB xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed signed double-word integers from zmm2 into 16 packed signed byte integers in xmm1/m128 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed unsigned double-word integers from zmm2 into 16 packed unsigned byte integers in xmm1/m128 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Quarter Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVDB down converts 32-bit integer elements in the source operand (the second operand) into packed bytes
using truncation. VPMOVSDB converts signed 32-bit integers into packed signed bytes using signed saturation.
VPMOVUSDB converts unsigned double-word values into unsigned byte values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a XMM register or a 128/64/32-bit
memory location.
Down-converted byte elements are written to the destination operand (the first operand) from the least-significant
byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAXVL-1:128/64/32)
of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVDB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateDoubleWordToByte (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVDB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateDoubleWordToByte (SRC[m+31:m])
ELSE *DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSDB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedDoubleWordToByte (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVSDB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedDoubleWordToByte (SRC[m+31:m])
ELSE *DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSDB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedDoubleWordToByte (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVUSDB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedDoubleWordToByte (SRC[m+31:m])
ELSE *DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVDB __m128i _mm512_cvtepi32_epi8( __m512i a);
VPMOVDB __m128i _mm512_mask_cvtepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVDB __m128i _mm512_maskz_cvtepi32_epi8( __mmask16 k, __m512i a);
VPMOVDB void _mm512_mask_cvtepi32_storeu_epi8(void * d, __mmask16 k, __m512i a);
VPMOVSDB __m128i _mm512_cvtsepi32_epi8( __m512i a);
VPMOVSDB __m128i _mm512_mask_cvtsepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVSDB __m128i _mm512_maskz_cvtsepi32_epi8( __mmask16 k, __m512i a);
VPMOVSDB void _mm512_mask_cvtsepi32_storeu_epi8(void * d, __mmask16 k, __m512i a);
VPMOVUSDB __m128i _mm512_cvtusepi32_epi8( __m512i a);
VPMOVUSDB __m128i _mm512_mask_cvtusepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVUSDB __m128i _mm512_maskz_cvtusepi32_epi8( __mmask16 k, __m512i a);
VPMOVUSDB void _mm512_mask_cvtusepi32_storeu_epi8(void * d, __mmask16 k, __m512i a);
VPMOVUSDB __m128i _mm256_cvtusepi32_epi8(__m256i a);
VPMOVUSDB __m128i _mm256_mask_cvtusepi32_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVUSDB __m128i _mm256_maskz_cvtusepi32_epi8( __mmask8 k, __m256i b);
VPMOVUSDB void _mm256_mask_cvtusepi32_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVUSDB __m128i _mm_cvtusepi32_epi8(__m128i a);
VPMOVUSDB __m128i _mm_mask_cvtusepi32_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVUSDB __m128i _mm_maskz_cvtusepi32_epi8( __mmask8 k, __m128i b);
VPMOVUSDB void _mm_mask_cvtusepi32_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVSDB __m128i _mm256_cvtsepi32_epi8(__m256i a);
VPMOVSDB __m128i _mm256_mask_cvtsepi32_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVSDB __m128i _mm256_maskz_cvtsepi32_epi8( __mmask8 k, __m256i b);
VPMOVSDB void _mm256_mask_cvtsepi32_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVSDB __m128i _mm_cvtsepi32_epi8(__m128i a);
VPMOVSDB __m128i _mm_mask_cvtsepi32_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVSDB __m128i _mm_maskz_cvtsepi32_epi8( __mmask8 k, __m128i b);
VPMOVSDB void _mm_mask_cvtsepi32_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVDB __m128i _mm256_cvtepi32_epi8(__m256i a);
VPMOVDB __m128i _mm256_mask_cvtepi32_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVDB __m128i _mm256_maskz_cvtepi32_epi8( __mmask8 k, __m256i b);
VPMOVDB void _mm256_mask_cvtepi32_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVDB __m128i _mm_cvtepi32_epi8(__m128i a);
VPMOVDB __m128i _mm_mask_cvtepi32_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVDB __m128i _mm_maskz_cvtepi32_epi8( __mmask8 k, __m128i b);
VPMOVDB void _mm_mask_cvtepi32_storeu_epi8(void * , __mmask8 k, __m128i b);
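A short illustrative sketch contrasting truncation with unsigned saturation when narrowing dwords to bytes (assumes AVX512F enabled, e.g., -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i v = _mm512_set1_epi32(530);     /* 0x212 in every lane */
        __m128i t = _mm512_cvtepi32_epi8(v);    /* VPMOVDB: keeps low byte, 0x12 */
        __m128i s = _mm512_cvtusepi32_epi8(v);  /* VPMOVUSDB: clamps to 255 */
        uint8_t a[16], b[16];
        _mm_storeu_si128((__m128i *)a, t);
        _mm_storeu_si128((__m128i *)b, s);
        printf("truncated: %u, saturated: %u\n", a[0], b[0]); /* 18, 255 */
        return 0;
    }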

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 33 /r VPMOVDW xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed double-word integers from xmm2 into 4 packed word integers in xmm1/m64 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 23 /r VPMOVSDW xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed signed double-word integers from xmm2 into 4 packed signed word integers in xmm1/m64 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 13 /r VPMOVUSDW xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed unsigned double-word integers from xmm2 into 4 packed unsigned word integers in xmm1/m64 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 33 /r VPMOVDW xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed double-word integers from ymm2 into 8 packed word integers in xmm1/m128 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 23 /r VPMOVSDW xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed signed double-word integers from ymm2 into 8 packed signed word integers in xmm1/m128 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 13 /r VPMOVUSDW xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned double-word integers from ymm2 into 8 packed unsigned word integers in xmm1/m128 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 33 /r VPMOVDW ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed double-word integers from zmm2 into 16 packed word integers in ymm1/m256 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 23 /r VPMOVSDW ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed signed double-word integers from zmm2 into 16 packed signed word integers in ymm1/m256 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 13 /r VPMOVUSDW ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 16 packed unsigned double-word integers from zmm2 into 16 packed unsigned word integers in ymm1/m256 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVDW down converts 32-bit integer elements in the source operand (the second operand) into packed words
using truncation. VPMOVSDW converts signed 32-bit integers into packed signed words using signed saturation.
VPMOVUSDW converts unsigned double-word values into unsigned word values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a
256/128/64-bit memory location.
Down-converted word elements are written to the destination operand (the first operand) from the least-significant
word. Word elements of the destination operand are updated according to the writemask. Bits (MAXVL-
1:256/128/64) of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVDW instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateDoubleWordToWord (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVDW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateDoubleWordToWord (SRC[m+31:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSDW instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateSignedDoubleWordToWord (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0

FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVSDW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateSignedDoubleWordToWord (SRC[m+31:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSDW instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateUnsignedDoubleWordToWord (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVUSDW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateUnsignedDoubleWordToWord (SRC[m+31:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVDW __m256i _mm512_cvtepi32_epi16( __m512i a);
VPMOVDW __m256i _mm512_mask_cvtepi32_epi16(__m256i s, __mmask16 k, __m512i a);
VPMOVDW __m256i _mm512_maskz_cvtepi32_epi16( __mmask16 k, __m512i a);
VPMOVDW void _mm512_mask_cvtepi32_storeu_epi16(void * d, __mmask16 k, __m512i a);
VPMOVSDW __m256i _mm512_cvtsepi32_epi16( __m512i a);
VPMOVSDW __m256i _mm512_mask_cvtsepi32_epi16(__m256i s, __mmask16 k, __m512i a);
VPMOVSDW __m256i _mm512_maskz_cvtsepi32_epi16( __mmask16 k, __m512i a);
VPMOVSDW void _mm512_mask_cvtsepi32_storeu_epi16(void * d, __mmask16 k, __m512i a);
VPMOVUSDW __m256i _mm512_cvtusepi32_epi16( __m512i a);
VPMOVUSDW __m256i _mm512_mask_cvtusepi32_epi16(__m256i s, __mmask16 k, __m512i a);
VPMOVUSDW __m256i _mm512_maskz_cvtusepi32_epi16( __mmask16 k, __m512i a);
VPMOVUSDW void _mm512_mask_cvtusepi32_storeu_epi16(void * d, __mmask16 k, __m512i a);
VPMOVUSDW __m128i _mm256_cvtusepi32_epi16(__m256i a);
VPMOVUSDW __m128i _mm256_mask_cvtusepi32_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVUSDW __m128i _mm256_maskz_cvtusepi32_epi16( __mmask8 k, __m256i b);
VPMOVUSDW void _mm256_mask_cvtusepi32_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVUSDW __m128i _mm_cvtusepi32_epi16(__m128i a);
VPMOVUSDW __m128i _mm_mask_cvtusepi32_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVUSDW __m128i _mm_maskz_cvtusepi32_epi16( __mmask8 k, __m128i b);
VPMOVUSDW void _mm_mask_cvtusepi32_storeu_epi16(void * , __mmask8 k, __m128i b);
VPMOVSDW __m128i _mm256_cvtsepi32_epi16(__m256i a);
VPMOVSDW __m128i _mm256_mask_cvtsepi32_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVSDW __m128i _mm256_maskz_cvtsepi32_epi16( __mmask8 k, __m256i b);
VPMOVSDW void _mm256_mask_cvtsepi32_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVSDW __m128i _mm_cvtsepi32_epi16(__m128i a);
VPMOVSDW __m128i _mm_mask_cvtsepi32_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVSDW __m128i _mm_maskz_cvtsepi32_epi16( __mmask8 k, __m128i b);
VPMOVSDW void _mm_mask_cvtsepi32_storeu_epi16(void * , __mmask8 k, __m128i b);
VPMOVDW __m128i _mm256_cvtepi32_epi16(__m256i a);
VPMOVDW __m128i _mm256_mask_cvtepi32_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVDW __m128i _mm256_maskz_cvtepi32_epi16( __mmask8 k, __m256i b);
VPMOVDW void _mm256_mask_cvtepi32_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVDW __m128i _mm_cvtepi32_epi16(__m128i a);
VPMOVDW __m128i _mm_mask_cvtepi32_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVDW __m128i _mm_maskz_cvtepi32_epi16( __mmask8 k, __m128i b);
VPMOVDW void _mm_mask_cvtepi32_storeu_epi16(void * , __mmask8 k, __m128i b);
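As a usage sketch of the signed-saturating form (illustrative only; assumes AVX512F enabled, e.g., -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i v = _mm512_set1_epi32(100000);  /* exceeds INT16_MAX */
        __m256i w = _mm512_cvtsepi32_epi16(v);  /* VPMOVSDW saturates to 32767 */
        int16_t out[16];
        _mm256_storeu_si256((__m256i *)out, w);
        printf("%d\n", out[0]);                 /* prints 32767 */
        return 0;
    }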

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q—Convert a Mask Register to a Vector Register

Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 28 /r VPMOVM2B xmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Sets each byte in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.256.F3.0F38.W0 28 /r VPMOVM2B ymm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Sets each byte in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.512.F3.0F38.W0 28 /r VPMOVM2B zmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: AVX512BW OR AVX10.1 (see Note 1).
  Sets each byte in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.128.F3.0F38.W1 28 /r VPMOVM2W xmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Sets each word in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.256.F3.0F38.W1 28 /r VPMOVM2W ymm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Sets each word in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.512.F3.0F38.W1 28 /r VPMOVM2W zmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: AVX512BW OR AVX10.1 (see Note 1).
  Sets each word in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.128.F3.0F38.W0 38 /r VPMOVM2D xmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Sets each doubleword in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.256.F3.0F38.W0 38 /r VPMOVM2D ymm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Sets each doubleword in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.512.F3.0F38.W0 38 /r VPMOVM2D zmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: AVX512DQ OR AVX10.1 (see Note 1).
  Sets each doubleword in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.128.F3.0F38.W1 38 /r VPMOVM2Q xmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Sets each quadword in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.256.F3.0F38.W1 38 /r VPMOVM2Q ymm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
  Sets each quadword in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.
EVEX.512.F3.0F38.W1 38 /r VPMOVM2Q zmm1, k1
  Op/En: RM. 64/32-bit Mode: V/V. CPUID: AVX512DQ OR AVX10.1 (see Note 1).
  Sets each quadword in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
Converts a mask register to a vector register. Each element in the destination register is set to all 1’s or all 0’s
depending on the value of the corresponding bit in the source mask register.
The source operand is a mask register. The destination operand is a ZMM/YMM/XMM register.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVM2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF SRC[j]
THEN DEST[i+7:i] := -1
ELSE DEST[i+7:i] := 0
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVM2W (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF SRC[j]
THEN DEST[i+15:i] := -1
ELSE DEST[i+15:i] := 0
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVM2D (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF SRC[j]
THEN DEST[i+31:i] := -1
ELSE DEST[i+31:i] := 0
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPMOVM2Q (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF SRC[j]
THEN DEST[i+63:i] := -1
ELSE DEST[i+63:i] := 0
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVM2B __m512i _mm512_movm_epi8(__mmask64 );
VPMOVM2D __m512i _mm512_movm_epi32(__mmask16 );
VPMOVM2Q __m512i _mm512_movm_epi64(__mmask8 );
VPMOVM2W __m512i _mm512_movm_epi16(__mmask32 );
VPMOVM2B __m256i _mm256_movm_epi8(__mmask32 );
VPMOVM2D __m256i _mm256_movm_epi32(__mmask8 );
VPMOVM2Q __m256i _mm256_movm_epi64(__mmask8 );
VPMOVM2W __m256i _mm256_movm_epi16(__mmask16 );
VPMOVM2B __m128i _mm_movm_epi8(__mmask16 );
VPMOVM2D __m128i _mm_movm_epi32(__mmask8 );
VPMOVM2Q __m128i _mm_movm_epi64(__mmask8 );
VPMOVM2W __m128i _mm_movm_epi16(__mmask8 );
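As a usage sketch (illustrative only; assumes AVX512BW enabled, e.g., -mavx512bw), expanding a compare mask back into a byte-wise 0x00/0xFF vector, for example to feed code that still expects SSE/AVX-style boolean vectors:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i a = _mm512_set1_epi8(5), b = _mm512_set1_epi8(3);
        __mmask64 k = _mm512_cmpgt_epi8_mask(a, b); /* all 64 bits set */
        __m512i boolvec = _mm512_movm_epi8(k);      /* VPMOVM2B: each byte -> 0xFF */
        uint8_t out[64];
        _mm512_storeu_si512(out, boolvec);
        printf("0x%02x\n", out[0]);                 /* prints 0xff */
        return 0;
    }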

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 32 /r VPMOVQB xmm1/m16 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed quad-word integers from xmm2 into 2 packed byte integers in xmm1/m16 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 22 /r VPMOVSQB xmm1/m16 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed byte integers in xmm1/m16 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m16 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned byte integers in xmm1/m16 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 32 /r VPMOVQB xmm1/m32 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed quad-word integers from ymm2 into 4 packed byte integers in xmm1/m32 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 22 /r VPMOVSQB xmm1/m32 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed byte integers in xmm1/m32 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m32 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned byte integers in xmm1/m32 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 32 /r VPMOVQB xmm1/m64 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed quad-word integers from zmm2 into 8 packed byte integers in xmm1/m64 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 22 /r VPMOVSQB xmm1/m64 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed byte integers in xmm1/m64 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m64 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned byte integers in xmm1/m64 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Eighth Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVQB down converts 64-bit integer elements in the source operand (the second operand) into packed byte
elements using truncation. VPMOVSQB converts signed 64-bit integers into packed signed bytes using signed satu-
ration. VPMOVUSQB converts unsigned quad-word values into unsigned byte values using unsigned saturation. The
source operand is a vector register. The destination operand is an XMM register or a memory location.
Down-converted byte elements are written to the destination operand (the first operand) from the least-significant
byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAXVL-1:64) of the
destination are zeroed.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVQB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateQuadWordToByte (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/8] := 0;

VPMOVQB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateQuadWordToByte (SRC[m+63:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSQB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedQuadWordToByte (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI

FI;
ENDFOR
DEST[MAXVL-1:VL/8] := 0;

VPMOVSQB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedQuadWordToByte (SRC[m+63:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSQB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedQuadWordToByte (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/8] := 0;

VPMOVUSQB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedQuadWordToByte (SRC[m+63:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVQB __m128i _mm512_cvtepi64_epi8( __m512i a);
VPMOVQB __m128i _mm512_mask_cvtepi64_epi8(__m128i s, __mmask8 k, __m512i a);
VPMOVQB __m128i _mm512_maskz_cvtepi64_epi8( __mmask8 k, __m512i a);
VPMOVQB void _mm512_mask_cvtepi64_storeu_epi8(void * d, __mmask8 k, __m512i a);
VPMOVSQB __m128i _mm512_cvtsepi64_epi8( __m512i a);
VPMOVSQB __m128i _mm512_mask_cvtsepi64_epi8(__m128i s, __mmask8 k, __m512i a);
VPMOVSQB __m128i _mm512_maskz_cvtsepi64_epi8( __mmask8 k, __m512i a);
VPMOVSQB void _mm512_mask_cvtsepi64_storeu_epi8(void * d, __mmask8 k, __m512i a);
VPMOVUSQB __m128i _mm512_cvtusepi64_epi8( __m512i a);
VPMOVUSQB __m128i _mm512_mask_cvtusepi64_epi8(__m128i s, __mmask8 k, __m512i a);
VPMOVUSQB __m128i _mm512_maskz_cvtusepi64_epi8( __mmask8 k, __m512i a);
VPMOVUSQB void _mm512_mask_cvtusepi64_storeu_epi8(void * d, __mmask8 k, __m512i a);
VPMOVUSQB __m128i _mm256_cvtusepi64_epi8(__m256i a);
VPMOVUSQB __m128i _mm256_mask_cvtusepi64_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVUSQB __m128i _mm256_maskz_cvtusepi64_epi8( __mmask8 k, __m256i b);
VPMOVUSQB void _mm256_mask_cvtusepi64_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVUSQB __m128i _mm_cvtusepi64_epi8(__m128i a);
VPMOVUSQB __m128i _mm_mask_cvtusepi64_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVUSQB __m128i _mm_maskz_cvtusepi64_epi8( __mmask8 k, __m128i b);
VPMOVUSQB void _mm_mask_cvtusepi64_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVSQB __m128i _mm256_cvtsepi64_epi8(__m256i a);
VPMOVSQB __m128i _mm256_mask_cvtsepi64_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVSQB __m128i _mm256_maskz_cvtsepi64_epi8( __mmask8 k, __m256i b);
VPMOVSQB void _mm256_mask_cvtsepi64_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVSQB __m128i _mm_cvtsepi64_epi8(__m128i a);
VPMOVSQB __m128i _mm_mask_cvtsepi64_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVSQB __m128i _mm_maskz_cvtsepi64_epi8( __mmask8 k, __m128i b);
VPMOVSQB void _mm_mask_cvtsepi64_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVQB __m128i _mm256_cvtepi64_epi8(__m256i a);
VPMOVQB __m128i _mm256_mask_cvtepi64_epi8(__m128i a, __mmask8 k, __m256i b);
VPMOVQB __m128i _mm256_maskz_cvtepi64_epi8( __mmask8 k, __m256i b);
VPMOVQB void _mm256_mask_cvtepi64_storeu_epi8(void * , __mmask8 k, __m256i b);
VPMOVQB __m128i _mm_cvtepi64_epi8(__m128i a);
VPMOVQB __m128i _mm_mask_cvtepi64_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVQB __m128i _mm_maskz_cvtepi64_epi8( __mmask8 k, __m128i b);
VPMOVQB void _mm_mask_cvtepi64_storeu_epi8(void * , __mmask8 k, __m128i b);
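As a usage sketch of the unsigned-saturating form (illustrative only; assumes AVX512F enabled, e.g., -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i v = _mm512_set1_epi64(1000);    /* too large for a byte */
        __m128i b = _mm512_cvtusepi64_epi8(v);  /* VPMOVUSQB: each lane -> 255 */
        uint8_t out[16];                        /* results occupy the low 8 bytes */
        _mm_storeu_si128((__m128i *)out, b);
        printf("%u\n", out[0]);                 /* prints 255 */
        return 0;
    }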

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVQD/VPMOVSQD/VPMOVUSQD—Down Convert QWord to DWord
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 35 /r VPMOVQD xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed quad-word integers from xmm2 into 2 packed double-word integers in xmm1/m64 with truncation subject to writemask k1.
EVEX.128.F3.0F38.W0 25 /r VPMOVSQD xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed double-word integers in xmm1/m64 using signed saturation subject to writemask k1.
EVEX.128.F3.0F38.W0 15 /r VPMOVUSQD xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned double-word integers in xmm1/m64 using unsigned saturation subject to writemask k1.
EVEX.256.F3.0F38.W0 35 /r VPMOVQD xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed quad-word integers from ymm2 into 4 packed double-word integers in xmm1/m128 with truncation subject to writemask k1.
EVEX.256.F3.0F38.W0 25 /r VPMOVSQD xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed double-word integers in xmm1/m128 using signed saturation subject to writemask k1.
EVEX.256.F3.0F38.W0 15 /r VPMOVUSQD xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned double-word integers in xmm1/m128 using unsigned saturation subject to writemask k1.
EVEX.512.F3.0F38.W0 35 /r VPMOVQD ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed quad-word integers from zmm2 into 8 packed double-word integers in ymm1/m256 with truncation subject to writemask k1.
EVEX.512.F3.0F38.W0 25 /r VPMOVSQD ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed double-word integers in ymm1/m256 using signed saturation subject to writemask k1.
EVEX.512.F3.0F38.W0 15 /r VPMOVUSQD ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned double-word integers in ymm1/m256 using unsigned saturation subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVQD down converts 64-bit integer elements in the source operand (the second operand) into packed double-
words using truncation. VPMOVSQD converts signed 64-bit integers into packed signed doublewords using signed
saturation. VPMOVUSQD converts unsigned quad-word values into unsigned double-word values using unsigned
saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a
256/128/64-bit memory location.
Down-converted doubleword elements are written to the destination operand (the first operand) from the least-
significant doubleword. Doubleword elements of the destination operand are updated according to the writemask.
Bits (MAXVL-1:256/128/64) of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVQD instruction (EVEX encoded version) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TruncateQuadWordToDWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVQD instruction (EVEX encoded version) memory form


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TruncateQuadWordToDWord (SRC[m+63:m])
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSQD instruction (EVEX encoded version) reg-reg form


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SaturateSignedQuadWordToDWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR

DEST[MAXVL-1:VL/2] := 0;

VPMOVSQD instruction (EVEX encoded version) memory form


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SaturateSignedQuadWordToDWord (SRC[m+63:m])
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSQD instruction (EVEX encoded version) reg-reg form


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SaturateUnsignedQuadWordToDWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVUSQD instruction (EVEX encoded version) memory form


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SaturateUnsignedQuadWordToDWord (SRC[m+63:m])
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVQD __m256i _mm512_cvtepi64_epi32( __m512i a);
VPMOVQD __m256i _mm512_mask_cvtepi64_epi32(__m256i s, __mmask8 k, __m512i a);
VPMOVQD __m256i _mm512_maskz_cvtepi64_epi32( __mmask8 k, __m512i a);
VPMOVQD void _mm512_mask_cvtepi64_storeu_epi32(void * d, __mmask8 k, __m512i a);
VPMOVSQD __m256i _mm512_cvtsepi64_epi32( __m512i a);
VPMOVSQD __m256i _mm512_mask_cvtsepi64_epi32(__m256i s, __mmask8 k, __m512i a);
VPMOVSQD __m256i _mm512_maskz_cvtsepi64_epi32( __mmask8 k, __m512i a);
VPMOVSQD void _mm512_mask_cvtsepi64_storeu_epi32(void * d, __mmask8 k, __m512i a);
VPMOVUSQD __m256i _mm512_cvtusepi64_epi32( __m512i a);
VPMOVUSQD __m256i _mm512_mask_cvtusepi64_epi32(__m256i s, __mmask8 k, __m512i a);
VPMOVUSQD __m256i _mm512_maskz_cvtusepi64_epi32( __mmask8 k, __m512i a);
VPMOVUSQD void _mm512_mask_cvtusepi64_storeu_epi32(void * d, __mmask8 k, __m512i a);
VPMOVUSQD __m128i _mm256_cvtusepi64_epi32(__m256i a);
VPMOVUSQD __m128i _mm256_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m256i b);
VPMOVUSQD __m128i _mm256_maskz_cvtusepi64_epi32( __mmask8 k, __m256i b);
VPMOVUSQD void _mm256_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m256i b);
VPMOVUSQD __m128i _mm_cvtusepi64_epi32(__m128i a);
VPMOVUSQD __m128i _mm_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVUSQD __m128i _mm_maskz_cvtusepi64_epi32( __mmask8 k, __m128i b);
VPMOVUSQD void _mm_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m128i b);
VPMOVSQD __m128i _mm256_cvtsepi64_epi32(__m256i a);
VPMOVSQD __m128i _mm256_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m256i b);
VPMOVSQD __m128i _mm256_maskz_cvtsepi64_epi32( __mmask8 k, __m256i b);
VPMOVSQD void _mm256_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m256i b);
VPMOVSQD __m128i _mm_cvtsepi64_epi32(__m128i a);
VPMOVSQD __m128i _mm_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVSQD __m128i _mm_maskz_cvtsepi64_epi32( __mmask8 k, __m128i b);
VPMOVSQD void _mm_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m128i b);
VPMOVQD __m128i _mm256_cvtepi64_epi32(__m256i a);
VPMOVQD __m128i _mm256_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m256i b);
VPMOVQD __m128i _mm256_maskz_cvtepi64_epi32( __mmask8 k, __m256i b);
VPMOVQD void _mm256_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m256i b);
VPMOVQD __m128i _mm_cvtepi64_epi32(__m128i a);
VPMOVQD __m128i _mm_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVQD __m128i _mm_maskz_cvtepi64_epi32( __mmask8 k, __m128i b);
VPMOVQD void _mm_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m128i b);
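As a usage sketch of the memory-destination form under a writemask (illustrative only; assumes AVX512F enabled, e.g., -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i v = _mm512_set1_epi64(0x1122334455667788LL);
        uint32_t out[8] = {0};
        /* VPMOVQD to memory: truncate 8 qwords to dwords, storing only
           the 4 lanes selected by the mask 0x0F. */
        _mm512_mask_cvtepi64_storeu_epi32(out, 0x0F, v);
        printf("%#x %#x\n", (unsigned)out[0], (unsigned)out[7]); /* 0x55667788 0 */
        return 0;
    }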

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 34 /r VPMOVQW xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed quad-word integers from xmm2 into 2 packed word integers in xmm1/m32 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 24 /r VPMOVSQW xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed word integers in xmm1/m32 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m32 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned word integers in xmm1/m32 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 34 /r VPMOVQW xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed quad-word integers from ymm2 into 4 packed word integers in xmm1/m64 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 24 /r VPMOVSQW xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed word integers in xmm1/m64 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m64 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
  Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned word integers in xmm1/m64 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 34 /r VPMOVQW xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed quad-word integers from zmm2 into 8 packed word integers in xmm1/m128 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 24 /r VPMOVSQW xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed word integers in xmm1/m128 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m128 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512F OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned word integers in xmm1/m128 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Quarter Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVQW down converts 64-bit integer elements in the source operand (the second operand) into packed words
using truncation. VPMOVSQW converts signed 64-bit integers into packed signed words using signed saturation.
VPMOVUSQW converts unsigned quad-word values into unsigned word values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a XMM register or a 128/64/32-bit
memory location.
Down-converted word elements are written to the destination operand (the first operand) from the least-significant
word. Word elements of the destination operand are updated according to the writemask. Bits (MAXVL-
1:128/64/32) of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVQW instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateQuadWordToWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVQW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateQuadWordToWord (SRC[m+63:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSQW instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateSignedQuadWordToWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0

FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVSQW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateSignedQuadWordToWord (SRC[m+63:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSQW instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateUnsignedQuadWordToWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;

VPMOVUSQW instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateUnsignedQuadWordToWord (SRC[m+63:m])
ELSE
*DEST[i+15:i] remains unchanged* ; merging-masking
FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVQW __m128i _mm512_cvtepi64_epi16( __m512i a);
VPMOVQW __m128i _mm512_mask_cvtepi64_epi16(__m128i s, __mmask8 k, __m512i a);
VPMOVQW __m128i _mm512_maskz_cvtepi64_epi16( __mmask8 k, __m512i a);
VPMOVQW void _mm512_mask_cvtepi64_storeu_epi16(void * d, __mmask8 k, __m512i a);
VPMOVSQW __m128i _mm512_cvtsepi64_epi16( __m512i a);
VPMOVSQW __m128i _mm512_mask_cvtsepi64_epi16(__m128i s, __mmask8 k, __m512i a);
VPMOVSQW __m128i _mm512_maskz_cvtsepi64_epi16( __mmask8 k, __m512i a);
VPMOVSQW void _mm512_mask_cvtsepi64_storeu_epi16(void * d, __mmask8 k, __m512i a);
VPMOVUSQW __m128i _mm512_cvtusepi64_epi16( __m512i a);
VPMOVUSQW __m128i _mm512_mask_cvtusepi64_epi16(__m128i s, __mmask8 k, __m512i a);
VPMOVUSQW __m128i _mm512_maskz_cvtusepi64_epi16( __mmask8 k, __m512i a);
VPMOVUSQW void _mm512_mask_cvtusepi64_storeu_epi16(void * d, __mmask8 k, __m512i a);
VPMOVUSQW __m128i _mm256_cvtusepi64_epi16(__m256i a);
VPMOVUSQW __m128i _mm256_mask_cvtusepi64_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVUSQW __m128i _mm256_maskz_cvtusepi64_epi16( __mmask8 k, __m256i b);
VPMOVUSQW void _mm256_mask_cvtusepi64_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVUSQW __m128i _mm_cvtusepi64_epi16(__m128i a);
VPMOVUSQW __m128i _mm_mask_cvtusepi64_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVUSQW __m128i _mm_maskz_cvtusepi64_epi16( __mmask8 k, __m128i b);
VPMOVUSQW void _mm_mask_cvtusepi64_storeu_epi16(void * , __mmask8 k, __m128i b);
VPMOVSQW __m128i _mm256_cvtsepi64_epi16(__m256i a);
VPMOVSQW __m128i _mm256_mask_cvtsepi64_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVSQW __m128i _mm256_maskz_cvtsepi64_epi16( __mmask8 k, __m256i b);
VPMOVSQW void _mm256_mask_cvtsepi64_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVSQW __m128i _mm_cvtsepi64_epi16(__m128i a);
VPMOVSQW __m128i _mm_mask_cvtsepi64_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVSQW __m128i _mm_maskz_cvtsepi64_epi16( __mmask8 k, __m128i b);
VPMOVSQW void _mm_mask_cvtsepi64_storeu_epi16(void * , __mmask8 k, __m128i b);
VPMOVQW __m128i _mm256_cvtepi64_epi16(__m256i a);
VPMOVQW __m128i _mm256_mask_cvtepi64_epi16(__m128i a, __mmask8 k, __m256i b);
VPMOVQW __m128i _mm256_maskz_cvtepi64_epi16( __mmask8 k, __m256i b);
VPMOVQW void _mm256_mask_cvtepi64_storeu_epi16(void * , __mmask8 k, __m256i b);
VPMOVQW __m128i _mm_cvtepi64_epi16(__m128i a);
VPMOVQW __m128i _mm_mask_cvtepi64_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVQW __m128i _mm_maskz_cvtepi64_epi16( __mmask8 k, __m128i b);
VPMOVQW void _mm_mask_cvtepi64_storeu_epi16(void * , __mmask8 k, __m128i b);
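As a usage sketch of the signed-saturating form (illustrative only; assumes AVX512F enabled, e.g., -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512i v = _mm512_set1_epi64(-1000000); /* below INT16_MIN */
        __m128i w = _mm512_cvtsepi64_epi16(v);   /* VPMOVSQW saturates to -32768 */
        int16_t out[8];
        _mm_storeu_si128((__m128i *)out, w);
        printf("%d\n", out[0]);                  /* prints -32768 */
        return 0;
    }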

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.

VPMOVWB/VPMOVSWB/VPMOVUSWB—Down Convert Word to Byte
Opcode/Instruction, Op/En, 64/32-bit Mode Support, CPUID Feature Flag, and Description:

EVEX.128.F3.0F38.W0 30 /r VPMOVWB xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 8 packed word integers from xmm2 into 8 packed bytes in xmm1/m64 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 20 /r VPMOVSWB xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 8 packed signed word integers from xmm2 into 8 packed signed bytes in xmm1/m64 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 10 /r VPMOVUSWB xmm1/m64 {k1}{z}, xmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 8 packed unsigned word integers from xmm2 into 8 packed unsigned bytes in xmm1/m64 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 30 /r VPMOVWB xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 16 packed word integers from ymm2 into 16 packed bytes in xmm1/m128 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 20 /r VPMOVSWB xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 16 packed signed word integers from ymm2 into 16 packed signed bytes in xmm1/m128 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 10 /r VPMOVUSWB xmm1/m128 {k1}{z}, ymm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: (AVX512VL AND AVX512BW) OR AVX10.1 (see Note 1).
  Converts 16 packed unsigned word integers from ymm2 into 16 packed unsigned bytes in xmm1/m128 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 30 /r VPMOVWB ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512BW OR AVX10.1 (see Note 1).
  Converts 32 packed word integers from zmm2 into 32 packed bytes in ymm1/m256 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 20 /r VPMOVSWB ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512BW OR AVX10.1 (see Note 1).
  Converts 32 packed signed word integers from zmm2 into 32 packed signed bytes in ymm1/m256 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 10 /r VPMOVUSWB ymm1/m256 {k1}{z}, zmm2
  Op/En: A. 64/32-bit Mode: V/V. CPUID: AVX512BW OR AVX10.1 (see Note 1).
  Converts 32 packed unsigned word integers from zmm2 into 32 packed unsigned bytes in ymm1/m256 using unsigned saturation under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Half Mem ModRM:r/m (w) ModRM:reg (r) N/A N/A

Description
VPMOVWB down converts 16-bit integers into packed bytes using truncation. VPMOVSWB converts signed 16-bit
integers into packed signed bytes using signed saturation. VPMOVUSWB converts unsigned word values into
unsigned byte values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a
256/128/64-bit memory location.

Down-converted byte elements are written to the destination operand (the first operand) from the least-significant
byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAXVL-
1:256/128/64) of the register destination are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VPMOVWB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateWordToByte (SRC[m+15:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVWB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateWordToByte (SRC[m+15:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVSWB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedWordToByte (SRC[m+15:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVSWB instruction (EVEX encoded versions) when dest is memory
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateSignedWordToByte (SRC[m+15:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR

VPMOVUSWB instruction (EVEX encoded versions) when dest is a register


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedWordToByte (SRC[m+15:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;

VPMOVUSWB instruction (EVEX encoded versions) when dest is memory


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateUnsignedWordToByte (SRC[m+15:m])
ELSE
*DEST[i+7:i] remains unchanged* ; merging-masking
FI;
ENDFOR



Intel C/C++ Compiler Intrinsic Equivalents
VPMOVUSWB __m256i _mm512_cvtusepi16_epi8(__m512i a);
VPMOVUSWB __m256i _mm512_mask_cvtusepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVUSWB __m256i _mm512_maskz_cvtusepi16_epi8( __mmask32 k, __m512i b);
VPMOVUSWB void _mm512_mask_cvtusepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVSWB __m256i _mm512_cvtsepi16_epi8(__m512i a);
VPMOVSWB __m256i _mm512_mask_cvtsepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVSWB __m256i _mm512_maskz_cvtsepi16_epi8( __mmask32 k, __m512i b);
VPMOVSWB void _mm512_mask_cvtsepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVWB __m256i _mm512_cvtepi16_epi8(__m512i a);
VPMOVWB __m256i _mm512_mask_cvtepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVWB __m256i _mm512_maskz_cvtepi16_epi8( __mmask32 k, __m512i b);
VPMOVWB void _mm512_mask_cvtepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVUSWB __m128i _mm256_cvtusepi16_epi8(__m256i a);
VPMOVUSWB __m128i _mm256_mask_cvtusepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVUSWB __m128i _mm256_maskz_cvtusepi16_epi8( __mmask16 k, __m256i b);
VPMOVUSWB void _mm256_mask_cvtusepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVUSWB __m128i _mm_cvtusepi16_epi8(__m128i a);
VPMOVUSWB __m128i _mm_mask_cvtusepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVUSWB __m128i _mm_maskz_cvtusepi16_epi8( __mmask8 k, __m128i b);
VPMOVUSWB void _mm_mask_cvtusepi16_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVSWB __m128i _mm256_cvtsepi16_epi8(__m256i a);
VPMOVSWB __m128i _mm256_mask_cvtsepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVSWB __m128i _mm256_maskz_cvtsepi16_epi8( __mmask16 k, __m256i b);
VPMOVSWB void _mm256_mask_cvtsepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVSWB __m128i _mm_cvtsepi16_epi8(__m128i a);
VPMOVSWB __m128i _mm_mask_cvtsepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVSWB __m128i _mm_maskz_cvtsepi16_epi8( __mmask8 k, __m128i b);
VPMOVSWB void _mm_mask_cvtsepi16_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVWB __m128i _mm256_cvtepi16_epi8(__m256i a);
VPMOVWB __m128i _mm256_mask_cvtepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVWB __m128i _mm256_maskz_cvtepi16_epi8( __mmask16 k, __m256i b);
VPMOVWB void _mm256_mask_cvtepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVWB __m128i _mm_cvtepi16_epi8(__m128i a);
VPMOVWB __m128i _mm_mask_cvtepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVWB __m128i _mm_maskz_cvtepi16_epi8( __mmask8 k, __m128i b);
VPMOVWB void _mm_mask_cvtepi16_storeu_epi8(void * , __mmask8 k, __m128i b);
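
A minimal C sketch (assuming a compiler providing <immintrin.h> and a target compiled with AVX512BW enabled)
comparing the three down-conversion behaviors on the same input:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 32 signed words whose values span below -128 and above 255. */
    int16_t w[32];
    for (int i = 0; i < 32; i++)
        w[i] = (int16_t)(i * 300 - 4800);
    __m512i src = _mm512_loadu_si512(w);

    int8_t  t[32], s[32];
    uint8_t u[32];
    _mm256_storeu_si256((__m256i *)t, _mm512_cvtepi16_epi8(src));    /* VPMOVWB: truncate */
    _mm256_storeu_si256((__m256i *)s, _mm512_cvtsepi16_epi8(src));   /* VPMOVSWB: signed saturation */
    _mm256_storeu_si256((__m256i *)u, _mm512_cvtusepi16_epi8(src));  /* VPMOVUSWB: unsigned saturation */

    /* w[0] = -4800: truncation keeps the low byte (64), signed saturation
       clamps to -128, unsigned saturation clamps to 0. */
    printf("%d -> trunc %d, ssat %d, usat %d\n", w[0], t[0], s[0], u[0]);
    return 0;
}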

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VPMULTISHIFTQB—Select Packed Unaligned Bytes From Quadword Sources
EVEX.128.66.0F38.W1 83 /r  VPMULTISHIFTQB xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512_VBMI) OR AVX10.1 (see note 1).
    Select unaligned bytes from qwords in xmm3/m128/m64bcst using control bytes in xmm2, write byte results to
    xmm1 under k1.

EVEX.256.66.0F38.W1 83 /r  VPMULTISHIFTQB ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512_VBMI) OR AVX10.1 (see note 1).
    Select unaligned bytes from qwords in ymm3/m256/m64bcst using control bytes in ymm2, write byte results to
    ymm1 under k1.

EVEX.512.66.0F38.W1 83 /r  VPMULTISHIFTQB zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512_VBMI OR AVX10.1 (see note 1).
    Select unaligned bytes from qwords in zmm3/m512/m64bcst using control bytes in zmm2, write byte results to
    zmm1 under k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction selects eight unaligned bytes from each input qword element of the second source operand (the
third operand) and writes eight assembled bytes for each qword element in the destination operand (the first
operand). Each byte result is selected using a byte-granular shift control within the corresponding qword element
of the first source operand (the second operand). Each byte result in the destination operand is updated under the
writemask k1.
Only the low 6 bits of each control byte are used to select an 8-bit slot to extract the output byte from the qword
data in the second source operand. The starting bit of the 8-bit slot can be unaligned relative to any byte boundary
and is extracted from the input qword source at the location specified in the low 6 bits of the control byte. If the
8-bit slot would exceed the qword boundary, the out-of-bound portion of the 8-bit slot is wrapped back to start
from bit 0 of the input qword element. For example, a control byte value of 60 yields a result byte whose low four
bits are source bits 63:60 and whose high four bits wrap around to source bits 3:0.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM reg-
ister, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory loca-
tion. The destination operand is a ZMM/YMM/XMM register.



Operation

VPMULTISHIFTQB DEST, SRC1, SRC2 (EVEX encoded version)


(KL, VL) = (2, 128),(4, 256), (8, 512)
FOR i := 0 TO KL-1
IF EVEX.b=1 AND src2 is memory THEN
tcur := src2.qword[0]; //broadcasting
ELSE
tcur := src2.qword[i];
FI;
FOR j := 0 to 7
ctrl := src1.qword[i].byte[j] & 63;
FOR k := 0 to 7
res.bit[k] := tcur.bit[ (ctrl+k) mod 64 ];
ENDFOR
IF k1[i*8+j] OR *no writemask* THEN
DEST.qword[i].byte[j] := res;
ELSE IF *zeroing-masking* THEN
DEST.qword[i].byte[j] := 0;
ELSE ; merging-masking
*DEST.qword[i].byte[j] remains unchanged*
FI;
ENDFOR
ENDFOR
DEST[MAX_VL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VPMULTISHIFTQB __m512i _mm512_multishift_epi64_epi8( __m512i a, __m512i b);
VPMULTISHIFTQB __m512i _mm512_mask_multishift_epi64_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMULTISHIFTQB __m512i _mm512_maskz_multishift_epi64_epi8( __mmask64 k, __m512i a, __m512i b);
VPMULTISHIFTQB __m256i _mm256_multishift_epi64_epi8( __m256i a, __m256i b);
VPMULTISHIFTQB __m256i _mm256_mask_multishift_epi64_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMULTISHIFTQB __m256i _mm256_maskz_multishift_epi64_epi8( __mmask32 k, __m256i a, __m256i b);
VPMULTISHIFTQB __m128i _mm_multishift_epi64_epi8( __m128i a, __m128i b);
VPMULTISHIFTQB __m128i _mm_mask_multishift_epi64_epi8(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULTISHIFTQB __m128i _mm_maskz_multishift_epi64_epi8( __mmask8 k, __m128i a, __m128i b);
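
A short C sketch (assuming <immintrin.h> and a target compiled with AVX512_VBMI support) that uses the
multishift to pull eight 6-bit fields out of each qword, masking each extracted byte down to 6 bits afterward:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    /* Bit offsets of eight consecutive 6-bit fields inside a qword. */
    const uint8_t off[8] = { 0, 6, 12, 18, 24, 30, 36, 42 };
    int64_t ctrl64;
    memcpy(&ctrl64, off, sizeof(ctrl64));

    __m512i ctrl = _mm512_set1_epi64(ctrl64);
    __m512i data = _mm512_set1_epi64(0x0123456789ABCDEF);

    /* Each result byte holds the 8 bits starting at the given offset;
       keep only the low 6 of them. */
    __m512i picked = _mm512_multishift_epi64_epi8(ctrl, data);
    picked = _mm512_and_si512(picked, _mm512_set1_epi8(0x3F));

    uint8_t out[64];
    _mm512_storeu_si512(out, picked);
    for (int i = 0; i < 8; i++)
        printf("%02X ", out[i]);   /* the eight 6-bit fields of the first qword */
    printf("\n");
    return 0;
}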

SIMD Floating-Point Exceptions


None.

Other Exceptions

See Table 2-52, “Type E4NF Class Exception Conditions.”



VPOPCNT—Return the Count of Number of Bits Set to 1 in BYTE/WORD/DWORD/QWORD
EVEX.128.66.0F38.W0 54 /r  VPOPCNTB xmm1{k1}{z}, xmm2/m128
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_BITALG AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in xmm2/m128 and puts the result in xmm1 with writemask k1.

EVEX.256.66.0F38.W0 54 /r  VPOPCNTB ymm1{k1}{z}, ymm2/m256
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_BITALG AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in ymm2/m256 and puts the result in ymm1 with writemask k1.

EVEX.512.66.0F38.W0 54 /r  VPOPCNTB zmm1{k1}{z}, zmm2/m512
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512_BITALG OR AVX10.1 (see note 1).
    Counts the number of bits set to one in zmm2/m512 and puts the result in zmm1 with writemask k1.

EVEX.128.66.0F38.W1 54 /r  VPOPCNTW xmm1{k1}{z}, xmm2/m128
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_BITALG AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in xmm2/m128 and puts the result in xmm1 with writemask k1.

EVEX.256.66.0F38.W1 54 /r  VPOPCNTW ymm1{k1}{z}, ymm2/m256
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_BITALG AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in ymm2/m256 and puts the result in ymm1 with writemask k1.

EVEX.512.66.0F38.W1 54 /r  VPOPCNTW zmm1{k1}{z}, zmm2/m512
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512_BITALG OR AVX10.1 (see note 1).
    Counts the number of bits set to one in zmm2/m512 and puts the result in zmm1 with writemask k1.

EVEX.128.66.0F38.W0 55 /r  VPOPCNTD xmm1{k1}{z}, xmm2/m128/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VPOPCNTDQ AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in xmm2/m128/m32bcst and puts the result in xmm1 with writemask k1.

EVEX.256.66.0F38.W0 55 /r  VPOPCNTD ymm1{k1}{z}, ymm2/m256/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VPOPCNTDQ AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in ymm2/m256/m32bcst and puts the result in ymm1 with writemask k1.

EVEX.512.66.0F38.W0 55 /r  VPOPCNTD zmm1{k1}{z}, zmm2/m512/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VPOPCNTDQ OR AVX10.1 (see note 1).
    Counts the number of bits set to one in zmm2/m512/m32bcst and puts the result in zmm1 with writemask k1.

EVEX.128.66.0F38.W1 55 /r  VPOPCNTQ xmm1{k1}{z}, xmm2/m128/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VPOPCNTDQ AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in xmm2/m128/m64bcst and puts the result in xmm1 with writemask k1.

EVEX.256.66.0F38.W1 55 /r  VPOPCNTQ ymm1{k1}{z}, ymm2/m256/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VPOPCNTDQ AND AVX512VL) OR AVX10.1 (see note 1).
    Counts the number of bits set to one in ymm2/m256/m64bcst and puts the result in ymm1 with writemask k1.

EVEX.512.66.0F38.W1 55 /r  VPOPCNTQ zmm1{k1}{z}, zmm2/m512/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VPOPCNTDQ OR AVX10.1 (see note 1).
    Counts the number of bits set to one in zmm2/m512/m64bcst and puts the result in zmm1 with writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) ModRM:r/m (r) N/A N/A
B Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction counts the number of bits set to one in each byte, word, dword or qword element of its source (e.g.,
zmm2 or memory) and places the results in the destination register (zmm1). This instruction supports memory
fault suppression.

Operation
VPOPCNTB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.byte[j] := POPCNT(SRC.byte[j])
ELSE IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE:
DEST.byte[j] := 0
DEST[MAX_VL-1:VL] := 0

VPOPCNTW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.word[j] := POPCNT(SRC.word[j])
ELSE IF *merging-masking*:
*DEST.word[j] remains unchanged*
ELSE:
DEST.word[j] := 0
DEST[MAX_VL-1:VL] := 0

VPOPCNTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
IF SRC is broadcast memop:
t := SRC.dword[0]
ELSE:
t := SRC.dword[j]
DEST.dword[j] := POPCNT(t)
ELSE IF *merging-masking*:
*DEST.dword[j] remains unchanged*
ELSE:
DEST.dword[j] := 0
DEST[MAX_VL-1:VL] := 0



VPOPCNTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
IF SRC is broadcast memop:
t := SRC.qword[0]
ELSE:
t := SRC.qword[j]
DEST.qword[j] := POPCNT(t)
ELSE IF *merging-masking*:
*DEST.qword[j] remains unchanged*
ELSE:
DEST.qword[j] := 0
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPOPCNTW __m128i _mm_popcnt_epi16(__m128i);
VPOPCNTW __m128i _mm_mask_popcnt_epi16(__m128i, __mmask8, __m128i);
VPOPCNTW __m128i _mm_maskz_popcnt_epi16(__mmask8, __m128i);
VPOPCNTW __m256i _mm256_popcnt_epi16(__m256i);
VPOPCNTW __m256i _mm256_mask_popcnt_epi16(__m256i, __mmask16, __m256i);
VPOPCNTW __m256i _mm256_maskz_popcnt_epi16(__mmask16, __m256i);
VPOPCNTW __m512i _mm512_popcnt_epi16(__m512i);
VPOPCNTW __m512i _mm512_mask_popcnt_epi16(__m512i, __mmask32, __m512i);
VPOPCNTW __m512i _mm512_maskz_popcnt_epi16(__mmask32, __m512i);
VPOPCNTQ __m128i _mm_popcnt_epi64(__m128i);
VPOPCNTQ __m128i _mm_mask_popcnt_epi64(__m128i, __mmask8, __m128i);
VPOPCNTQ __m128i _mm_maskz_popcnt_epi64(__mmask8, __m128i);
VPOPCNTQ __m256i _mm256_popcnt_epi64(__m256i);
VPOPCNTQ __m256i _mm256_mask_popcnt_epi64(__m256i, __mmask8, __m256i);
VPOPCNTQ __m256i _mm256_maskz_popcnt_epi64(__mmask8, __m256i);
VPOPCNTQ __m512i _mm512_popcnt_epi64(__m512i);
VPOPCNTQ __m512i _mm512_mask_popcnt_epi64(__m512i, __mmask8, __m512i);
VPOPCNTQ __m512i _mm512_maskz_popcnt_epi64(__mmask8, __m512i);
VPOPCNTD __m128i _mm_popcnt_epi32(__m128i);
VPOPCNTD __m128i _mm_mask_popcnt_epi32(__m128i, __mmask8, __m128i);
VPOPCNTD __m128i _mm_maskz_popcnt_epi32(__mmask8, __m128i);
VPOPCNTD __m256i _mm256_popcnt_epi32(__m256i);
VPOPCNTD __m256i _mm256_mask_popcnt_epi32(__m256i, __mmask8, __m256i);
VPOPCNTD __m256i _mm256_maskz_popcnt_epi32(__mmask8, __m256i);
VPOPCNTD __m512i _mm512_popcnt_epi32(__m512i);
VPOPCNTD __m512i _mm512_mask_popcnt_epi32(__m512i, __mmask16, __m512i);
VPOPCNTD __m512i _mm512_maskz_popcnt_epi32(__mmask16, __m512i);
VPOPCNTB __m128i _mm_popcnt_epi8(__m128i);
VPOPCNTB __m128i _mm_mask_popcnt_epi8(__m128i, __mmask16, __m128i);
VPOPCNTB __m128i _mm_maskz_popcnt_epi8(__mmask16, __m128i);
VPOPCNTB __m256i _mm256_popcnt_epi8(__m256i);
VPOPCNTB __m256i _mm256_mask_popcnt_epi8(__m256i, __mmask32, __m256i);
VPOPCNTB __m256i _mm256_maskz_popcnt_epi8(__mmask32, __m256i);
VPOPCNTB __m512i _mm512_popcnt_epi8(__m512i);
VPOPCNTB __m512i _mm512_mask_popcnt_epi8(__m512i, __mmask64, __m512i);
VPOPCNTB __m512i _mm512_maskz_popcnt_epi8(__mmask64, __m512i);
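
A brief C sketch (assuming <immintrin.h> and a target compiled with AVX512_VPOPCNTDQ support) exercising the
dword form:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* v[i] has exactly i low bits set, so the counts come out 0..15. */
    uint32_t v[16];
    for (int i = 0; i < 16; i++)
        v[i] = (1u << i) - 1;
    __m512i x = _mm512_loadu_si512(v);
    __m512i c = _mm512_popcnt_epi32(x);          /* VPOPCNTD */

    uint32_t r[16];
    _mm512_storeu_si512(r, c);
    for (int i = 0; i < 16; i++)
        printf("%u ", r[i]);                     /* prints 0 1 2 ... 15 */
    printf("\n");
    return 0;
}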



SIMD Floating-Point Exceptions
None.

Other Exceptions

See Table 2-51, “Type E4 Class Exception Conditions.”



VPROLD/VPROLVD/VPROLQ/VPROLVQ—Bit Rotate Left
EVEX.128.66.0F38.W0 15 /r  VPROLVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in xmm2 left by count in the corresponding element of xmm3/m128/m32bcst. Result written
    to xmm1 under writemask k1.

EVEX.128.66.0F.W0 72 /1 ib  VPROLD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in xmm2/m128/m32bcst left by imm8. Result written to xmm1 using writemask k1.

EVEX.128.66.0F38.W1 15 /r  VPROLVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in xmm2 left by count in the corresponding element of xmm3/m128/m64bcst. Result written
    to xmm1 under writemask k1.

EVEX.128.66.0F.W1 72 /1 ib  VPROLQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in xmm2/m128/m64bcst left by imm8. Result written to xmm1 using writemask k1.

EVEX.256.66.0F38.W0 15 /r  VPROLVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in ymm2 left by count in the corresponding element of ymm3/m256/m32bcst. Result written
    to ymm1 under writemask k1.

EVEX.256.66.0F.W0 72 /1 ib  VPROLD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in ymm2/m256/m32bcst left by imm8. Result written to ymm1 using writemask k1.

EVEX.256.66.0F38.W1 15 /r  VPROLVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in ymm2 left by count in the corresponding element of ymm3/m256/m64bcst. Result written
    to ymm1 under writemask k1.

EVEX.256.66.0F.W1 72 /1 ib  VPROLQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in ymm2/m256/m64bcst left by imm8. Result written to ymm1 using writemask k1.

EVEX.512.66.0F38.W0 15 /r  VPROLVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate doublewords in zmm2 left by count in the corresponding element of zmm3/m512/m32bcst. Result written
    to zmm1 using writemask k1.

EVEX.512.66.0F.W0 72 /1 ib  VPROLD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate doublewords in zmm2/m512/m32bcst left by imm8. Result written to zmm1 using writemask k1.

EVEX.512.66.0F38.W1 15 /r  VPROLVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate quadwords in zmm2 left by count in the corresponding element of zmm3/m512/m64bcst. Result written
    to zmm1 under writemask k1.

EVEX.512.66.0F.W1 72 /1 ib  VPROLQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate quadwords in zmm2/m512/m64bcst left by imm8. Result written to zmm1 using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A



Description
Rotates the bits in the individual data elements (doublewords or quadwords) in the first source operand to the left
by the number of bits specified in the count operand. If the value specified by the count operand is greater than 31
(for doublewords) or 63 (for quadwords), the count operand modulo the data size (32 or 64) is used.
EVEX.128 encoded version: The destination operand is an XMM register. The source operand is an XMM register or a
memory location (for the immediate form). The count operand can come from an XMM register, a memory
location, or an 8-bit immediate. Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
EVEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location (for the immediate form). The count operand can come from a YMM register, a memory
location, or an 8-bit immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX.512 encoded version: The destination operand is a ZMM register updated according to the writemask. For the
count operand in immediate form, the source operand can be a ZMM register, a 512-bit memory location, or a 512-
bit vector broadcasted from a 32/64-bit memory location, and the count operand is an 8-bit immediate. For the
count operand in variable form, the first source operand (the second operand) is a ZMM register and the count
operand (the third operand) is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a
32/64-bit memory location.

Operation
LEFT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 32;
DEST[31:0] := (SRC << COUNT) | (SRC >> (32 - COUNT));

LEFT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 64;
DEST[63:0] := (SRC << COUNT) | (SRC >> (64 - COUNT));

VPROLD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+31:i] := LEFT_ROTATE_DWORDS(SRC1[31:0], imm8)
ELSE DEST[i+31:i] := LEFT_ROTATE_DWORDS(SRC1[i+31:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



VPROLVD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := LEFT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[31:0])
ELSE DEST[i+31:i] := LEFT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPROLQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+63:i] := LEFT_ROTATE_QWORDS(SRC1[63:0], imm8)
ELSE DEST[i+63:i] := LEFT_ROTATE_QWORDS(SRC1[i+63:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPROLVQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := LEFT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[63:0])
ELSE DEST[i+63:i] := LEFT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;



ENDFOR
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPROLD __m512i _mm512_rol_epi32(__m512i a, int imm);
VPROLD __m512i _mm512_mask_rol_epi32(__m512i a, __mmask16 k, __m512i b, int imm);
VPROLD __m512i _mm512_maskz_rol_epi32( __mmask16 k, __m512i a, int imm);
VPROLD __m256i _mm256_rol_epi32(__m256i a, int imm);
VPROLD __m256i _mm256_mask_rol_epi32(__m256i a, __mmask8 k, __m256i b, int imm);
VPROLD __m256i _mm256_maskz_rol_epi32( __mmask8 k, __m256i a, int imm);
VPROLD __m128i _mm_rol_epi32(__m128i a, int imm);
VPROLD __m128i _mm_mask_rol_epi32(__m128i a, __mmask8 k, __m128i b, int imm);
VPROLD __m128i _mm_maskz_rol_epi32( __mmask8 k, __m128i a, int imm);
VPROLQ __m512i _mm512_rol_epi64(__m512i a, int imm);
VPROLQ __m512i _mm512_mask_rol_epi64(__m512i a, __mmask8 k, __m512i b, int imm);
VPROLQ __m512i _mm512_maskz_rol_epi64(__mmask8 k, __m512i a, int imm);
VPROLQ __m256i _mm256_rol_epi64(__m256i a, int imm);
VPROLQ __m256i _mm256_mask_rol_epi64(__m256i a, __mmask8 k, __m256i b, int imm);
VPROLQ __m256i _mm256_maskz_rol_epi64( __mmask8 k, __m256i a, int imm);
VPROLQ __m128i _mm_rol_epi64(__m128i a, int imm);
VPROLQ __m128i _mm_mask_rol_epi64(__m128i a, __mmask8 k, __m128i b, int imm);
VPROLQ __m128i _mm_maskz_rol_epi64( __mmask8 k, __m128i a, int imm);
VPROLVD __m512i _mm512_rolv_epi32(__m512i a, __m512i cnt);
VPROLVD __m512i _mm512_mask_rolv_epi32(__m512i a, __mmask16 k, __m512i b, __m512i cnt);
VPROLVD __m512i _mm512_maskz_rolv_epi32(__mmask16 k, __m512i a, __m512i cnt);
VPROLVD __m256i _mm256_rolv_epi32(__m256i a, __m256i cnt);
VPROLVD __m256i _mm256_mask_rolv_epi32(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPROLVD __m256i _mm256_maskz_rolv_epi32(__mmask8 k, __m256i a, __m256i cnt);
VPROLVD __m128i _mm_rolv_epi32(__m128i a, __m128i cnt);
VPROLVD __m128i _mm_mask_rolv_epi32(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPROLVD __m128i _mm_maskz_rolv_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPROLVQ __m512i _mm512_rolv_epi64(__m512i a, __m512i cnt);
VPROLVQ __m512i _mm512_mask_rolv_epi64(__m512i a, __mmask8 k, __m512i b, __m512i cnt);
VPROLVQ __m512i _mm512_maskz_rolv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPROLVQ __m256i _mm256_rolv_epi64(__m256i a, __m256i cnt);
VPROLVQ __m256i _mm256_mask_rolv_epi64(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPROLVQ __m256i _mm256_maskz_rolv_epi64(__mmask8 k, __m256i a, __m256i cnt);
VPROLVQ __m128i _mm_rolv_epi64(__m128i a, __m128i cnt);
VPROLVQ __m128i _mm_mask_rolv_epi64(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPROLVQ __m128i _mm_maskz_rolv_epi64(__mmask8 k, __m128i a, __m128i cnt);
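
A small C sketch (assuming <immintrin.h> and an AVX512F target; the rotate count of the immediate form must be
a compile-time constant):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m512i x = _mm512_set1_epi32((int)0x80000001u);
    __m512i r = _mm512_rol_epi32(x, 1);          /* VPROLD: the sign bit wraps around to bit 0 */

    uint32_t out[16];
    _mm512_storeu_si512(out, r);
    printf("0x%08X\n", out[0]);                  /* 0x00000003 */
    return 0;
}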

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPRORD/VPRORVD/VPRORQ/VPRORVQ—Bit Rotate Right
EVEX.128.66.0F38.W0 14 /r  VPRORVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in xmm2 right by count in the corresponding element of xmm3/m128/m32bcst, store result
    using writemask k1.

EVEX.128.66.0F.W0 72 /0 ib  VPRORD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in xmm2/m128/m32bcst right by imm8, store result using writemask k1.

EVEX.128.66.0F38.W1 14 /r  VPRORVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in xmm2 right by count in the corresponding element of xmm3/m128/m64bcst, store result
    using writemask k1.

EVEX.128.66.0F.W1 72 /0 ib  VPRORQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in xmm2/m128/m64bcst right by imm8, store result using writemask k1.

EVEX.256.66.0F38.W0 14 /r  VPRORVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in ymm2 right by count in the corresponding element of ymm3/m256/m32bcst, store result
    using writemask k1.

EVEX.256.66.0F.W0 72 /0 ib  VPRORD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate doublewords in ymm2/m256/m32bcst right by imm8, store result using writemask k1.

EVEX.256.66.0F38.W1 14 /r  VPRORVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in ymm2 right by count in the corresponding element of ymm3/m256/m64bcst, store result
    using writemask k1.

EVEX.256.66.0F.W1 72 /0 ib  VPRORQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Rotate quadwords in ymm2/m256/m64bcst right by imm8, store result using writemask k1.

EVEX.512.66.0F38.W0 14 /r  VPRORVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate doublewords in zmm2 right by count in the corresponding element of zmm3/m512/m32bcst, store result
    using writemask k1.

EVEX.512.66.0F.W0 72 /0 ib  VPRORD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate doublewords in zmm2/m512/m32bcst right by imm8, store result using writemask k1.

EVEX.512.66.0F38.W1 14 /r  VPRORVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate quadwords in zmm2 right by count in the corresponding element of zmm3/m512/m64bcst, store result
    using writemask k1.

EVEX.512.66.0F.W1 72 /0 ib  VPRORQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Rotate quadwords in zmm2/m512/m64bcst right by imm8, store result using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full EVEX.vvvv (w) ModRM:r/m (r) imm8 N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A



Description
Rotates the bits in the individual data elements (doublewords or quadwords) in the first source operand to the right
by the number of bits specified in the count operand. If the value specified by the count operand is greater than 31
(for doublewords) or 63 (for quadwords), the count operand modulo the data size (32 or 64) is used.
EVEX.128 encoded version: The destination operand is an XMM register. The source operand is an XMM register or a
memory location (for the immediate form). The count operand can come from an XMM register, a memory
location, or an 8-bit immediate. Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
EVEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location (for the immediate form). The count operand can come from a YMM register, a memory
location, or an 8-bit immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX.512 encoded version: The destination operand is a ZMM register updated according to the writemask. For the
count operand in immediate form, the source operand can be a ZMM register, a 512-bit memory location, or a 512-
bit vector broadcasted from a 32/64-bit memory location, and the count operand is an 8-bit immediate. For the
count operand in variable form, the first source operand (the second operand) is a ZMM register and the count
operand (the third operand) is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a
32/64-bit memory location.

Operation
RIGHT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 32;
DEST[31:0] := (SRC >> COUNT) | (SRC << (32 - COUNT));

RIGHT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 64;
DEST[63:0] := (SRC >> COUNT) | (SRC << (64 - COUNT));

VPRORD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+31:i] := RIGHT_ROTATE_DWORDS( SRC1[31:0], imm8)
ELSE DEST[i+31:i] := RIGHT_ROTATE_DWORDS(SRC1[i+31:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPRORVD (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := RIGHT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[31:0])
ELSE DEST[i+31:i] := RIGHT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[i+31:i])
FI;



ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPRORQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN DEST[i+63:i] := RIGHT_ROTATE_QWORDS(SRC1[63:0], imm8)
ELSE DEST[i+63:i] := RIGHT_ROTATE_QWORDS(SRC1[i+63:i], imm8)
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VPRORVQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := RIGHT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[63:0])
ELSE DEST[i+63:i] := RIGHT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPRORD __m512i _mm512_ror_epi32(__m512i a, int imm);
VPRORD __m512i _mm512_mask_ror_epi32(__m512i a, __mmask16 k, __m512i b, int imm);
VPRORD __m512i _mm512_maskz_ror_epi32( __mmask16 k, __m512i a, int imm);
VPRORD __m256i _mm256_ror_epi32(__m256i a, int imm);
VPRORD __m256i _mm256_mask_ror_epi32(__m256i a, __mmask8 k, __m256i b, int imm);
VPRORD __m256i _mm256_maskz_ror_epi32( __mmask8 k, __m256i a, int imm);
VPRORD __m128i _mm_ror_epi32(__m128i a, int imm);
VPRORD __m128i _mm_mask_ror_epi32(__m128i a, __mmask8 k, __m128i b, int imm);
VPRORD __m128i _mm_maskz_ror_epi32( __mmask8 k, __m128i a, int imm);
VPRORQ __m512i _mm512_ror_epi64(__m512i a, int imm);
VPRORQ __m512i _mm512_mask_ror_epi64(__m512i a, __mmask8 k, __m512i b, int imm);
VPRORQ __m512i _mm512_maskz_ror_epi64(__mmask8 k, __m512i a, int imm);
VPRORQ __m256i _mm256_ror_epi64(__m256i a, int imm);
VPRORQ __m256i _mm256_mask_ror_epi64(__m256i a, __mmask8 k, __m256i b, int imm);
VPRORQ __m256i _mm256_maskz_ror_epi64( __mmask8 k, __m256i a, int imm);
VPRORQ __m128i _mm_ror_epi64(__m128i a, int imm);
VPRORQ __m128i _mm_mask_ror_epi64(__m128i a, __mmask8 k, __m128i b, int imm);
VPRORQ __m128i _mm_maskz_ror_epi64( __mmask8 k, __m128i a, int imm);
VPRORVD __m512i _mm512_rorv_epi32(__m512i a, __m512i cnt);
VPRORVD __m512i _mm512_mask_rorv_epi32(__m512i a, __mmask16 k, __m512i b, __m512i cnt);
VPRORVD __m512i _mm512_maskz_rorv_epi32(__mmask16 k, __m512i a, __m512i cnt);
VPRORVD __m256i _mm256_rorv_epi32(__m256i a, __m256i cnt);
VPRORVD __m256i _mm256_mask_rorv_epi32(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPRORVD __m256i _mm256_maskz_rorv_epi32(__mmask8 k, __m256i a, __m256i cnt);
VPRORVD __m128i _mm_rorv_epi32(__m128i a, __m128i cnt);
VPRORVD __m128i _mm_mask_rorv_epi32(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPRORVD __m128i _mm_maskz_rorv_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPRORVQ __m512i _mm512_rorv_epi64(__m512i a, __m512i cnt);
VPRORVQ __m512i _mm512_mask_rorv_epi64(__m512i a, __mmask8 k, __m512i b, __m512i cnt);
VPRORVQ __m512i _mm512_maskz_rorv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPRORVQ __m256i _mm256_rorv_epi64(__m256i a, __m256i cnt);
VPRORVQ __m256i _mm256_mask_rorv_epi64(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPRORVQ __m256i _mm256_maskz_rorv_epi64(__mmask8 k, __m256i a, __m256i cnt);
VPRORVQ __m128i _mm_rorv_epi64(__m128i a, __m128i cnt);
VPRORVQ __m128i _mm_mask_rorv_epi64(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPRORVQ __m128i _mm_maskz_rorv_epi64(__mmask8 k, __m128i a, __m128i cnt);
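
The scalar model below (plain C, no intrinsics assumed) mirrors RIGHT_ROTATE_DWORDS from the Operation
section; note the guard for a zero count, since a C shift by 32 is undefined even though the hardware operation is
well defined:

#include <stdint.h>
#include <stdio.h>

/* Scalar model of RIGHT_ROTATE_DWORDS. */
static uint32_t ror32(uint32_t v, unsigned count) {
    count %= 32;                                  /* count is taken modulo the data size */
    return count ? (v >> count) | (v << (32 - count)) : v;
}

int main(void) {
    printf("0x%08X\n", ror32(0x00000003u, 1));    /* 0x80000001 */
    printf("0x%08X\n", ror32(0xDEADBEEFu, 36));   /* same result as a rotate by 4 */
    return 0;
}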

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ—Scatter Packed Dword, Packed Qword with Signed Dword, Signed Qword Indices
EVEX.128.66.0F38.W0 A0 /vsib  VPSCATTERDD vm32x {k1}, xmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed dword indices, scatter dword values to memory using writemask k1.

EVEX.256.66.0F38.W0 A0 /vsib  VPSCATTERDD vm32y {k1}, ymm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed dword indices, scatter dword values to memory using writemask k1.

EVEX.512.66.0F38.W0 A0 /vsib  VPSCATTERDD vm32z {k1}, zmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Using signed dword indices, scatter dword values to memory using writemask k1.

EVEX.128.66.0F38.W1 A0 /vsib  VPSCATTERDQ vm32x {k1}, xmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed dword indices, scatter qword values to memory using writemask k1.

EVEX.256.66.0F38.W1 A0 /vsib  VPSCATTERDQ vm32x {k1}, ymm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed dword indices, scatter qword values to memory using writemask k1.

EVEX.512.66.0F38.W1 A0 /vsib  VPSCATTERDQ vm32y {k1}, zmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Using signed dword indices, scatter qword values to memory using writemask k1.

EVEX.128.66.0F38.W0 A1 /vsib  VPSCATTERQD vm64x {k1}, xmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed qword indices, scatter dword values to memory using writemask k1.

EVEX.256.66.0F38.W0 A1 /vsib  VPSCATTERQD vm64y {k1}, xmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed qword indices, scatter dword values to memory using writemask k1.

EVEX.512.66.0F38.W0 A1 /vsib  VPSCATTERQD vm64z {k1}, ymm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Using signed qword indices, scatter dword values to memory using writemask k1.

EVEX.128.66.0F38.W1 A1 /vsib  VPSCATTERQQ vm64x {k1}, xmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed qword indices, scatter qword values to memory using writemask k1.

EVEX.256.66.0F38.W1 A1 /vsib  VPSCATTERQQ vm64y {k1}, ymm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1).
    Using signed qword indices, scatter qword values to memory using writemask k1.

EVEX.512.66.0F38.W1 A1 /vsib  VPSCATTERQQ vm64z {k1}, zmm1
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512F OR AVX10.1 (see note 1).
    Using signed qword indices, scatter qword values to memory using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar BaseReg (R): VSIB:base, VectorReg (R): VSIB:index ModRM:reg (r) N/A N/A

Description
Stores up to 16 elements (8 elements for qword indices) in doubleword vector or 8 elements in quadword vector to
the memory locations pointed to by base address BASE_ADDR and index vector VINDEX, with scale SCALE. The
elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements

will only be stored if their corresponding mask bit is one. The entire mask register will be set to zero by this
instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
memory and the mask register are partially updated. If any traps or interrupts are pending from already scattered
elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction
breakpoint is not re-triggered when the instruction is continued.
Note that:
• Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to
MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not
overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory
ordering model. Note that this does not account for non-overlapping indices that map into the same physical
address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination ZMM will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be scattered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before
the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this
instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is other than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the k0 mask register is specified.
The instruction will #UD fault if EVEX.Z = 1.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VPSCATTERDD (EVEX encoded versions)


(KL, VL)= (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR +SignExtend(VINDEX[i+31:i]) * SCALE + DISP] := SRC[i+31:i]
k1[j] := 0

FI;
ENDFOR

k1[MAX_KL-1:KL] := 0

VPSCATTERDQ (EVEX encoded versions)


(KL, VL)= (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR +SignExtend(VINDEX[k+31:k]) * SCALE + DISP] := SRC[i+63:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

VPSCATTERQD (EVEX encoded versions)


(KL, VL)= (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP] := SRC[i+31:i]
k1[j] := 0

FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

VPSCATTERQQ (EVEX encoded versions)


(KL, VL)= (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP] := SRC[i+63:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPSCATTERDD void _mm512_i32scatter_epi32(void * base, __m512i vdx, __m512i a, int scale);
VPSCATTERDD void _mm256_i32scatter_epi32(void * base, __m256i vdx, __m256i a, int scale);
VPSCATTERDD void _mm_i32scatter_epi32(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERDD void _mm512_mask_i32scatter_epi32(void * base, __mmask16 k, __m512i vdx, __m512i a, int scale);
VPSCATTERDD void _mm256_mask_i32scatter_epi32(void * base, __mmask8 k, __m256i vdx, __m256i a, int scale);
VPSCATTERDD void _mm_mask_i32scatter_epi32(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERDQ void _mm512_i32scatter_epi64(void * base, __m256i vdx, __m512i a, int scale);
VPSCATTERDQ void _mm256_i32scatter_epi64(void * base, __m128i vdx, __m256i a, int scale);
VPSCATTERDQ void _mm_i32scatter_epi64(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERDQ void _mm512_mask_i32scatter_epi64(void * base, __mmask8 k, __m256i vdx, __m512i a, int scale);
VPSCATTERDQ void _mm256_mask_i32scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m256i a, int scale);
VPSCATTERDQ void _mm_mask_i32scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERQD void _mm512_i64scatter_epi32(void * base, __m512i vdx, __m256i a, int scale);
VPSCATTERQD void _mm256_i64scatter_epi32(void * base, __m256i vdx, __m128i a, int scale);
VPSCATTERQD void _mm_i64scatter_epi32(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERQD void _mm512_mask_i64scatter_epi32(void * base, __mmask8 k, __m512i vdx, __m256i a, int scale);

VPSCATTERQD void _mm256_mask_i64scatter_epi32(void * base, __mmask8 k, __m256i vdx, __m128i a, int scale);
VPSCATTERQD void _mm_mask_i64scatter_epi32(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_i64scatter_epi64(void * base, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_i64scatter_epi64(void * base, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void _mm_i64scatter_epi64(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_mask_i64scatter_epi64(void * base, __mmask8 k, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_mask_i64scatter_epi64(void * base, __mmask8 k, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void _mm_mask_i64scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
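
A strided-store sketch in C (assuming <immintrin.h> and an AVX512F target; the scale argument must be a
compile-time constant of 1, 2, 4, or 8):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t table[64] = {0};

    /* Scatter the values 100..115 to table[3*j], j = 0..15 (VPSCATTERDD). */
    __m512i idx = _mm512_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21,
                                    24, 27, 30, 33, 36, 39, 42, 45);
    __m512i val = _mm512_setr_epi32(100, 101, 102, 103, 104, 105, 106, 107,
                                    108, 109, 110, 111, 112, 113, 114, 115);
    _mm512_i32scatter_epi32(table, idx, val, 4);  /* scale 4 = sizeof(uint32_t) */

    printf("table[45] = %u\n", table[45]);        /* 115 */
    return 0;
}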

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VPSHLD—Concatenate and Shift Packed Data Left Logical
EVEX.128.66.0F3A.W1 70 /r /ib  VPSHLDW xmm1{k1}{z}, xmm2, xmm3/m128, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.

EVEX.256.66.0F3A.W1 70 /r /ib  VPSHLDW ymm1{k1}{z}, ymm2, ymm3/m256, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.

EVEX.512.66.0F3A.W1 70 /r /ib  VPSHLDW zmm1{k1}{z}, zmm2, zmm3/m512, imm8
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.

EVEX.128.66.0F3A.W0 71 /r /ib  VPSHLDD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.

EVEX.256.66.0F3A.W0 71 /r /ib  VPSHLDD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.

EVEX.512.66.0F3A.W0 71 /r /ib  VPSHLDD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.

EVEX.128.66.0F3A.W1 71 /r /ib  VPSHLDQ xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.

EVEX.256.66.0F3A.W1 71 /r /ib  VPSHLDQ ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.

EVEX.512.66.0F3A.W1 71 /r /ib  VPSHLDQ zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
Concatenate packed data, extract result shifted to the left by constant value.
This instruction supports memory fault suppression.



Operation
VPSHLDW DEST, SRC2, SRC3, imm8
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
tmp := concat(SRC2.word[j], SRC3.word[j]) << (imm8 & 15)
DEST.word[j] := tmp.word[1]
ELSE IF *zeroing*:
DEST.word[j] := 0
*ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHLDD DEST, SRC2, SRC3, imm8


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.dword[0]
ELSE:
tsrc3 := SRC3.dword[j]
IF MaskBit(j) OR *no writemask*:
tmp := concat(SRC2.dword[j], tsrc3) << (imm8 & 31)
DEST.dword[j] := tmp.dword[1]
ELSE IF *zeroing*:
DEST.dword[j] := 0
*ELSE DEST.dword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHLDQ DEST, SRC2, SRC3, imm8


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.qword[0]
ELSE:
tsrc3 := SRC3.qword[j]
IF MaskBit(j) OR *no writemask*:
tmp := concat(SRC2.qword[j], tsrc3) << (imm8 & 63)
DEST.qword[j] := tmp.qword[1]
ELSE IF *zeroing*:
DEST.qword[j] := 0
*ELSE DEST.qword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPSHLDD __m128i _mm_shldi_epi32(__m128i, __m128i, int);
VPSHLDD __m128i _mm_mask_shldi_epi32(__m128i, __mmask8, __m128i, __m128i, int);
VPSHLDD __m128i _mm_maskz_shldi_epi32(__mmask8, __m128i, __m128i, int);
VPSHLDD __m256i _mm256_shldi_epi32(__m256i, __m256i, int);
VPSHLDD __m256i _mm256_mask_shldi_epi32(__m256i, __mmask8, __m256i, __m256i, int);
VPSHLDD __m256i _mm256_maskz_shldi_epi32(__mmask8, __m256i, __m256i, int);
VPSHLDD __m512i _mm512_shldi_epi32(__m512i, __m512i, int);
VPSHLDD __m512i _mm512_mask_shldi_epi32(__m512i, __mmask16, __m512i, __m512i, int);
VPSHLDD __m512i _mm512_maskz_shldi_epi32(__mmask16, __m512i, __m512i, int);
VPSHLDQ __m128i _mm_shldi_epi64(__m128i, __m128i, int);
VPSHLDQ __m128i _mm_mask_shldi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
VPSHLDQ __m128i _mm_maskz_shldi_epi64(__mmask8, __m128i, __m128i, int);
VPSHLDQ __m256i _mm256_shldi_epi64(__m256i, __m256i, int);
VPSHLDQ __m256i _mm256_mask_shldi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
VPSHLDQ __m256i _mm256_maskz_shldi_epi64(__mmask8, __m256i, __m256i, int);
VPSHLDQ __m512i _mm512_shldi_epi64(__m512i, __m512i, int);
VPSHLDQ __m512i _mm512_mask_shldi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
VPSHLDQ __m512i _mm512_maskz_shldi_epi64(__mmask8, __m512i, __m512i, int);
VPSHLDW __m128i _mm_shldi_epi16(__m128i, __m128i, int);
VPSHLDW __m128i _mm_mask_shldi_epi16(__m128i, __mmask8, __m128i, __m128i, int);
VPSHLDW __m128i _mm_maskz_shldi_epi16(__mmask8, __m128i, __m128i, int);
VPSHLDW __m256i _mm256_shldi_epi16(__m256i, __m256i, int);
VPSHLDW __m256i _mm256_mask_shldi_epi16(__m256i, __mmask16, __m256i, __m256i, int);
VPSHLDW __m256i _mm256_maskz_shldi_epi16(__mmask16, __m256i, __m256i, int);
VPSHLDW __m512i _mm512_shldi_epi16(__m512i, __m512i, int);
VPSHLDW __m512i _mm512_mask_shldi_epi16(__m512i, __mmask32, __m512i, __m512i, int);
VPSHLDW __m512i _mm512_maskz_shldi_epi16(__mmask32, __m512i, __m512i, int);
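
A C sketch of the dword form (assuming <immintrin.h> and AVX512_VBMI2 support; imm8 must be a compile-time
constant):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Per lane, form the 64-bit value a:b, shift left by 8, keep the high dword:
       the top 8 bits of b slide in below the shifted a. */
    __m512i a = _mm512_set1_epi32((int)0x11223344);
    __m512i b = _mm512_set1_epi32((int)0xAABBCCDD);
    __m512i r = _mm512_shldi_epi32(a, b, 8);      /* VPSHLDD */

    uint32_t out[16];
    _mm512_storeu_si512(out, r);
    printf("0x%08X\n", out[0]);                   /* 0x223344AA */
    return 0;
}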

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VPSHLDV—Concatenate and Variable Shift Packed Data Left Logical
EVEX.128.66.0F38.W1 70 /r  VPSHLDVW xmm1{k1}{z}, xmm2, xmm3/m128
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128 into xmm1.

EVEX.256.66.0F38.W1 70 /r  VPSHLDVW ymm1{k1}{z}, ymm2, ymm3/m256
    Op/En: A; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256 into ymm1.

EVEX.512.66.0F38.W1 70 /r  VPSHLDVW zmm1{k1}{z}, zmm2, zmm3/m512
    Op/En: A; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512 into zmm1.

EVEX.128.66.0F38.W0 71 /r  VPSHLDVD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128/m32bcst into xmm1.

EVEX.256.66.0F38.W0 71 /r  VPSHLDVD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256/m32bcst into ymm1.

EVEX.512.66.0F38.W0 71 /r  VPSHLDVD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512/m32bcst into zmm1.

EVEX.128.66.0F38.W1 71 /r  VPSHLDVQ xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128/m64bcst into xmm1.

EVEX.256.66.0F38.W1 71 /r  VPSHLDVQ ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: (AVX512_VBMI2 AND AVX512VL) OR AVX10.1 (see note 1).
    Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256/m64bcst into ymm1.

EVEX.512.66.0F38.W1 71 /r  VPSHLDVQ zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst
    Op/En: B; 64/32-bit mode: V/V; CPUID: AVX512_VBMI2 OR AVX10.1 (see note 1).
    Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512/m64bcst into zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Concatenate packed data, extract result shifted to the left by variable value.
This instruction supports memory fault suppression.

Operation
FUNCTION concat(a,b):
IF words:
d.word[1] := a
d.word[0] := b
return d
ELSE IF dwords:
q.dword[1] := a
q.dword[0] := b
return q
ELSE IF qwords:
o.qword[1] := a
o.qword[0] := b
return o

VPSHLDVW DEST, SRC2, SRC3


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
tmp := concat(DEST.word[j], SRC2.word[j]) << (SRC3.word[j] & 15)
DEST.word[j] := tmp.word[1]
ELSE IF *zeroing*:
DEST.word[j] := 0
*ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHLDVD DEST, SRC2, SRC3


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.dword[0]
ELSE:
tsrc3 := SRC3.dword[j]
IF MaskBit(j) OR *no writemask*:
tmp := concat(DEST.dword[j], SRC2.dword[j]) << (tsrc3 & 31)
DEST.dword[j] := tmp.dword[1]
ELSE IF *zeroing*:
DEST.dword[j] := 0
*ELSE DEST.dword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHLDVQ DEST, SRC2, SRC3
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.qword[0]
ELSE:
tsrc3 := SRC3.qword[j]
IF MaskBit(j) OR *no writemask*:
tmp := concat(DEST.qword[j], SRC2.qword[j]) << (tsrc3 & 63)
DEST.qword[j] := tmp.qword[1]
ELSE IF *zeroing*:
DEST.qword[j] := 0
*ELSE DEST.qword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VPSHLDVW __m128i _mm_shldv_epi16(__m128i, __m128i, __m128i);
VPSHLDVW __m128i _mm_mask_shldv_epi16(__m128i, __mmask8, __m128i, __m128i);
VPSHLDVW __m128i _mm_maskz_shldv_epi16(__mmask8, __m128i, __m128i, __m128i);
VPSHLDVW __m256i _mm256_shldv_epi16(__m256i, __m256i, __m256i);
VPSHLDVW __m256i _mm256_mask_shldv_epi16(__m256i, __mmask16, __m256i, __m256i);
VPSHLDVW __m256i _mm256_maskz_shldv_epi16(__mmask16, __m256i, __m256i, __m256i);
VPSHLDVW __m512i _mm512_shldv_epi16(__m512i, __m512i, __m512i);
VPSHLDVW __m512i _mm512_mask_shldv_epi16(__m512i, __mmask32, __m512i, __m512i);
VPSHLDVW __m512i _mm512_maskz_shldv_epi16(__mmask32, __m512i, __m512i, __m512i);
VPSHLDVQ __m512i _mm512_shldv_epi64(__m512i, __m512i, __m512i);
VPSHLDVQ __m512i _mm512_mask_shldv_epi64(__m512i, __mmask8, __m512i, __m512i);
VPSHLDVQ __m512i _mm512_maskz_shldv_epi64(__mmask8, __m512i, __m512i, __m512i);
VPSHLDVD __m128i _mm_shldv_epi32(__m128i, __m128i, __m128i);
VPSHLDVD __m128i _mm_mask_shldv_epi32(__m128i, __mmask8, __m128i, __m128i);
VPSHLDVD __m128i _mm_maskz_shldv_epi32(__mmask8, __m128i, __m128i, __m128i);
VPSHLDVD __m256i _mm256_shldv_epi32(__m256i, __m256i, __m256i);
VPSHLDVD __m256i _mm256_mask_shldv_epi32(__m256i, __mmask8, __m256i, __m256i);
VPSHLDVD __m256i _mm256_maskz_shldv_epi32(__mmask8, __m256i, __m256i, __m256i);
VPSHLDVD __m512i _mm512_shldv_epi32(__m512i, __m512i, __m512i);
VPSHLDVD __m512i _mm512_mask_shldv_epi32(__m512i, __mmask16, __m512i, __m512i);
VPSHLDVD __m512i _mm512_maskz_shldv_epi32(__mmask16, __m512i, __m512i, __m512i);
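
The following example is an illustrative sketch rather than part of the instruction definition. It exercises the 128-bit
form through the _mm_shldv_epi32 intrinsic listed above and checks each lane against a scalar model of the
concatenate-and-shift operation; the input values and build flags (e.g., gcc -O2 -mavx512vbmi2 -mavx512vl) are
assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* DEST supplies the high dword of each 64-bit concatenation,
       SRC2 the low dword, SRC3 the per-lane shift counts. */
    __m128i dest = _mm_set1_epi32((int)0x89ABCDEFu);
    __m128i src2 = _mm_set1_epi32(0x01234567);
    __m128i src3 = _mm_setr_epi32(0, 4, 8, 31);

    __m128i r = _mm_shldv_epi32(dest, src2, src3);

    uint32_t out[4];
    uint32_t cnt[4] = {0, 4, 8, 31};
    _mm_storeu_si128((__m128i *)out, r);

    uint64_t cat = ((uint64_t)0x89ABCDEFu << 32) | 0x01234567u;
    for (int j = 0; j < 4; j++)   /* compare against a scalar model */
        printf("lane %d: %08x (ref %08x)\n", j, out[j],
               (uint32_t)((cat << (cnt[j] & 31)) >> 32));
    return 0;
}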

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VPSHRD—Concatenate and Shift Packed Data Right Logical
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W1 72 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 72 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 72 /r /ib A V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W0 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128/m32bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W0 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256/m32bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W0 73 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512/m32bcst, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W1 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128/m64bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256/m64bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 73 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512/m64bcst, imm8 value in imm8 into zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
Concatenate packed data, extract result shifted to the right by constant value.
This instruction supports memory fault suppression.



Operation
VPSHRDW DEST, SRC2, SRC3, imm8
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.word[j] := concat(SRC3.word[j], SRC2.word[j]) >> (imm8 & 15)
ELSE IF *zeroing*:
DEST.word[j] := 0
*ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHRDD DEST, SRC2, SRC3, imm8


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.dword[0]
ELSE:
tsrc3 := SRC3.dword[j]
IF MaskBit(j) OR *no writemask*:
DEST.dword[j] := concat(tsrc3, SRC2.dword[j]) >> (imm8 & 31)
ELSE IF *zeroing*:
DEST.dword[j] := 0
*ELSE DEST.dword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHRDQ DEST, SRC2, SRC3, imm8


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.qword[0]
ELSE:
tsrc3 := SRC3.qword[j]
IF MaskBit(j) OR *no writemask*:
DEST.qword[j] := concat(tsrc3, SRC2.qword[j]) >> (imm8 & 63)
ELSE IF *zeroing*:
DEST.qword[j] := 0
*ELSE DEST.qword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VPSHRDQ __m128i _mm_shrdi_epi64(__m128i, __m128i, int);
VPSHRDQ __m128i _mm_mask_shrdi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDQ __m128i _mm_maskz_shrdi_epi64(__mmask8, __m128i, __m128i, int);
VPSHRDQ __m256i _mm256_shrdi_epi64(__m256i, __m256i, int);
VPSHRDQ __m256i _mm256_mask_shrdi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
VPSHRDQ __m256i _mm256_maskz_shrdi_epi64(__mmask8, __m256i, __m256i, int);
VPSHRDQ __m512i _mm512_shrdi_epi64(__m512i, __m512i, int);
VPSHRDQ __m512i _mm512_mask_shrdi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
VPSHRDQ __m512i _mm512_maskz_shrdi_epi64(__mmask8, __m512i, __m512i, int);
VPSHRDD __m128i _mm_shrdi_epi32(__m128i, __m128i, int);
VPSHRDD __m128i _mm_mask_shrdi_epi32(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDD __m128i _mm_maskz_shrdi_epi32(__mmask8, __m128i, __m128i, int);
VPSHRDD __m256i _mm256_shrdi_epi32(__m256i, __m256i, int);
VPSHRDD __m256i _mm256_mask_shrdi_epi32(__m256i, __mmask8, __m256i, __m256i, int);
VPSHRDD __m256i _mm256_maskz_shrdi_epi32(__mmask8, __m256i, __m256i, int);
VPSHRDD __m512i _mm512_shrdi_epi32(__m512i, __m512i, int);
VPSHRDD __m512i _mm512_mask_shrdi_epi32(__m512i, __mmask16, __m512i, __m512i, int);
VPSHRDD __m512i _mm512_maskz_shrdi_epi32(__mmask16, __m512i, __m512i, int);
VPSHRDW __m128i _mm_shrdi_epi16(__m128i, __m128i, int);
VPSHRDW __m128i _mm_mask_shrdi_epi16(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDW __m128i _mm_maskz_shrdi_epi16(__mmask8, __m128i, __m128i, int);
VPSHRDW __m256i _mm256_shrdi_epi16(__m256i, __m256i, int);
VPSHRDW __m256i _mm256_mask_shrdi_epi16(__m256i, __mmask16, __m256i, __m256i, int);
VPSHRDW __m256i _mm256_maskz_shrdi_epi16(__mmask16, __m256i, __m256i, int);
VPSHRDW __m512i _mm512_shrdi_epi16(__m512i, __m512i, int);
VPSHRDW __m512i _mm512_mask_shrdi_epi16(__m512i, __mmask32, __m512i, __m512i, int);
VPSHRDW __m512i _mm512_maskz_shrdi_epi16(__mmask32, __m512i, __m512i, int);
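
As an illustrative sketch (not part of the instruction definition), the following program uses the _mm_shrdi_epi32
intrinsic listed above to right-shift the 64-bit concatenation of two dword vectors by an immediate count and keep
the low dword; the chosen values and build flags (e.g., -mavx512vbmi2 -mavx512vl) are assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i src2 = _mm_set1_epi32(0x01234567);        /* low half of each concatenation */
    __m128i src3 = _mm_set1_epi32((int)0x89ABCDEFu);  /* high half */

    /* Right-shift each 64-bit concatenation (src3:src2) by 8 bits and
       keep the low dword of the result. */
    __m128i r = _mm_shrdi_epi32(src2, src3, 8);

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, r);

    uint64_t cat = ((uint64_t)0x89ABCDEFu << 32) | 0x01234567u;
    printf("lane 0: %08x (ref %08x)\n", out[0], (uint32_t)(cat >> 8));
    return 0;
}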

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”



VPSHRDV—Concatenate and Variable Shift Packed Data Right Logical
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 72 /r A V/V (AVX512_VBMI2 Concatenate xmm1 and xmm2, extract result
VPSHRDVW xmm1{k1}{z}, xmm2, AND AVX512VL) shifted to the right by value in xmm3/m128
xmm3/m128 OR AVX10.11 into xmm1.
EVEX.256.66.0F38.W1 72 /r A V/V (AVX512_VBMI2 Concatenate ymm1 and ymm2, extract result
VPSHRDVW ymm1{k1}{z}, ymm2, AND AVX512VL) shifted to the right by value in ymm3/m256
ymm3/m256 OR AVX10.11 into ymm1.
EVEX.512.66.0F38.W1 72 /r A V/V AVX512_VBMI2 Concatenate zmm1 and zmm2, extract result
VPSHRDVW zmm1{k1}{z}, zmm2, OR AVX10.11 shifted to the right by value in zmm3/m512
zmm3/m512 into zmm1.
EVEX.128.66.0F38.W0 73 /r B V/V (AVX512_VBMI2 Concatenate xmm1 and xmm2, extract result
VPSHRDVD xmm1{k1}{z}, xmm2, AND AVX512VL) shifted to the right by value in xmm3/m128
xmm3/m128/m32bcst OR AVX10.11 into xmm1.
EVEX.256.66.0F38.W0 73 /r B V/V (AVX512_VBMI2 Concatenate ymm1 and ymm2, extract result
VPSHRDVD ymm1{k1}{z}, ymm2, AND AVX512VL) shifted to the right by value in ymm3/m256
ymm3/m256/m32bcst OR AVX10.11 into ymm1.
EVEX.512.66.0F38.W0 73 /r B V/V AVX512_VBMI2 Concatenate zmm1 and zmm2, extract result
VPSHRDVD zmm1{k1}{z}, zmm2, OR AVX10.11 shifted to the right by value in zmm3/m512
zmm3/m512/m32bcst into zmm1.
EVEX.128.66.0F38.W1 73 /r B V/V (AVX512_VBMI2 Concatenate xmm1 and xmm2, extract result
VPSHRDVQ xmm1{k1}{z}, xmm2, AND AVX512VL) shifted to the right by value in xmm3/m128
xmm3/m128/m64bcst OR AVX10.11 into xmm1.
EVEX.256.66.0F38.W1 73 /r B V/V (AVX512_VBMI2 Concatenate ymm1 and ymm2, extract result
VPSHRDVQ ymm1{k1}{z}, ymm2, AND AVX512VL) shifted to the right by value in ymm3/m256
ymm3/m256/m64bcst OR AVX10.11 into ymm1.
EVEX.512.66.0F38.W1 73 /r B V/V AVX512_VBMI2 Concatenate zmm1 and zmm2, extract result
VPSHRDVQ zmm1{k1}{z}, zmm2, OR AVX10.11 shifted to the right by value in zmm3/m512
zmm3/m512/m64bcst into zmm1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Concatenate packed data, extract result shifted to the right by variable value.
This instruction supports memory fault suppression.

Operation
VPSHRDVW DEST, SRC2, SRC3
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.word[j] := concat(SRC2.word[j], DEST.word[j]) >> (SRC3.word[j] & 15)
ELSE IF *zeroing*:
DEST.word[j] := 0
*ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHRDVD DEST, SRC2, SRC3


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.dword[0]
ELSE:
tsrc3 := SRC3.dword[j]
IF MaskBit(j) OR *no writemask*:
DEST.dword[j] := concat(SRC2.dword[j], DEST.dword[j]) >> (tsrc3 & 31)
ELSE IF *zeroing*:
DEST.dword[j] := 0
*ELSE DEST.dword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

VPSHRDVQ DEST, SRC2, SRC3


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.qword[0]
ELSE:
tsrc3 := SRC3.qword[j]
IF MaskBit(j) OR *no writemask*:
DEST.qword[j] := concat(SRC2.qword[j], DEST.qword[j]) >> (tsrc3 & 63)
ELSE IF *zeroing*:
DEST.qword[j] := 0
*ELSE DEST.qword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPSHRDVQ __m128i _mm_shrdv_epi64(__m128i, __m128i, __m128i);
VPSHRDVQ __m128i _mm_mask_shrdv_epi64(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVQ __m128i _mm_maskz_shrdv_epi64(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVQ __m256i _mm256_shrdv_epi64(__m256i, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_mask_shrdv_epi64(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_maskz_shrdv_epi64(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVQ __m512i _mm512_shrdv_epi64(__m512i, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_mask_shrdv_epi64(__m512i, __mmask8, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_maskz_shrdv_epi64(__mmask8, __m512i, __m512i, __m512i);
VPSHRDVD __m128i _mm_shrdv_epi32(__m128i, __m128i, __m128i);
VPSHRDVD __m128i _mm_mask_shrdv_epi32(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVD __m128i _mm_maskz_shrdv_epi32(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVD __m256i _mm256_shrdv_epi32(__m256i, __m256i, __m256i);
VPSHRDVD __m256i _mm256_mask_shrdv_epi32(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVD __m256i _mm256_maskz_shrdv_epi32(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVD __m512i _mm512_shrdv_epi32(__m512i, __m512i, __m512i);
VPSHRDVD __m512i _mm512_mask_shrdv_epi32(__m512i, __mmask16, __m512i, __m512i);
VPSHRDVD __m512i _mm512_maskz_shrdv_epi32(__mmask16, __m512i, __m512i, __m512i);
VPSHRDVW __m128i _mm_shrdv_epi16(__m128i, __m128i, __m128i);
VPSHRDVW __m128i _mm_mask_shrdv_epi16(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVW __m128i _mm_maskz_shrdv_epi16(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVW __m256i _mm256_shrdv_epi16(__m256i, __m256i, __m256i);
VPSHRDVW __m256i _mm256_mask_shrdv_epi16(__m256i, __mmask16, __m256i, __m256i);
VPSHRDVW __m256i _mm256_maskz_shrdv_epi16(__mmask16, __m256i, __m256i, __m256i);
VPSHRDVW __m512i _mm512_shrdv_epi16(__m512i, __m512i, __m512i);
VPSHRDVW __m512i _mm512_mask_shrdv_epi16(__m512i, __mmask32, __m512i, __m512i);
VPSHRDVW __m512i _mm512_maskz_shrdv_epi16(__mmask32, __m512i, __m512i, __m512i);
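
As an illustrative sketch (not part of the instruction definition), the following program uses the _mm_shrdv_epi32
intrinsic listed above with per-lane shift counts and compares each lane against a scalar model; the input values
and build flags (e.g., -mavx512vbmi2 -mavx512vl) are assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i dest = _mm_set1_epi32(0x01234567);        /* destination: low half */
    __m128i src2 = _mm_set1_epi32((int)0x89ABCDEFu);  /* source: high half */
    __m128i src3 = _mm_setr_epi32(0, 8, 16, 24);      /* per-lane shift counts */

    __m128i r = _mm_shrdv_epi32(dest, src2, src3);

    uint32_t out[4];
    uint32_t cnt[4] = {0, 8, 16, 24};
    _mm_storeu_si128((__m128i *)out, r);

    uint64_t cat = ((uint64_t)0x89ABCDEFu << 32) | 0x01234567u;
    for (int j = 0; j < 4; j++)   /* compare against a scalar model */
        printf("lane %d: %08x (ref %08x)\n", j, out[j],
               (uint32_t)(cat >> (cnt[j] & 31)));
    return 0;
}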

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VPSHUFBITQMB—Shuffle Bits From Quadword Elements Using Byte Indexes Into Mask
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 8F /r A V/V (AVX512_BITALG Extract values in xmm2 using control bits of
VPSHUFBITQMB k1{k2}, xmm2, AND AVX512VL) xmm3/m128 with writemask k2 and leave the
xmm3/m128 OR AVX10.11 result in mask register k1.
EVEX.256.66.0F38.W0 8F /r A V/V (AVX512_BITALG Extract values in ymm2 using control bits of
VPSHUFBITQMB k1{k2}, ymm2, AND AVX512VL) ymm3/m256 with writemask k2 and leave the
ymm3/m256 OR AVX10.11 result in mask register k1.
EVEX.512.66.0F38.W0 8F /r A V/V AVX512_BITALG Extract values in zmm2 using control bits of
VPSHUFBITQMB k1{k2}, zmm2, OR AVX10.11 zmm3/m512 with writemask k2 and leave the
zmm3/m512 result in mask register k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
The VPSHUFBITQMB instruction performs a bit-gather select, using the second source operand as control and the
first source operand as data. Each destination mask bit uses 6 control bits (from the second source operand) to
select which data bit of the first source operand is gathered. A given destination bit can only access 64 different
bits of data (the first 64 destination bits can access the first 64 data bits, the second 64 destination bits can access
the second 64 data bits, etc.).
Control data for each output bit is stored in the 8-bit elements of SRC2, but only the 6 least significant bits of each
element are used.
This instruction uses write masking (zeroing only). This instruction supports memory fault suppression.
The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or
a memory location. The destination operand is a mask register.

Operation
VPSHUFBITQMB DEST, SRC1, SRC2
(KL, VL) = (16,128), (32,256), (64, 512)
FOR i := 0 TO KL/8-1: //Qword
FOR j := 0 to 7: // Byte
IF k2[i*8+j] or *no writemask*:
m := SRC2.qword[i].byte[j] & 0x3F
k1[i*8+j] := SRC1.qword[i].bit[m]
ELSE:
k1[i*8+j] := 0
k1[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VPSHUFBITQMB __mmask16 _mm_bitshuffle_epi64_mask(__m128i, __m128i);
VPSHUFBITQMB __mmask16 _mm_mask_bitshuffle_epi64_mask(__mmask16, __m128i, __m128i);
VPSHUFBITQMB __mmask32 _mm256_bitshuffle_epi64_mask(__m256i, __m256i);
VPSHUFBITQMB __mmask32 _mm256_mask_bitshuffle_epi64_mask(__mmask32, __m256i, __m256i);
VPSHUFBITQMB __mmask64 _mm512_bitshuffle_epi64_mask(__m512i, __m512i);
VPSHUFBITQMB __mmask64 _mm512_mask_bitshuffle_epi64_mask(__mmask64, __m512i, __m512i);
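
The following example is an illustrative sketch, not part of the instruction definition. It demonstrates the bit-gather
behavior through the _mm_bitshuffle_epi64_mask intrinsic listed above; the probe positions are arbitrary example
values, and the build flags (e.g., -mavx512bitalg -mavx512vl) are assumptions of the example.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Data: bit 5 set in qword 0, bit 63 set in qword 1. */
    __m128i data = _mm_set_epi64x((long long)(1ULL << 63), 1LL << 5);

    /* Index bytes: mask bits 0-7 probe bits 0..7 of qword 0; mask bits
       8-15 probe bits 56..63 of qword 1 (only the low 6 bits are used). */
    __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                56, 57, 58, 59, 60, 61, 62, 63);

    __mmask16 k = _mm_bitshuffle_epi64_mask(data, idx);
    printf("mask = 0x%04x\n", (unsigned)k);   /* expected: 0x8020 */
    return 0;
}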

VPSLLVW/VPSLLVD/VPSLLVQ—Variable Bit Shift Left Logical
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 47 /r A V/V AVX2 Shift doublewords in xmm2 left by amount
VPSLLVD xmm1, xmm2, xmm3/m128 specified in the corresponding element of
xmm3/m128 while shifting in 0s.
VEX.128.66.0F38.W1 47 /r A V/V AVX2 Shift quadwords in xmm2 left by amount specified
VPSLLVQ xmm1, xmm2, xmm3/m128 in the corresponding element of xmm3/m128
while shifting in 0s.
VEX.256.66.0F38.W0 47 /r A V/V AVX2 Shift doublewords in ymm2 left by amount
VPSLLVD ymm1, ymm2, ymm3/m256 specified in the corresponding element of
ymm3/m256 while shifting in 0s.
VEX.256.66.0F38.W1 47 /r A V/V AVX2 Shift quadwords in ymm2 left by amount specified
VPSLLVQ ymm1, ymm2, ymm3/m256 in the corresponding element of ymm3/m256
while shifting in 0s.
EVEX.128.66.0F38.W1 12 /r B V/V (AVX512VL AND Shift words in xmm2 left by amount specified in
VPSLLVW xmm1 {k1}{z}, xmm2, AVX512BW) OR the corresponding element of xmm3/m128 while
xmm3/m128 AVX10.11 shifting in 0s using writemask k1.
EVEX.256.66.0F38.W1 12 /r B V/V (AVX512VL AND Shift words in ymm2 left by amount specified in
VPSLLVW ymm1 {k1}{z}, ymm2, AVX512BW) OR the corresponding element of ymm3/m256 while
ymm3/m256 AVX10.11 shifting in 0s using writemask k1.
EVEX.512.66.0F38.W1 12 /r B V/V AVX512BW Shift words in zmm2 left by amount specified in
VPSLLVW zmm1 {k1}{z}, zmm2, OR AVX10.11 the corresponding element of zmm3/m512 while
zmm3/m512 shifting in 0s using writemask k1.
EVEX.128.66.0F38.W0 47 /r C V/V (AVX512VL AND Shift doublewords in xmm2 left by amount
VPSLLVD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m32bcst AVX10.11 xmm3/m128/m32bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W0 47 /r C V/V (AVX512VL AND Shift doublewords in ymm2 left by amount
VPSLLVD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m32bcst AVX10.11 ymm3/m256/m32bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W0 47 /r C V/V AVX512F Shift doublewords in zmm2 left by amount
VPSLLVD zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m32bcst zmm3/m512/m32bcst while shifting in 0s using
writemask k1.
EVEX.128.66.0F38.W1 47 /r C V/V (AVX512VL AND Shift quadwords in xmm2 left by amount specified
VPSLLVQ xmm1 {k1}{z}, xmm2, AVX512F) OR in the corresponding element of
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W1 47 /r C V/V (AVX512VL AND Shift quadwords in ymm2 left by amount specified
VPSLLVQ ymm1 {k1}{z}, ymm2, AVX512F) OR in the corresponding element of
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W1 47 /r C V/V AVX512F Shift quadwords in zmm2 left by amount specified
VPSLLVQ zmm1 {k1}{z}, zmm2, OR AVX10.11 in the corresponding element of
zmm3/m512/m64bcst zmm3/m512/m64bcst while shifting in 0s using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the first source operand to the
left by the count value of the respective data elements in the second source operand. As the bits in the data elements
are shifted left, the empty low-order bits are cleared (set to 0).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for quadwords), then the destination data element is written with 0.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either a YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded VPSLLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcast from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
EVEX encoded VPSLLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is
conditionally updated with writemask k1.

Operation
VPSLLVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := ZeroExtend(SRC1[i+15:i] << SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



VPSLLVD (VEX.128 version)
COUNT_0 := SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2*)
COUNT_3 := SRC2[127 : 96];
IF COUNT_0 < 32 THEN
DEST[31:0] := ZeroExtend(SRC1[31:0] << COUNT_0);
ELSE
DEST[31:0] := 0;
(* Repeat shift operation for 2nd through 4th dwords *)
IF COUNT_3 < 32 THEN
DEST[127:96] := ZeroExtend(SRC1[127:96] << COUNT_3);
ELSE
DEST[127:96] := 0;
DEST[MAXVL-1:128] := 0;

VPSLLVD (VEX.256 version)


COUNT_0 := SRC2[31 : 0];
(* Repeat Each COUNT_i for the 2nd through 7th dwords of SRC2*)
COUNT_7 := SRC2[255 : 224];
IF COUNT_0 < 32 THEN
DEST[31:0] := ZeroExtend(SRC1[31:0] << COUNT_0);
ELSE
DEST[31:0] := 0;
(* Repeat shift operation for 2nd through 7th dwords *)
IF COUNT_7 < 32 THEN
DEST[255:224] := ZeroExtend(SRC1[255:224] << COUNT_7);
ELSE
DEST[255:224] := 0;
DEST[MAXVL-1:256] := 0;

VPSLLVD (EVEX encoded version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := ZeroExtend(SRC1[i+31:i] << SRC2[31:0])
ELSE DEST[i+31:i] := ZeroExtend(SRC1[i+31:i] << SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



VPSLLVQ (VEX.128 version)
COUNT_0 := SRC2[63 : 0];
COUNT_1 := SRC2[127 : 64];
IF COUNT_0 < 64 THEN
DEST[63:0] := ZeroExtend(SRC1[63:0] << COUNT_0);
ELSE
DEST[63:0] := 0;
IF COUNT_1 < 64 THEN
DEST[127:64] := ZeroExtend(SRC1[127:64] << COUNT_1);
ELSE
DEST[127:64] := 0;
DEST[MAXVL-1:128] := 0;

VPSLLVQ (VEX.256 version)


COUNT_0 := SRC2[63 : 0];
(* Repeat Each COUNT_i for the 2nd through 4th qwords of SRC2*)
COUNT_3 := SRC2[255 : 192];
IF COUNT_0 < 64 THEN
DEST[63:0] := ZeroExtend(SRC1[63:0] << COUNT_0);
ELSE
DEST[63:0] := 0;
(* Repeat shift operation for 2nd through 4th qwords *)
IF COUNT_3 < 64 THEN
DEST[255:192] := ZeroExtend(SRC1[255:192] << COUNT_3);
ELSE
DEST[255:192] := 0;
DEST[MAXVL-1:256] := 0;

VPSLLVQ (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := ZeroExtend(SRC1[i+63:i] << SRC2[63:0])
ELSE DEST[i+63:i] := ZeroExtend(SRC1[i+63:i] << SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



Intel C/C++ Compiler Intrinsic Equivalent
VPSLLVW __m512i _mm512_sllv_epi16(__m512i a, __m512i cnt);
VPSLLVW __m512i _mm512_mask_sllv_epi16(__m512i s, __mmask32 k, __m512i a, __m512i cnt);
VPSLLVW __m512i _mm512_maskz_sllv_epi16( __mmask32 k, __m512i a, __m512i cnt);
VPSLLVW __m256i _mm256_mask_sllv_epi16(__m256i s, __mmask16 k, __m256i a, __m256i cnt);
VPSLLVW __m256i _mm256_maskz_sllv_epi16( __mmask16 k, __m256i a, __m256i cnt);
VPSLLVW __m128i _mm_mask_sllv_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVW __m128i _mm_maskz_sllv_epi16( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m512i _mm512_sllv_epi32(__m512i a, __m512i cnt);
VPSLLVD __m512i _mm512_mask_sllv_epi32(__m512i s, __mmask16 k, __m512i a, __m512i cnt);
VPSLLVD __m512i _mm512_maskz_sllv_epi32( __mmask16 k, __m512i a, __m512i cnt);
VPSLLVD __m256i _mm256_mask_sllv_epi32(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSLLVD __m256i _mm256_maskz_sllv_epi32( __mmask8 k, __m256i a, __m256i cnt);
VPSLLVD __m128i _mm_mask_sllv_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m128i _mm_maskz_sllv_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVQ __m512i _mm512_sllv_epi64(__m512i a, __m512i cnt);
VPSLLVQ __m512i _mm512_mask_sllv_epi64(__m512i s, __mmask8 k, __m512i a, __m512i cnt);
VPSLLVQ __m512i _mm512_maskz_sllv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPSLLVQ __m256i _mm256_mask_sllv_epi64(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSLLVQ __m256i _mm256_maskz_sllv_epi64( __mmask8 k, __m256i a, __m256i cnt);
VPSLLVQ __m128i _mm_mask_sllv_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVQ __m128i _mm_maskz_sllv_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m256i _mm256_sllv_epi32 (__m256i m, __m256i count);
VPSLLVQ __m256i _mm256_sllv_epi64 (__m256i m, __m256i count);
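
The following example is an illustrative sketch, not part of the instruction definition. It uses the AVX2 intrinsic
_mm256_sllv_epi32 listed above to show that out-of-range counts produce 0 rather than wrapping the shift amount;
the input values and build flags (e.g., -mavx2) are assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i a   = _mm256_set1_epi32(1);
    /* Counts of 32 or more zero the lane instead of wrapping the
       shift amount, unlike the scalar SHL instruction. */
    __m256i cnt = _mm256_setr_epi32(0, 1, 4, 31, 32, 33, 64, 100);

    __m256i r = _mm256_sllv_epi32(a, cnt);

    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int j = 0; j < 8; j++)
        printf("lane %d: 0x%08x\n", j, out[j]);
    /* expected: 1, 2, 0x10, 0x80000000, then 0 for the four lanes
       whose counts exceed 31 */
    return 0;
}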

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSLLVD/VPSLLVQ, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPSLLVW, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



VPSRAVW/VPSRAVD/VPSRAVQ—Variable Bit Shift Right Arithmetic
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 46 /r A V/V AVX2 Shift doublewords in xmm2 right by amount
VPSRAVD xmm1, xmm2, xmm3/m128 specified in the corresponding element of
xmm3/m128 while shifting in sign bits.
VEX.256.66.0F38.W0 46 /r A V/V AVX2 Shift doublewords in ymm2 right by amount
VPSRAVD ymm1, ymm2, ymm3/m256 specified in the corresponding element of
ymm3/m256 while shifting in sign bits.
EVEX.128.66.0F38.W1 11 /r B V/V (AVX512VL AND Shift words in xmm2 right by amount specified
VPSRAVW xmm1 {k1}{z}, xmm2, AVX512BW) OR in the corresponding element of xmm3/m128
xmm3/m128 AVX10.11 while shifting in sign bits using writemask k1.
EVEX.256.66.0F38.W1 11 /r B V/V (AVX512VL AND Shift words in ymm2 right by amount specified
VPSRAVW ymm1 {k1}{z}, ymm2, AVX512BW) OR in the corresponding element of ymm3/m256
ymm3/m256 AVX10.11 while shifting in sign bits using writemask k1.
EVEX.512.66.0F38.W1 11 /r B V/V AVX512BW Shift words in zmm2 right by amount specified in
VPSRAVW zmm1 {k1}{z}, zmm2, OR AVX10.11 the corresponding element of zmm3/m512
zmm3/m512 while shifting in sign bits using writemask k1.
EVEX.128.66.0F38.W0 46 /r C V/V (AVX512VL AND Shift doublewords in xmm2 right by amount
VPSRAVD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m32bcst AVX10.11 xmm3/m128/m32bcst while shifting in sign bits
using writemask k1.
EVEX.256.66.0F38.W0 46 /r C V/V (AVX512VL AND Shift doublewords in ymm2 right by amount
VPSRAVD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m32bcst AVX10.11 ymm3/m256/m32bcst while shifting in sign bits
using writemask k1.
EVEX.512.66.0F38.W0 46 /r C V/V AVX512F Shift doublewords in zmm2 right by amount
VPSRAVD zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m32bcst zmm3/m512/m32bcst while shifting in sign bits
using writemask k1.
EVEX.128.66.0F38.W1 46 /r C V/V (AVX512VL AND Shift quadwords in xmm2 right by amount
VPSRAVQ xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst while shifting in sign bits
using writemask k1.
EVEX.256.66.0F38.W1 46 /r C V/V (AVX512VL AND Shift quadwords in ymm2 right by amount
VPSRAVQ ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst while shifting in sign bits
using writemask k1.

EVEX.512.66.0F38.W1 46 /r C V/V AVX512F Shift quadwords in zmm2 right by amount


VPSRAVQ zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m64bcst zmm3/m512/m64bcst while shifting in sign bits
using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the first source operand (the
second operand) to the right by the number of bits specified in the count value of the respective data elements in
the second source operand (the third operand). As the bits in the data elements are shifted right, the empty
high-order bits are filled with the sign bit (sign extension).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for quadwords), then the destination data element is filled with the corresponding
sign bit of the source element.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either an YMM register or a 256-bit memory. Bits (MAXVL-1:256) of the corresponding destination register are
zeroed.
EVEX.512/256/128 encoded VPSRAVD/Q: The destination and first source operands are ZMM/YMM/XMM registers.
The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a
512/256/128-bit vector broadcast from a 32/64-bit memory location. The destination is conditionally updated
with writemask k1.
EVEX.512/256/128 encoded VPSRAVW: The destination and first source operands are ZMM/YMM/XMM registers.
The count operand can be either a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination
is conditionally updated with writemask k1.

Operation
VPSRAVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
COUNT := SRC2[i+15:i]
IF COUNT < 16
THEN DEST[i+15:i] := SignExtend(SRC1[i+15:i] >> COUNT)
ELSE
FOR k := 0 TO 15
DEST[i+k] := SRC1[i+15]
ENDFOR;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;



ENDFOR;
DEST[MAXVL-1:VL] := 0;

VPSRAVD (VEX.128 version)


COUNT_0 := SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2*)
COUNT_3 := SRC2[127 : 96];
DEST[31:0] := SignExtend(SRC1[31:0] >> COUNT_0);
(* Repeat shift operation for 2nd through 4th dwords *)
DEST[127:96] := SignExtend(SRC1[127:96] >> COUNT_3);
DEST[MAXVL-1:128] := 0;

VPSRAVD (VEX.256 version)


COUNT_0 := SRC2[31 : 0];
(* Repeat Each COUNT_i for the 2nd through 8th dwords of SRC2*)
COUNT_7 := SRC2[255 : 224];
DEST[31:0] := SignExtend(SRC1[31:0] >> COUNT_0);
(* Repeat shift operation for 2nd through 7th dwords *)
DEST[255:224] := SignExtend(SRC1[255:224] >> COUNT_7);
DEST[MAXVL-1:256] := 0;

VPSRAVD (EVEX encoded version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
COUNT := SRC2[31:0]
IF COUNT < 32
THEN DEST[i+31:i] := SignExtend(SRC1[i+31:i] >> COUNT)
ELSE
FOR k := 0 TO 31
DEST[i+k] := SRC1[i+31]
ENDFOR;
FI
ELSE
COUNT := SRC2[i+31:i]
IF COUNT < 32
THEN DEST[i+31:i] := SignExtend(SRC1[i+31:i] >> COUNT)
ELSE
FOR k := 0 TO 31
DEST[i+k] := SRC1[i+31]
ENDFOR;
FI
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;



DEST[MAXVL-1:VL] := 0;

VPSRAVQ (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
COUNT := SRC2[63:0]
IF COUNT < 64
THEN DEST[i+63:i] := SignExtend(SRC1[i+63:i] >> COUNT)
ELSE
FOR k := 0 TO 63
DEST[i+k] := SRC1[i+63]
ENDFOR;
FI
ELSE
COUNT := SRC2[i+63:i]
IF COUNT < 64
THEN DEST[i+63:i] := SignExtend(SRC1[i+63:i] >> COUNT)
ELSE
FOR k := 0 TO 63
DEST[i+k] := SRC1[i+63]
ENDFOR;
FI
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



Intel C/C++ Compiler Intrinsic Equivalent
VPSRAVD __m512i _mm512_srav_epi32(__m512i a, __m512i cnt);
VPSRAVD __m512i _mm512_mask_srav_epi32(__m512i s, __mmask16 m, __m512i a, __m512i cnt);
VPSRAVD __m512i _mm512_maskz_srav_epi32(__mmask16 m, __m512i a, __m512i cnt);
VPSRAVD __m256i _mm256_srav_epi32(__m256i a, __m256i cnt);
VPSRAVD __m256i _mm256_mask_srav_epi32(__m256i s, __mmask8 m, __m256i a, __m256i cnt);
VPSRAVD __m256i _mm256_maskz_srav_epi32(__mmask8 m, __m256i a, __m256i cnt);
VPSRAVD __m128i _mm_srav_epi32(__m128i a, __m128i cnt);
VPSRAVD __m128i _mm_mask_srav_epi32(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVD __m128i _mm_maskz_srav_epi32(__mmask8 m, __m128i a, __m128i cnt);
VPSRAVQ __m512i _mm512_srav_epi64(__m512i a, __m512i cnt);
VPSRAVQ __m512i _mm512_mask_srav_epi64(__m512i s, __mmask8 m, __m512i a, __m512i cnt);
VPSRAVQ __m512i _mm512_maskz_srav_epi64( __mmask8 m, __m512i a, __m512i cnt);
VPSRAVQ __m256i _mm256_srav_epi64(__m256i a, __m256i cnt);
VPSRAVQ __m256i _mm256_mask_srav_epi64(__m256i s, __mmask8 m, __m256i a, __m256i cnt);
VPSRAVQ __m256i _mm256_maskz_srav_epi64( __mmask8 m, __m256i a, __m256i cnt);
VPSRAVQ __m128i _mm_srav_epi64(__m128i a, __m128i cnt);
VPSRAVQ __m128i _mm_mask_srav_epi64(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVQ __m128i _mm_maskz_srav_epi64( __mmask8 m, __m128i a, __m128i cnt);
VPSRAVW __m512i _mm512_srav_epi16(__m512i a, __m512i cnt);
VPSRAVW __m512i _mm512_mask_srav_epi16(__m512i s, __mmask32 m, __m512i a, __m512i cnt);
VPSRAVW __m512i _mm512_maskz_srav_epi16(__mmask32 m, __m512i a, __m512i cnt);
VPSRAVW __m256i _mm256_srav_epi16(__m256i a, __m256i cnt);
VPSRAVW __m256i _mm256_mask_srav_epi16(__m256i s, __mmask16 m, __m256i a, __m256i cnt);
VPSRAVW __m256i _mm256_maskz_srav_epi16(__mmask16 m, __m256i a, __m256i cnt);
VPSRAVW __m128i _mm_srav_epi16(__m128i a, __m128i cnt);
VPSRAVW __m128i _mm_mask_srav_epi16(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVW __m128i _mm_maskz_srav_epi16(__mmask8 m, __m128i a, __m128i cnt);
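
The following example is an illustrative sketch, not part of the instruction definition. It uses the AVX2 intrinsic
_mm256_srav_epi32 listed above to show sign propagation and the fill behavior for counts above 31; the input
values and build flags (e.g., -mavx2) are assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i a   = _mm256_setr_epi32(-64, -64, -64, -64, 64, 64, 64, 64);
    __m256i cnt = _mm256_setr_epi32(  0,   3,   6,  40,  0,  3,  6, 40);

    /* Arithmetic shift: negative lanes replicate the sign bit, and a
       count above 31 fills the lane with the sign bit (-1 or 0). */
    __m256i r = _mm256_srav_epi32(a, cnt);

    int32_t out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int j = 0; j < 8; j++)
        printf("lane %d: %d\n", j, out[j]);
    /* expected: -64 -8 -1 -1 64 8 1 0 */
    return 0;
}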

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 45 /r A V/V AVX2 Shift doublewords in xmm2 right by amount
VPSRLVD xmm1, xmm2, specified in the corresponding element of
xmm3/m128 xmm3/m128 while shifting in 0s.
VEX.128.66.0F38.W1 45 /r A V/V AVX2 Shift quadwords in xmm2 right by amount
VPSRLVQ xmm1, xmm2, specified in the corresponding element of
xmm3/m128 xmm3/m128 while shifting in 0s.
VEX.256.66.0F38.W0 45 /r A V/V AVX2 Shift doublewords in ymm2 right by amount
VPSRLVD ymm1, ymm2, specified in the corresponding element of
ymm3/m256 ymm3/m256 while shifting in 0s.
VEX.256.66.0F38.W1 45 /r A V/V AVX2 Shift quadwords in ymm2 right by amount
VPSRLVQ ymm1, ymm2, specified in the corresponding element of
ymm3/m256 ymm3/m256 while shifting in 0s.
EVEX.128.66.0F38.W1 10 /r B V/V (AVX512VL AND Shift words in xmm2 right by amount specified in
VPSRLVW xmm1 {k1}{z}, xmm2, AVX512BW) OR the corresponding element of xmm3/m128 while
xmm3/m128 AVX10.11 shifting in 0s using writemask k1.
EVEX.256.66.0F38.W1 10 /r B V/V (AVX512VL AND Shift words in ymm2 right by amount specified in
VPSRLVW ymm1 {k1}{z}, ymm2, AVX512BW) OR the corresponding element of ymm3/m256 while
ymm3/m256 AVX10.11 shifting in 0s using writemask k1.
EVEX.512.66.0F38.W1 10 /r B V/V AVX512BW Shift words in zmm2 right by amount specified in
VPSRLVW zmm1 {k1}{z}, zmm2, OR AVX10.11 the corresponding element of zmm3/m512 while
zmm3/m512 shifting in 0s using writemask k1.
EVEX.128.66.0F38.W0 45 /r C V/V (AVX512VL AND Shift doublewords in xmm2 right by amount
VPSRLVD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m32bcst AVX10.11 xmm3/m128/m32bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W0 45 /r C V/V (AVX512VL AND Shift doublewords in ymm2 right by amount
VPSRLVD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m32bcst AVX10.11 ymm3/m256/m32bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W0 45 /r C V/V AVX512F Shift doublewords in zmm2 right by amount
VPSRLVD zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m32bcst zmm3/m512/m32bcst while shifting in 0s using
writemask k1.
EVEX.128.66.0F38.W1 45 /r C V/V (AVX512VL AND Shift quadwords in xmm2 right by amount
VPSRLVQ xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W1 45 /r C V/V (AVX512VL AND Shift quadwords in ymm2 right by amount
VPSRLVQ ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W1 45 /r C V/V AVX512F Shift quadwords in zmm2 right by amount
VPSRLVQ zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m64bcst zmm3/m512/m64bcst while shifting in 0s using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.



Instruction Operand Encoding
Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
B Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the first source operand to the
right by the count value of the respective data elements in the second source operand. As the bits in the data elements
are shifted right, the empty high-order bits are cleared (set to 0).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for quadwords), then the destination data element is written with 0.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either a YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded VPSRLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcast from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
EVEX encoded VPSRLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is
conditionally updated with writemask k1.

Operation
VPSRLVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := ZeroExtend(SRC1[i+15:i] >> SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



VPSRLVD (VEX.128 version)
COUNT_0 := SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2*)
COUNT_3 := SRC2[127 : 96];
IF COUNT_0 < 32 THEN
DEST[31:0] := ZeroExtend(SRC1[31:0] >> COUNT_0);
ELSE
DEST[31:0] := 0;
(* Repeat shift operation for 2nd through 4th dwords *)
IF COUNT_3 < 32 THEN
DEST[127:96] := ZeroExtend(SRC1[127:96] >> COUNT_3);
ELSE
DEST[127:96] := 0;
DEST[MAXVL-1:128] := 0;

VPSRLVD (VEX.256 version)


COUNT_0 := SRC2[31 : 0];
(* Repeat Each COUNT_i for the 2nd through 7th dwords of SRC2*)
COUNT_7 := SRC2[255 : 224];
IF COUNT_0 < 32 THEN
DEST[31:0] := ZeroExtend(SRC1[31:0] >> COUNT_0);
ELSE
DEST[31:0] := 0;
(* Repeat shift operation for 2nd through 7th dwords *)
IF COUNT_7 < 32 THEN
DEST[255:224] := ZeroExtend(SRC1[255:224] >> COUNT_7);
ELSE
DEST[255:224] := 0;
DEST[MAXVL-1:256] := 0;

VPSRLVD (EVEX encoded version)


(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := ZeroExtend(SRC1[i+31:i] >> SRC2[31:0])
ELSE DEST[i+31:i] := ZeroExtend(SRC1[i+31:i] >> SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



VPSRLVQ (VEX.128 version)
COUNT_0 := SRC2[63 : 0];
COUNT_1 := SRC2[127 : 64];
IF COUNT_0 < 64 THEN
DEST[63:0] := ZeroExtend(SRC1[63:0] >> COUNT_0);
ELSE
DEST[63:0] := 0;
IF COUNT_1 < 64 THEN
DEST[127:64] := ZeroExtend(SRC1[127:64] >> COUNT_1);
ELSE
DEST[127:64] := 0;
DEST[MAXVL-1:128] := 0;

VPSRLVQ (VEX.256 version)


COUNT_0 := SRC2[63 : 0];
(* Repeat Each COUNT_i for the 2nd through 4th qwords of SRC2*)
COUNT_3 := SRC2[255 : 192];
IF COUNT_0 < 64 THEN
DEST[63:0] := ZeroExtend(SRC1[63:0] >> COUNT_0);
ELSE
DEST[63:0] := 0;
(* Repeat shift operation for 2nd through 4th qwords *)
IF COUNT_3 < 64 THEN
DEST[255:192] := ZeroExtend(SRC1[255:192] >> COUNT_3);
ELSE
DEST[255:192] := 0;
DEST[MAXVL-1:256] := 0;

VPSRLVQ (EVEX encoded version)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := ZeroExtend(SRC1[i+63:i] >> SRC2[63:0])
ELSE DEST[i+63:i] := ZeroExtend(SRC1[i+63:i] >> SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;



Intel C/C++ Compiler Intrinsic Equivalent
VPSRLVW __m512i _mm512_srlv_epi16(__m512i a, __m512i cnt);
VPSRLVW __m512i _mm512_mask_srlv_epi16(__m512i s, __mmask32 k, __m512i a, __m512i cnt);
VPSRLVW __m512i _mm512_maskz_srlv_epi16( __mmask32 k, __m512i a, __m512i cnt);
VPSRLVW __m256i _mm256_mask_srlv_epi16(__m256i s, __mmask16 k, __m256i a, __m256i cnt);
VPSRLVW __m256i _mm256_maskz_srlv_epi16( __mmask16 k, __m256i a, __m256i cnt);
VPSRLVW __m128i _mm_mask_srlv_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVW __m128i _mm_maskz_srlv_epi16( __mmask8 k, __m128i a, __m128i cnt);
VPSRLVD __m256i _mm256_srlv_epi32 (__m256i m, __m256i count);
VPSRLVD __m512i _mm512_srlv_epi32(__m512i a, __m512i cnt);
VPSRLVD __m512i _mm512_mask_srlv_epi32(__m512i s, __mmask16 k, __m512i a, __m512i cnt);
VPSRLVD __m512i _mm512_maskz_srlv_epi32( __mmask16 k, __m512i a, __m512i cnt);
VPSRLVD __m256i _mm256_mask_srlv_epi32(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSRLVD __m256i _mm256_maskz_srlv_epi32( __mmask8 k, __m256i a, __m256i cnt);
VPSRLVD __m128i _mm_mask_srlv_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVD __m128i _mm_maskz_srlv_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m512i _mm512_srlv_epi64(__m512i a, __m512i cnt);
VPSRLVQ __m512i _mm512_mask_srlv_epi64(__m512i s, __mmask8 k, __m512i a, __m512i cnt);
VPSRLVQ __m512i _mm512_maskz_srlv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPSRLVQ __m256i _mm256_mask_srlv_epi64(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSRLVQ __m256i _mm256_maskz_srlv_epi64( __mmask8 k, __m256i a, __m256i cnt);
VPSRLVQ __m128i _mm_mask_srlv_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m128i _mm_maskz_srlv_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m256i _mm256_srlv_epi64 (__m256i m, __m256i count);
VPSRLVD __m128i _mm_srlv_epi32( __m128i a, __m128i cnt);
VPSRLVQ __m128i _mm_srlv_epi64( __m128i a, __m128i cnt);
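
The following example is an illustrative sketch, not part of the instruction definition. It uses the AVX2 intrinsic
_mm256_srlv_epi64 listed above to show the zero-fill behavior, including a count of 64 that clears the lane; the
input values and build flags (e.g., -mavx2) are assumptions of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i a   = _mm256_set1_epi64x((long long)0x8000000000000000ULL);
    __m256i cnt = _mm256_setr_epi64x(0, 1, 63, 64);

    /* Logical shift: zeros are shifted in, and a count of 64 or more
       zeroes the lane entirely. */
    __m256i r = _mm256_srlv_epi64(a, cnt);

    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int j = 0; j < 4; j++)
        printf("lane %d: 0x%016llx\n", j, (unsigned long long)out[j]);
    /* expected: 0x8000000000000000, 0x4000000000000000, 1, 0 */
    return 0;
}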

SIMD Floating-Point Exceptions


None.

Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSRLVD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPSRLVW, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”



VPTERNLOGD/VPTERNLOGQ—Bitwise Ternary Logic
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W0 25 /r ib A V/V (AVX512VL AND Bitwise ternary logic taking xmm1, xmm2, and
VPTERNLOGD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst as source operands and
xmm3/m128/m32bcst, imm8 AVX10.11 writing the result to xmm1 under writemask k1
with dword granularity. The immediate value
determines the specific binary function being
implemented.
EVEX.256.66.0F3A.W0 25 /r ib A V/V (AVX512VL AND Bitwise ternary logic taking ymm1, ymm2, and
VPTERNLOGD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst as source operands and
ymm3/m256/m32bcst, imm8 AVX10.11 writing the result to ymm1 under writemask k1
with dword granularity. The immediate value
determines the specific binary function being
implemented.
EVEX.512.66.0F3A.W0 25 /r ib A V/V AVX512F Bitwise ternary logic taking zmm1, zmm2, and
VPTERNLOGD zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst as source operands and
zmm3/m512/m32bcst, imm8 writing the result to zmm1 under writemask k1
with dword granularity. The immediate value
determines the specific binary function being
implemented.
EVEX.128.66.0F3A.W1 25 /r ib A V/V (AVX512VL AND Bitwise ternary logic taking xmm1, xmm2, and
VPTERNLOGQ xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m64bcst as source operands and
xmm3/m128/m64bcst, imm8 AVX10.11 writing the result to xmm1 under writemask k1
with qword granularity. The immediate value
determines the specific binary function being
implemented.
EVEX.256.66.0F3A.W1 25 /r ib A V/V (AVX512VL AND Bitwise ternary logic taking ymm1, ymm2, and
VPTERNLOGQ ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m64bcst as source operands and
ymm3/m256/m64bcst, imm8 AVX10.11 writing the result to ymm1 under writemask k1
with qword granularity. The immediate value
determines the specific binary function being
implemented.
EVEX.512.66.0F3A.W1 25 /r ib A V/V AVX512F Bitwise ternary logic taking zmm1, zmm2, and
VPTERNLOGQ zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m64bcst as source operands and
zmm3/m512/m64bcst, imm8 writing the result to zmm1 under writemask k1
with qword granularity. The immediate value
determines the specific binary function being
implemented.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum
supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) imm8



Description
VPTERNLOGD/Q takes three bit vectors of 512/256/128-bit length (in the first, second, and third operand) as input
data to form a set of 512/256/128 indices; each index is comprised of one bit from each input vector. The imm8
byte specifies a Boolean logic table producing a binary value for each 3-bit index value. The resulting Boolean
vector is written to the destination operand (the first operand) using the writemask k1 with doubleword or
quadword granularity.
The destination operand is a ZMM (EVEX.512)/YMM (EVEX.256)/XMM (EVEX.128) register. The first source
operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a
512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 32/64-bit memory location. The
destination operand is conditionally updated with writemask k1.
Table 5-20 shows two examples of Boolean functions, specified by immediate values 0xE2 and 0xE4, with the
lookup results listed in the columns following the three columns containing all possible values of the 3-bit index.

Table 5-20. Examples of VPTERNLOGD/Q Imm8 Boolean Function and Input Index Values
VPTERNLOGD reg1, reg2, src3, imm8

Bit(reg1)  Bit(reg2)  Bit(src3)  Result with Imm8=0xE2  Result with Imm8=0xE4
    0          0          0                0                      0
    0          0          1                1                      0
    0          1          0                0                      1
    0          1          1                0                      0
    1          0          0                0                      0
    1          0          1                1                      1
    1          1          0                1                      1
    1          1          1                1                      1

Specifying different values in imm8 allows any arbitrary three-input Boolean function to be implemented in
software using VPTERNLOGD/Q. Table 5-1 and Table 5-2 provide a mapping of all 256 possible imm8 values to
various Boolean expressions.
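
As an illustrative sketch (not part of the instruction definition), the imm8 value for a desired three-input Boolean
function can be derived mechanically: bit (bit(dest) << 2) + (bit(src1) << 1) + bit(src2) of imm8 holds the function's
output for that input combination. The helper below, including the names ternlog_imm8, xor3, and mux, is
hypothetical example code rather than a library routine.

#include <stdio.h>

/* Build a VPTERNLOGD/Q imm8 for a three-input Boolean function f(a,b,c):
   bit index (a << 2) | (b << 1) | c of imm8 holds f(a,b,c), where a is
   the destination operand's bit, b the first source's, c the second's. */
static unsigned ternlog_imm8(unsigned (*f)(unsigned, unsigned, unsigned)) {
    unsigned imm = 0;
    for (unsigned i = 0; i < 8; i++) {
        unsigned a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        imm |= f(a, b, c) << i;
    }
    return imm;
}

static unsigned xor3(unsigned a, unsigned b, unsigned c) { return a ^ b ^ c; }
static unsigned mux(unsigned a, unsigned b, unsigned c) { return a ? b : c; }

int main(void) {
    printf("a ^ b ^ c -> 0x%02X\n", ternlog_imm8(xor3)); /* 0x96 */
    printf("a ? b : c -> 0x%02X\n", ternlog_imm8(mux));  /* 0xCA */
    return 0;
}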

Operation
VPTERNLOGD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
FOR k := 0 TO 31
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+k] := imm8[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ k ]]
ELSE DEST[i+k] := imm8[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ i+k ]]
FI;
; table lookup of immediate below
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31+i:i] remains unchanged*
ELSE ; zeroing-masking
DEST[31+i:i] := 0
FI;
FI;
ENDFOR;



DEST[MAXVL-1:VL] := 0

VPTERNLOGQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
FOR k := 0 TO 63
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+k] := imm8[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ k ]]
ELSE DEST[i+k] := imm8[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ i+k ]]
FI; ; table lookup of immediate below
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63+i:i] remains unchanged*
ELSE ; zeroing-masking
DEST[63+i:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalents


VPTERNLOGD __m512i _mm512_ternarylogic_epi32(__m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGD __m512i _mm512_mask_ternarylogic_epi32(__m512i s, __mmask16 m, __m512i a, __m512i b, int imm);
VPTERNLOGD __m512i _mm512_maskz_ternarylogic_epi32(__mmask16 m, __m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGD __m256i _mm256_ternarylogic_epi32(__m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGD __m256i _mm256_mask_ternarylogic_epi32(__m256i s, __mmask8 m, __m256i a, __m256i b, int imm);
VPTERNLOGD __m256i _mm256_maskz_ternarylogic_epi32( __mmask8 m, __m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGD __m128i _mm_ternarylogic_epi32(__m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGD __m128i _mm_mask_ternarylogic_epi32(__m128i s, __mmask8 m, __m128i a, __m128i b, int imm);
VPTERNLOGD __m128i _mm_maskz_ternarylogic_epi32( __mmask8 m, __m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGQ __m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGQ __m512i _mm512_mask_ternarylogic_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b, int imm);
VPTERNLOGQ __m512i _mm512_maskz_ternarylogic_epi64( __mmask8 m, __m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGQ __m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGQ __m256i _mm256_mask_ternarylogic_epi64(__m256i s, __mmask8 m, __m256i a, __m256i b, int imm);
VPTERNLOGQ __m256i _mm256_maskz_ternarylogic_epi64( __mmask8 m, __m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGQ __m128i _mm_ternarylogic_epi64(__m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGQ __m128i _mm_mask_ternarylogic_epi64(__m128i s, __mmask8 m, __m128i a, __m128i b, int imm);
VPTERNLOGQ __m128i _mm_maskz_ternarylogic_epi64( __mmask8 m, __m128i a, __m128i b, __m128i c, int imm);
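
The following example is an illustrative sketch, not part of the instruction definition. It uses the
_mm_ternarylogic_epi32 intrinsic listed above with imm8 = 0x96 (the three-input XOR derived earlier) and checks
one lane against a scalar reference; the input values and build flags (e.g., -mavx512f -mavx512vl) are assumptions
of the example.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set1_epi32((int)0xF0F0F0F0u);
    __m128i b = _mm_set1_epi32((int)0xCCCCCCCCu);
    __m128i c = _mm_set1_epi32((int)0xAAAAAAAAu);

    /* imm8 = 0x96 selects a ^ b ^ c, replacing two PXOR instructions
       with a single VPTERNLOGD. */
    __m128i r = _mm_ternarylogic_epi32(a, b, c, 0x96);

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("0x%08x (ref 0x%08x)\n", out[0],
           0xF0F0F0F0u ^ 0xCCCCCCCCu ^ 0xAAAAAAAAu);
    return 0;
}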

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VPTESTMB/VPTESTMW/VPTESTMD/VPTESTMQ—Logical AND and Set Mask
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 26 /r A V/V (AVX512VL AND Bitwise AND of packed byte integers in xmm2 and
VPTESTMB k2 {k1}, xmm2, AVX512BW) OR xmm3/m128 and set mask k2 to reflect the zero/non-
xmm3/m128 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.256.66.0F38.W0 26 /r A V/V (AVX512VL AND Bitwise AND of packed byte integers in ymm2 and
VPTESTMB k2 {k1}, ymm2, AVX512BW) OR ymm3/m256 and set mask k2 to reflect the zero/non-
ymm3/m256 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.512.66.0F38.W0 26 /r A V/V AVX512BW Bitwise AND of packed byte integers in zmm2 and
VPTESTMB k2 {k1}, zmm2, OR AVX10.11 zmm3/m512 and set mask k2 to reflect the zero/non-
zmm3/m512 zero status of each element of the result, under
writemask k1.
EVEX.128.66.0F38.W1 26 /r A V/V (AVX512VL AND Bitwise AND of packed word integers in xmm2 and
VPTESTMW k2 {k1}, xmm2, AVX512BW) OR xmm3/m128 and set mask k2 to reflect the zero/non-
xmm3/m128 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.256.66.0F38.W1 26 /r A V/V (AVX512VL AND Bitwise AND of packed word integers in ymm2 and
VPTESTMW k2 {k1}, ymm2, AVX512BW) OR ymm3/m256 and set mask k2 to reflect the zero/non-
ymm3/m256 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.512.66.0F38.W1 26 /r A V/V AVX512BW Bitwise AND of packed word integers in zmm2 and
VPTESTMW k2 {k1}, zmm2, OR AVX10.11 zmm3/m512 and set mask k2 to reflect the zero/non-
zmm3/m512 zero status of each element of the result, under
writemask k1.
EVEX.128.66.0F38.W0 27 /r B V/V (AVX512VL AND Bitwise AND of packed doubleword integers in xmm2
VPTESTMD k2 {k1}, xmm2, AVX512F) OR and xmm3/m128/m32bcst and set mask k2 to reflect
xmm3/m128/m32bcst AVX10.11 the zero/non-zero status of each element of the result,
under writemask k1.
EVEX.256.66.0F38.W0 27 /r B V/V (AVX512VL AND Bitwise AND of packed doubleword integers in ymm2
VPTESTMD k2 {k1}, ymm2, AVX512F) OR and ymm3/m256/m32bcst and set mask k2 to reflect
ymm3/m256/m32bcst AVX10.11 the zero/non-zero status of each element of the result,
under writemask k1.
EVEX.512.66.0F38.W0 27 /r B V/V AVX512F Bitwise AND of packed doubleword integers in zmm2
VPTESTMD k2 {k1}, zmm2, OR AVX10.11 and zmm3/m512/m32bcst and set mask k2 to reflect
zmm3/m512/m32bcst the zero/non-zero status of each element of the result,
under writemask k1.
EVEX.128.66.0F38.W1 27 /r B V/V (AVX512VL AND Bitwise AND of packed quadword integers in xmm2 and
VPTESTMQ k2 {k1}, xmm2, AVX512F) OR xmm3/m128/m64bcst and set mask k2 to reflect the
xmm3/m128/m64bcst AVX10.11 zero/non-zero status of each element of the result,
under writemask k1.
EVEX.256.66.0F38.W1 27 /r B V/V (AVX512VL AND Bitwise AND of packed quadword integers in ymm2 and
VPTESTMQ k2 {k1}, ymm2, AVX512F) OR ymm3/m256/m64bcst and set mask k2 to reflect the
ymm3/m256/m64bcst AVX10.11 zero/non-zero status of each element of the result,
under writemask k1.
EVEX.512.66.0F38.W1 27 /r B V/V AVX512F Bitwise AND of packed quadword integers in zmm2 and
VPTESTMQ k2 {k1}, zmm2, OR AVX10.11 zmm3/m512/m64bcst and set mask k2 to reflect the
zmm3/m512/m64bcst zero/non-zero status of each element of the result,
under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical AND operation on the first source operand (the second operand) and the second source
operand (the third operand) and stores the result in the destination operand (the first operand) under the write-
mask. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and second
source operands is non-zero; otherwise it is set to 0.
VPTESTMD/VPTESTMQ: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a mask register updated under the writemask.
VPTESTMB/VPTESTMW: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a mask register
updated under the writemask.

Operation
VPTESTMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[j] := (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] != 0)? 1 : 0;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTMW (EVEX encoded versions)


(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[j] := (SRC1[i+15:i] BITWISE AND SRC2[i+15:i] != 0)? 1 : 0;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTMD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j] := (SRC1[i+31:i] BITWISE AND SRC2[31:0] != 0)? 1 : 0;
ELSE DEST[j] := (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] != 0)? 1 : 0;
FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTMQ (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[63:0] != 0)? 1 : 0;
ELSE DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[i+63:i] != 0)? 1 : 0;
FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0
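
As an illustration of the mask semantics above, the following C sketch (illustrative only; the flag constant and
function names are hypothetical, and it assumes an AVX-512F target with <immintrin.h>) builds a mask of the
elements carrying a flag bit and then consumes it:

#include <immintrin.h>

#define FLAG_DIRTY 0x4 /* hypothetical flag bit, for illustration */

/* Mask bit j is set iff element j of 'flags' has FLAG_DIRTY set. */
static __mmask16 dirty_lanes(__m512i flags)
{
    return _mm512_test_epi32_mask(flags, _mm512_set1_epi32(FLAG_DIRTY));
}

/* Zero the dirty elements of 'data' and keep the rest. */
static __m512i clear_dirty(__m512i data, __m512i flags)
{
    __mmask16 keep = _mm512_knot(dirty_lanes(flags));
    return _mm512_maskz_mov_epi32(keep, data);
}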

Intel C/C++ Compiler Intrinsic Equivalents


VPTESTMB __mmask64 _mm512_test_epi8_mask( __m512i a, __m512i b);
VPTESTMB __mmask64 _mm512_mask_test_epi8_mask(__mmask64, __m512i a, __m512i b);
VPTESTMW __mmask32 _mm512_test_epi16_mask( __m512i a, __m512i b);
VPTESTMW __mmask32 _mm512_mask_test_epi16_mask(__mmask32, __m512i a, __m512i b);
VPTESTMD __mmask16 _mm512_test_epi32_mask( __m512i a, __m512i b);
VPTESTMD __mmask16 _mm512_mask_test_epi32_mask(__mmask16, __m512i a, __m512i b);
VPTESTMQ __mmask8 _mm512_test_epi64_mask(__m512i a, __m512i b);
VPTESTMQ __mmask8 _mm512_mask_test_epi64_mask(__mmask8, __m512i a, __m512i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
VPTESTMD/Q: See Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTMB/W: See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

VPTESTNMB/W/D/Q—Logical NAND and Set
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.F3.0F38.W0 26 /r A V/V (AVX512VL AND Bitwise NAND of packed byte integers in xmm2 and
VPTESTNMB k2 {k1}, xmm2, AVX512BW) OR xmm3/m128 and set mask k2 to reflect the zero/non-
xmm3/m128 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.256.F3.0F38.W0 26 /r A V/V (AVX512VL AND Bitwise NAND of packed byte integers in ymm2 and
VPTESTNMB k2 {k1}, ymm2, AVX512BW) OR ymm3/m256 and set mask k2 to reflect the zero/non-
ymm3/m256 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.512.F3.0F38.W0 26 /r A V/V (AVX512F AND Bitwise NAND of packed byte integers in zmm2 and
VPTESTNMB k2 {k1}, zmm2, AVX512BW) OR zmm3/m512 and set mask k2 to reflect the zero/non-
zmm3/m512 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.128.F3.0F38.W1 26 /r A V/V (AVX512VL AND Bitwise NAND of packed word integers in xmm2 and
VPTESTNMW k2 {k1}, xmm2, AVX512BW) OR xmm3/m128 and set mask k2 to reflect the zero/non-
xmm3/m128 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.256.F3.0F38.W1 26 /r A V/V (AVX512VL AND Bitwise NAND of packed word integers in ymm2 and
VPTESTNMW k2 {k1}, ymm2, AVX512BW) OR ymm3/m256 and set mask k2 to reflect the zero/non-
ymm3/m256 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.512.F3.0F38.W1 26 /r A V/V (AVX512F AND Bitwise NAND of packed word integers in zmm2 and
VPTESTNMW k2 {k1}, zmm2, AVX512BW) OR zmm3/m512 and set mask k2 to reflect the zero/non-
zmm3/m512 AVX10.11 zero status of each element of the result, under
writemask k1.
EVEX.128.F3.0F38.W0 27 /r B V/V (AVX512VL AND Bitwise NAND of packed doubleword integers in
VPTESTNMD k2 {k1}, xmm2, AVX512F) OR xmm2 and xmm3/m128/m32bcst and set mask k2 to
xmm3/m128/m32bcst AVX10.11 reflect the zero/non-zero status of each element of
the result, under writemask k1.
EVEX.256.F3.0F38.W0 27 /r B V/V (AVX512VL AND Bitwise NAND of packed doubleword integers in
VPTESTNMD k2 {k1}, ymm2, AVX512F) OR ymm2 and ymm3/m256/m32bcst and set mask k2 to
ymm3/m256/m32bcst AVX10.11 reflect the zero/non-zero status of each element of
the result, under writemask k1.
EVEX.512.F3.0F38.W0 27 /r B V/V AVX512F Bitwise NAND of packed doubleword integers in
VPTESTNMD k2 {k1}, zmm2, OR AVX10.11 zmm2 and zmm3/m512/m32bcst and set mask k2 to
zmm3/m512/m32bcst reflect the zero/non-zero status of each element of
the result, under writemask k1.
EVEX.128.F3.0F38.W1 27 /r B V/V (AVX512VL AND Bitwise NAND of packed quadword integers in xmm2
VPTESTNMQ k2 {k1}, xmm2, AVX512F) OR and xmm3/m128/m64bcst and set mask k2 to reflect
xmm3/m128/m64bcst AVX10.11 the zero/non-zero status of each element of the
result, under writemask k1.
EVEX.256.F3.0F38.W1 27 /r B V/V (AVX512VL AND Bitwise NAND of packed quadword integers in ymm2
VPTESTNMQ k2 {k1}, ymm2, AVX512F) OR and ymm3/m256/m64bcst and set mask k2 to reflect
ymm3/m256/m64bcst AVX10.11 the zero/non-zero status of each element of the
result, under writemask k1.
EVEX.512.F3.0F38.W1 27 /r B V/V AVX512F Bitwise NAND of packed quadword integers in zmm2
VPTESTNMQ k2 {k1}, zmm2, OR AVX10.11 and zmm3/m512/m64bcst and set mask k2 to reflect
zmm3/m512/m64bcst the zero/non-zero status of each element of the
result, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A
B Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical NAND operation on the byte/word/doubleword/quadword element of the first source
operand (the second operand) with the corresponding element of the second source operand (the third operand)
and stores the logical comparison result into each bit of the destination operand (the first operand) according to the
writemask k1. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and
second src operands is zero; otherwise it is set to 0.
EVEX encoded VPTESTNMD/Q: The first source operand is a ZMM/YMM/XMM registers. The second source operand
can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted
from a 32/64-bit memory location. The destination is updated according to the writemask.
EVEX encoded VPTESTNMB/W: The first source operand is a ZMM/YMM/XMM registers. The second source operand
can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is updated according to the
writemask.

Operation
VPTESTNMB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j*8
IF MaskBit(j) OR *no writemask*
THEN
DEST[j] := (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] == 0)? 1 : 0
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTNMW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j*16
IF MaskBit(j) OR *no writemask*
THEN
DEST[j] := (SRC1[i+15:i] BITWISE AND SRC2[i+15:i] == 0)? 1 : 0
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTNMD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j*32
IF MaskBit(j) OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j] := (SRC1[i+31:i] BITWISE AND SRC2[31:0] == 0)? 1 : 0
ELSE DEST[j] := (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0
FI
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0

VPTESTNMQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[63:0] == 0)? 1 : 0;
ELSE DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[i+63:i] == 0)? 1 : 0;
FI;
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0
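
A common idiom follows directly from the definition: testing a vector against itself sets a mask bit exactly for the
all-zero elements, since x AND x is zero if and only if x is zero. A minimal sketch (assumes an AVX-512F target,
<immintrin.h>, and POPCNT support for the count):

#include <immintrin.h>

/* Mask of doubleword elements of v that are all-zero. */
static __mmask16 zero_lanes(__m512i v)
{
    return _mm512_testn_epi32_mask(v, v);
}

/* Number of all-zero doubleword elements in v. */
static int count_zero_lanes(__m512i v)
{
    return _mm_popcnt_u32((unsigned)zero_lanes(v));
}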

Intel C/C++ Compiler Intrinsic Equivalent


VPTESTNMB __mmask64 _mm512_testn_epi8_mask( __m512i a, __m512i b);
VPTESTNMB __mmask64 _mm512_mask_testn_epi8_mask(__mmask64, __m512i a, __m512i b);
VPTESTNMB __mmask32 _mm256_testn_epi8_mask(__m256i a, __m256i b);
VPTESTNMB __mmask32 _mm256_mask_testn_epi8_mask(__mmask32, __m256i a, __m256i b);
VPTESTNMB __mmask16 _mm_testn_epi8_mask(__m128i a, __m128i b);
VPTESTNMB __mmask16 _mm_mask_testn_epi8_mask(__mmask16, __m128i a, __m128i b);
VPTESTNMW __mmask32 _mm512_testn_epi16_mask( __m512i a, __m512i b);
VPTESTNMW __mmask32 _mm512_mask_testn_epi16_mask(__mmask32, __m512i a, __m512i b);
VPTESTNMW __mmask16 _mm256_testn_epi16_mask(__m256i a, __m256i b);
VPTESTNMW __mmask16 _mm256_mask_testn_epi16_mask(__mmask16, __m256i a, __m256i b);
VPTESTNMW __mmask8 _mm_testn_epi16_mask(__m128i a, __m128i b);
VPTESTNMW __mmask8 _mm_mask_testn_epi16_mask(__mmask8, __m128i a, __m128i b);
VPTESTNMD __mmask16 _mm512_testn_epi32_mask( __m512i a, __m512i b);
VPTESTNMD __mmask16 _mm512_mask_testn_epi32_mask(__mmask16, __m512i a, __m512i b);
VPTESTNMD __mmask8 _mm256_testn_epi32_mask(__m256i a, __m256i b);
VPTESTNMD __mmask8 _mm256_mask_testn_epi32_mask(__mmask8, __m256i a, __m256i b);
VPTESTNMD __mmask8 _mm_testn_epi32_mask(__m128i a, __m128i b);
VPTESTNMD __mmask8 _mm_mask_testn_epi32_mask(__mmask8, __m128i a, __m128i b);
VPTESTNMQ __mmask8 _mm512_testn_epi64_mask(__m512i a, __m512i b);
VPTESTNMQ __mmask8 _mm512_mask_testn_epi64_mask(__mmask8, __m512i a, __m512i b);
VPTESTNMQ __mmask8 _mm256_testn_epi64_mask(__m256i a, __m256i b);
VPTESTNMQ __mmask8 _mm256_mask_testn_epi64_mask(__mmask8, __m256i a, __m256i b);
VPTESTNMQ __mmask8 _mm_testn_epi64_mask(__m128i a, __m128i b);
VPTESTNMQ __mmask8 _mm_mask_testn_epi64_mask(__mmask8, __m128i a, __m128i b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
VPTESTNMD/VPTESTNMQ: See Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTNMB/VPTESTNMW: See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”

VRANGEPD—Range Restriction Calculation for Packed Pairs of Float64 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W1 50 /r ib A V/V (AVX512VL Calculate two RANGE operation output values from 2
VRANGEPD xmm1 {k1}{z}, xmm2, AND AVX512DQ) pairs of double precision floating-point values in
xmm3/m128/m64bcst, imm8 OR AVX10.11 xmm2 and xmm3/m128/m64bcst, store the results
to xmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.256.66.0F3A.W1 50 /r ib A V/V (AVX512VL Calculate four RANGE operation output values from
VRANGEPD ymm1 {k1}{z}, ymm2, AND AVX512DQ) 4 pairs of double precision floating-point values in
ymm3/m256/m64bcst, imm8 OR AVX10.11 ymm2 and ymm3/m256/m64bcst, store the results
to ymm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.512.66.0F3A.W1 50 /r ib A V/V AVX512DQ Calculate eight RANGE operation output values from
VRANGEPD zmm1 {k1}{z}, zmm2, OR AVX10.11 8 pairs of double precision floating-point values in
zmm3/m512/m64bcst{sae}, imm8 zmm2 and zmm3/m512/m64bcst, store the results
to zmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
This instruction calculates 2/4/8 range operation outputs from two sets of packed input double precision floating-
point values in the first source operand (the second operand) and the second source operand (the third operand).
The range outputs are written to the destination operand (the first operand) under the writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.

imm8 layout: bits 7:4 must be zero; bits 3:2 = Sign Control (SC); bits 1:0 = Compare Operation Select.

Imm8[1:0] = 00b : Select Min value           Imm8[3:2] = 00b : Select sign(SRC1)
Imm8[1:0] = 01b : Select Max value           Imm8[3:2] = 01b : Select sign(Compare_Result)
Imm8[1:0] = 10b : Select Min-Abs value       Imm8[3:2] = 10b : Set sign to 0
Imm8[1:0] = 11b : Select Max-Abs value       Imm8[3:2] = 11b : Set sign to 1

Figure 5-27. Imm8 Controls for VRANGEPD/SD/PS/SS

When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
behavior for cases in which one or more input values is a NaN is listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite signs, the MIN/MAX comparison in the range operation is slightly
different from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, for non-zero, equal-magnitude input values with opposite signs, the MIN_ABS and MAX_ABS
comparison operations produce the results listed in Table 5-23.

Table 5-21. Signaling of Comparison Operation of One or More NaN Input Values and Effect of Imm8[3:2]
Src1 Src2 Result IE Signaling Due to Comparison Imm8[3:2] Effect to Range Output
sNaN1 sNaN2 Quiet(sNaN1) Yes Ignored
sNaN1 qNaN2 Quiet(sNaN1) Yes Ignored
sNaN1 Norm2 Quiet(sNaN1) Yes Ignored
qNaN1 sNaN2 Quiet(sNaN2) Yes Ignored
qNaN1 qNaN2 qNaN1 No Applicable
qNaN1 Norm2 Norm2 No Applicable
Norm1 sNaN2 Quiet(sNaN2) Yes Ignored
Norm1 qNaN2 Norm1 No Applicable

Table 5-22. Comparison Result for Opposite-Signed Zero Cases for MIN, MIN_ABS, and MAX, MAX_ABS
MIN and MIN_ABS MAX and MAX_ABS
Src1 Src2 Result Src1 Src2 Result
+0 -0 -0 +0 -0 +0
-0 +0 -0 -0 +0 +0

Table 5-23. Comparison Result of Equal-Magnitude Input Cases for MIN_ABS and MAX_ABS, (|a| = |b|, a>0, b<0)
MIN_ABS (|a| = |b|, a>0, b<0) MAX_ABS (|a| = |b|, a>0, b<0)
Src1 Src2 Result Src1 Src2 Result
a b b a b a
b a b b a a

Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);

Src1.exp := SRC1[62:52];
Src1.fraction := SRC1[51:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0)) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[62:52];
Src2.fraction := SRC2[51:0];
IF ((Src2.exp = 0) and (Src2.fraction !=0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;

IF (SRC2 = QNAN) THEN{TMP[63:0] := SRC1[63:0]}


ELSE IF(SRC1 = QNAN) THEN{TMP[63:0] := SRC2[63:0]}
ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[63:0] := from Table 5-22
ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[63:0] := from Table 5-23
ELSE
Case(CmpOpCtl[1:0])
00: TMP[63:0] := (SRC1[63:0] ≤ SRC2[63:0]) ? SRC1[63:0] : SRC2[63:0];
01: TMP[63:0] := (SRC1[63:0] ≤ SRC2[63:0]) ? SRC2[63:0] : SRC1[63:0];
10: TMP[63:0] := (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC1[63:0] : SRC2[63:0];
11: TMP[63:0] := (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC2[63:0] : SRC1[63:0];
ESAC;
FI;

Case(SignSelCtl[1:0])
00: dest := (SRC1[63] << 63) OR (TMP[62:0]);// Preserve Src1 sign bit
01: dest := TMP[63:0];// Preserve sign of compare result
10: dest := (0 << 63) OR (TMP[62:0]);// Zero out sign bit
11: dest := (1 << 63) OR (TMP[62:0]);// Set the sign bit
ESAC;
RETURN dest[63:0];
}

CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];

VRANGEPD (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := RangeDP (SRC1[i+63:i], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE DEST[i+63:i] := RangeDP (SRC1[i+63:i], SRC2[i+63:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±1023.
VRANGEPD zmm_dst, zmm_src, zmm_1023, 02h;

Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±1023 (this is SRC1).
zmm_1023 is the reference operand, contains the value of 1023 (and this is SRC2).
IMM8 = 02H (imm8[1:0] = 10b) selects the Min-Abs operation with selection of SRC1's sign.

If |zmm_src| < 1023 (i.e., SRC1 is smaller than 1023 in magnitude), its value is written into zmm_dst.
Otherwise, zmm_dst receives the value of 1023 (taken from zmm_1023, which is SRC2).
However, the sign control (imm8[3:2] = 00b) selects the sign of SRC1 received from zmm_src. So, even in the
case of |zmm_src| ≥ 1023, the selected sign of SRC1 is kept.
Thus, if zmm_src < -1023 the result of VRANGEPD is the minimal value of -1023, while if zmm_src > +1023 the
result of VRANGEPD is the maximal value of +1023.
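
The same bound check can be written with the intrinsics listed below; this is a minimal sketch (the function name
is illustrative; it assumes an AVX512DQ target and <immintrin.h>):

#include <immintrin.h>

/* Clamp each element of x to [-1023.0, +1023.0].
   imm8 = 02H: imm8[1:0] = 10b selects Min-Abs; imm8[3:2] = 00b keeps the sign of SRC1. */
static __m512d clamp_pm1023(__m512d x)
{
    return _mm512_range_pd(x, _mm512_set1_pd(1023.0), 0x02);
}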

Intel C/C++ Compiler Intrinsic Equivalent


VRANGEPD __m512d _mm512_range_pd ( __m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_range_round_pd ( __m512d a, __m512d b, int imm, int sae);
VRANGEPD __m512d _mm512_mask_range_pd (__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_mask_range_round_pd (__m512d s, __mmask8 k, __m512d a, __m512d b, int imm, int sae);
VRANGEPD __m512d _mm512_maskz_range_pd ( __mmask8 k, __m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_maskz_range_round_pd ( __mmask8 k, __m512d a, __m512d b, int imm, int sae);
VRANGEPD __m256d _mm256_range_pd ( __m256d a, __m256d b, int imm);
VRANGEPD __m256d _mm256_mask_range_pd (__m256d s, __mmask8 k, __m256d a, __m256d b, int imm);
VRANGEPD __m256d _mm256_maskz_range_pd ( __mmask8 k, __m256d a, __m256d b, int imm);
VRANGEPD __m128d _mm_range_pd ( __m128d a, __m128d b, int imm);
VRANGEPD __m128d _mm_mask_range_pd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VRANGEPD __m128d _mm_maskz_range_pd ( __mmask8 k, __m128d a, __m128d b, int imm);

SIMD Floating-Point Exceptions

Invalid, Denormal.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”

VRANGEPS—Range Restriction Calculation for Packed Pairs of Float32 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W0 50 /r ib A V/V (AVX512VL Calculate four RANGE operation output values from
VRANGEPS xmm1 {k1}{z}, xmm2, AND AVX512DQ) 4 pairs of single-precision floating-point values in
xmm3/m128/m32bcst, imm8 OR AVX10.11 xmm2 and xmm3/m128/m32bcst, store the results
to xmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.256.66.0F3A.W0 50 /r ib A V/V (AVX512VL Calculate eight RANGE operation output values from
VRANGEPS ymm1 {k1}{z}, ymm2, AND AVX512DQ) 8 pairs of single-precision floating-point values in
ymm3/m256/m32bcst, imm8 OR AVX10.11 ymm2 and ymm3/m256/m32bcst, store the results
to ymm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.512.66.0F3A.W0 50 /r ib A V/V AVX512DQ Calculate 16 RANGE operation output values from
VRANGEPS zmm1 {k1}{z}, zmm2, OR AVX10.11 16 pairs of single-precision floating-point values in
zmm3/m512/m32bcst{sae}, imm8 zmm2 and zmm3/m512/m32bcst, store the results
to zmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
This instruction calculates 4/8/16 range operation outputs from two sets of packed input single precision floating-
point values in the first source operand (the second operand) and the second source operand (the third operand).
The range outputs are written to the destination operand (the first operand) under the writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
behavior for cases in which one or more input values is a NaN is listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite signs, the MIN/MAX comparison in the range operation is slightly
different from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, for non-zero, equal-magnitude input values with opposite signs, the MIN_ABS and MAX_ABS
comparison operations produce the results listed in Table 5-23.

Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1=SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2=SNAN) THEN RETURN (QNAN(SRC2), set IE);

Src1.exp := SRC1[30:23];
Src1.fraction := SRC1[22:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0 )) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[30:23];
Src2.fraction := SRC2[22:0];
IF ((Src2.exp = 0 ) and (Src2.fraction != 0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;

IF (SRC2 = QNAN) THEN{TMP[31:0] := SRC1[31:0]}


ELSE IF(SRC1 = QNAN) THEN{TMP[31:0] := SRC2[31:0]}
ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[31:0] := from Table 5-22
ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[31:0] := from Table 5-23
ELSE
Case(CmpOpCtl[1:0])
00: TMP[31:0] := (SRC1[31:0] ≤ SRC2[31:0]) ? SRC1[31:0] : SRC2[31:0];
01: TMP[31:0] := (SRC1[31:0] ≤ SRC2[31:0]) ? SRC2[31:0] : SRC1[31:0];
10: TMP[31:0] := (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC1[31:0] : SRC2[31:0];
11: TMP[31:0] := (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC2[31:0] : SRC1[31:0];
ESAC;
FI;
Case(SignSelCtl[1:0])
00: dest := (SRC1[31] << 31) OR (TMP[30:0]);// Preserve Src1 sign bit
01: dest := TMP[31:0];// Preserve sign of compare result
10: dest := (0 << 31) OR (TMP[30:0]);// Zero out sign bit
11: dest := (1 << 31) OR (TMP[30:0]);// Set the sign bit
ESAC;
RETURN dest[31:0];
}

CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];

VRANGEPS
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := RangeSP (SRC1[i+31:i], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE DEST[i+31:i] := RangeSP (SRC1[i+31:i], SRC2[i+31:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±150.

VRANGEPS zmm_dst, zmm_src, zmm_150, 02h;

Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±150.
zmm_150 is the reference operand, contains the value of 150.
IMM8 = 02H (imm8[1:0] = 10b) selects the Min-Abs operation with selection of SRC1's sign.

If |zmm_src| < 150, its value is written into zmm_dst. Otherwise, zmm_dst receives the value of 150 (taken
from zmm_150).
However, the sign control (imm8[3:2] = 00b) selects the sign of SRC1 received from zmm_src. So, even in the
case of |zmm_src| ≥ 150, the selected sign of SRC1 is kept.
Thus, if zmm_src < -150 the result of VRANGEPS is the minimal value of -150, while if zmm_src > +150 the
result of VRANGEPS is the maximal value of +150.
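
The sign control can also force the sign of the result outright; combining Min-Abs with imm8[3:2] = 10b yields
min(|x|, 150) as a non-negative value. A minimal sketch (illustrative function name; assumes an AVX512DQ target
and <immintrin.h>):

#include <immintrin.h>

/* min(|x|, 150.0f) with the sign bit cleared:
   imm8[1:0] = 10b (Min-Abs), imm8[3:2] = 10b (set sign to 0), i.e., imm8 = 0AH. */
static __m512 clamp_abs_150(__m512 x)
{
    return _mm512_range_ps(x, _mm512_set1_ps(150.0f), 0x0A);
}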

Intel C/C++ Compiler Intrinsic Equivalent


VRANGEPS __m512 _mm512_range_ps ( __m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_range_round_ps ( __m512 a, __m512 b, int imm, int sae);
VRANGEPS __m512 _mm512_mask_range_ps (__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_mask_range_round_ps (__m512 s, __mmask16 k, __m512 a, __m512 b, int imm, int sae);
VRANGEPS __m512 _mm512_maskz_range_ps ( __mmask16 k, __m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_maskz_range_round_ps ( __mmask16 k, __m512 a, __m512 b, int imm, int sae);
VRANGEPS __m256 _mm256_range_ps ( __m256 a, __m256 b, int imm);
VRANGEPS __m256 _mm256_mask_range_ps (__m256 s, __mmask8 k, __m256 a, __m256 b, int imm);
VRANGEPS __m256 _mm256_maskz_range_ps ( __mmask8 k, __m256 a, __m256 b, int imm);
VRANGEPS __m128 _mm_range_ps ( __m128 a, __m128 b, int imm);
VRANGEPS __m128 _mm_mask_range_ps (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRANGEPS __m128 _mm_maskz_range_ps ( __mmask8 k, __m128 a, __m128 b, int imm);

SIMD Floating-Point Exceptions

Invalid, Denormal.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”

VRANGESD—Range Restriction Calculation From a Pair of Scalar Float64 Values
Opcode/ Op / 64/32 CPUID Description
Instruction En bit Mode Feature Flag
Support
EVEX.LLIG.66.0F3A.W1 51 /r A V/V AVX512DQ Calculate a RANGE operation output value from 2 double
VRANGESD xmm1 {k1}{z}, OR AVX10.11 precision floating-point values in xmm2 and xmm3/m64,
xmm2, xmm3/m64{sae}, imm8 store the output to xmm1 under writemask. Imm8
specifies the comparison and sign of the range operation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
This instruction calculates a range operation output from two input double precision floating-point values in the low
qword element of the first source operand (the second operand) and second source operand (the third operand).
The range output is written to the low qword element of the destination operand (the first operand) under the
writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
Bits 127:64 of the destination operand are copied from the corresponding bits of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
behavior for cases in which one or more input values is a NaN is listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite signs, the MIN/MAX comparison in the range operation is slightly
different from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, for non-zero, equal-magnitude input values with opposite signs, the MIN_ABS and MAX_ABS
comparison operations produce the results listed in Table 5-23.

Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);

Src1.exp := SRC1[62:52];
Src1.fraction := SRC1[51:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0)) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;

Src2.exp := SRC2[62:52];
Src2.fraction := SRC2[51:0];
IF ((Src2.exp = 0) and (Src2.fraction !=0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;

IF (SRC2 = QNAN) THEN{TMP[63:0] := SRC1[63:0]}


ELSE IF(SRC1 = QNAN) THEN{TMP[63:0] := SRC2[63:0]}
ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[63:0] := from Table 5-22
ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[63:0] := from Table 5-23
ELSE
Case(CmpOpCtl[1:0])
00: TMP[63:0] := (SRC1[63:0] ≤ SRC2[63:0]) ? SRC1[63:0] : SRC2[63:0];
01: TMP[63:0] := (SRC1[63:0] ≤ SRC2[63:0]) ? SRC2[63:0] : SRC1[63:0];
10: TMP[63:0] := (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC1[63:0] : SRC2[63:0];
11: TMP[63:0] := (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC2[63:0] : SRC1[63:0];
ESAC;
FI;

Case(SignSelCtl[1:0])
00: dest := (SRC1[63] << 63) OR (TMP[62:0]);// Preserve Src1 sign bit
01: dest := TMP[63:0];// Preserve sign of compare result
10: dest := (0 << 63) OR (TMP[62:0]);// Zero out sign bit
11: dest := (1 << 63) OR (TMP[62:0]);// Set the sign bit
ESAC;
RETURN dest[63:0];
}

CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];

VRANGESD
IF k1[0] OR *no writemask*
THEN DEST[63:0] := RangeDP (SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±1023.

VRANGESD xmm_dst, xmm_src, xmm_1023, 02h;

Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±1023.
xmm_1023 is the reference operand, contains the value of 1023.
IMM8 = 02H (imm8[1:0] = 10b) selects the Min-Abs operation with selection of SRC1's sign.

If |xmm_src| < 1023, its value is written into xmm_dst. Otherwise, xmm_dst receives the value of 1023 (taken
from xmm_1023).
However, the sign control (imm8[3:2] = 00b) selects the sign of SRC1 received from xmm_src. So, even in the
case of |xmm_src| ≥ 1023, the selected sign of SRC1 is kept.
Thus, if xmm_src < -1023 the result of VRANGESD is the minimal value of -1023, while if xmm_src > +1023 the
result of VRANGESD is the maximal value of +1023.

Intel C/C++ Compiler Intrinsic Equivalent


VRANGESD __m128d _mm_range_sd ( __m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_range_round_sd ( __m128d a, __m128d b, int imm, int sae);
VRANGESD __m128d _mm_mask_range_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_mask_range_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm, int sae);
VRANGESD __m128d _mm_maskz_range_sd ( __mmask8 k, __m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_maskz_range_round_sd ( __mmask8 k, __m128d a, __m128d b, int imm, int sae);

SIMD Floating-Point Exceptions

Invalid, Denormal.

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VRANGESS—Range Restriction Calculation From a Pair of Scalar Float32 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.66.0F3A.W0 51 /r A V/V AVX512DQ Calculate a RANGE operation output value from 2
VRANGESS xmm1 {k1}{z}, OR AVX10.11 single-precision floating-point values in xmm2 and
xmm2, xmm3/m32{sae}, imm8 xmm3/m32, store the output to xmm1 under
writemask. Imm8 specifies the comparison and sign of
the range operation.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) imm8

Description
This instruction calculates a range operation output from two input single precision floating-point values in the low
dword element of the first source operand (the second operand) and second source operand (the third operand).
The range output is written to the low dword element of the destination operand (the first operand) under the
writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
behavior for cases in which one or more input values is a NaN is listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite signs, the MIN/MAX comparison in the range operation is slightly
different from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, for non-zero, equal-magnitude input values with opposite signs, the MIN_ABS and MAX_ABS
comparison operations produce the results listed in Table 5-23.

Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1=SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2=SNAN) THEN RETURN (QNAN(SRC2), set IE);

Src1.exp := SRC1[30:23];
Src1.fraction := SRC1[22:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0 )) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[30:23];
Src2.fraction := SRC2[22:0];
IF ((Src2.exp = 0 ) and (Src2.fraction != 0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;

IF (SRC2 = QNAN) THEN{TMP[31:0] := SRC1[31:0]}


ELSE IF(SRC1 = QNAN) THEN{TMP[31:0] := SRC2[31:0]}
ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[31:0] := from Table 5-22
ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[31:0] := from Table 5-23
ELSE
Case(CmpOpCtl[1:0])
00: TMP[31:0] := (SRC1[31:0] ≤ SRC2[31:0]) ? SRC1[31:0] : SRC2[31:0];
01: TMP[31:0] := (SRC1[31:0] ≤ SRC2[31:0]) ? SRC2[31:0] : SRC1[31:0];
10: TMP[31:0] := (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC1[31:0] : SRC2[31:0];
11: TMP[31:0] := (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC2[31:0] : SRC1[31:0];
ESAC;
FI;
Case(SignSelCtl[1:0])
00: dest := (SRC1[31] << 31) OR (TMP[30:0]);// Preserve Src1 sign bit
01: dest := TMP[31:0];// Preserve sign of compare result
10: dest := (0 << 31) OR (TMP[30:0]);// Zero out sign bit
11: dest := (1 << 31) OR (TMP[30:0]);// Set the sign bit
ESAC;
RETURN dest[31:0];
}

CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];

VRANGESS
IF k1[0] OR *no writemask*
THEN DEST[31:0] := RangeSP (SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±150.

VRANGESS xmm_dst, xmm_src, xmm_150, 02h;

Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±150.
xmm_150 is the reference operand, contains the value of 150.
IMM8 = 02H (imm8[1:0] = 10b) selects the Min-Abs operation with selection of SRC1's sign.

If |xmm_src| < 150, its value is written into xmm_dst. Otherwise, xmm_dst receives the value of 150 (taken
from xmm_150).
However, the sign control (imm8[3:2] = 00b) selects the sign of SRC1 received from xmm_src. So, even in the
case of |xmm_src| ≥ 150, the selected sign of SRC1 is kept.
Thus, if xmm_src < -150 the result of VRANGESS is the minimal value of -150, while if xmm_src > +150 the
result of VRANGESS is the maximal value of +150.

Intel C/C++ Compiler Intrinsic Equivalent


VRANGESS __m128 _mm_range_ss ( __m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_range_round_ss ( __m128 a, __m128 b, int imm, int sae);
VRANGESS __m128 _mm_mask_range_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_mask_range_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm, int sae);
VRANGESS __m128 _mm_maskz_range_ss ( __mmask8 k, __m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_maskz_range_round_ss ( __mmask8 k, __m128 a, __m128 b, int imm, int sae);

SIMD Floating-Point Exceptions

Invalid, Denormal.

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VRCP14PD—Compute Approximate Reciprocals of Packed Float64 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PD xmm1 {k1}{z}, AVX512F) OR double precision floating-point values in
xmm2/m128/m64bcst AVX10.11 xmm2/m128/m64bcst and stores the results in xmm1.
Under writemask.
EVEX.256.66.0F38.W1 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PD ymm1 {k1}{z}, AVX512F) OR double precision floating-point values in
ymm2/m256/m64bcst AVX10.11 ymm2/m256/m64bcst and stores the results in ymm1.
Under writemask.
EVEX.512.66.0F38.W1 4C /r A V/V AVX512F Computes the approximate reciprocals of the packed
VRCP14PD zmm1 {k1}{z}, OR AVX10.11 double precision floating-point values in
zmm2/m512/m64bcst zmm2/m512/m64bcst and stores the results in zmm1.
Under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction performs a SIMD computation of the approximate reciprocals of eight/four/two packed double
precision floating-point values in the source operand (the second operand) and stores the packed double precision
floating-point results in the destination operand. The maximum relative error for this approximation is less than
2^-14.

The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-
bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask.
The VRCP14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.

Table 5-24. VRCP14PD/VRCP14SD Special Cases
Input value           Result value   Comments
0 ≤ X ≤ 2^-1024       INF            Very small denormal
-2^-1024 ≤ X ≤ -0     -INF           Very small denormal
X > 2^1022            Underflow      Up to 18 bits of fractions are returned*
X < -2^1022           -Underflow     Up to 18 bits of fractions are returned*
X = 2^-n              2^n
X = -2^-n             -2^n

* In this case, the mantissa is shifted right by one or two bits.

A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRCP14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := APPROXIMATE(1.0/SRC[63:0]);
ELSE DEST[i+63:i] := APPROXIMATE(1.0/SRC[i+63:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
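
Because the initial relative error is below 2^-14, each Newton-Raphson step x' = x*(2 - a*x) roughly squares it,
so two steps approach full double precision for normal, well-scaled inputs. The following sketch of this common
refinement is illustrative only (special inputs still behave per Table 5-24, and the result is not a substitute for
exact IEEE division):

#include <immintrin.h>

/* Approximate 1.0/a, refined by two Newton-Raphson steps: x' = x*(2 - a*x). */
static __m512d fast_recip_pd(__m512d a)
{
    __m512d x = _mm512_rcp14_pd(a);                    /* |rel. err| < 2^-14 */
    const __m512d two = _mm512_set1_pd(2.0);
    x = _mm512_mul_pd(x, _mm512_fnmadd_pd(a, x, two)); /* ~2^-28 */
    x = _mm512_mul_pd(x, _mm512_fnmadd_pd(a, x, two)); /* near full precision */
    return x;
}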

Intel C/C++ Compiler Intrinsic Equivalent


VRCP14PD __m512d _mm512_rcp14_pd( __m512d a);
VRCP14PD __m512d _mm512_mask_rcp14_pd(__m512d s, __mmask8 k, __m512d a);
VRCP14PD __m512d _mm512_maskz_rcp14_pd( __mmask8 k, __m512d a);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VRCP14PS—Compute Approximate Reciprocals of Packed Float32 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PS xmm1 {k1}{z}, AVX512F) OR single-precision floating-point values in
xmm2/m128/m32bcst AVX10.11 xmm2/m128/m32bcst and stores the results in xmm1.
Under writemask.
EVEX.256.66.0F38.W0 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PS ymm1 {k1}{z}, AVX512F) OR single-precision floating-point values in
ymm2/m256/m32bcst AVX10.11 ymm2/m256/m32bcst and stores the results in ymm1.
Under writemask.
EVEX.512.66.0F38.W0 4C /r A V/V AVX512F Computes the approximate reciprocals of the packed
VRCP14PS zmm1 {k1}{z}, OR AVX10.11 single-precision floating-point values in
zmm2/m512/m32bcst zmm2/m512/m32bcst and stores the results in zmm1.
Under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction performs a SIMD computation of the approximate reciprocals of the packed single precision
floating-point values in the source operand (the second operand) and stores the packed single precision floating-
point results in the destination operand (the first operand). The maximum relative error for this approximation is
less than 2^-14.
The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-
bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask.
The VRCP14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.

Table 5-25. VRCP14PS/VRCP14SS Special Cases
Input value          Result value   Comments
0 ≤ X ≤ 2^-128       INF            Very small denormal
-2^-128 ≤ X ≤ -0     -INF           Very small denormal
X > 2^126            Underflow      Up to 18 bits of fractions are returned(1)
X < -2^126           -Underflow     Up to 18 bits of fractions are returned(1)
X = 2^-n             2^n
X = -2^-n            -2^n
NOTES:
1. In this case, the mantissa is shifted right by one or two bits.

A numerically exact implementation of VRCP14xx can be found at:
https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRCP14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := APPROXIMATE(1.0/SRC[31:0]);
ELSE DEST[i+31:i] := APPROXIMATE(1.0/SRC[i+31:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
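
For single precision, one Newton-Raphson step x' = x*(2 - a*x) already reduces the 2^-14 initial error to roughly
2^-28, below the 2^-24 resolution of the format for normal inputs. A minimal sketch under the same caveats as
for VRCP14PD:

#include <immintrin.h>

/* Approximate 1.0f/a, refined by one Newton-Raphson step. */
static __m512 fast_recip_ps(__m512 a)
{
    __m512 x = _mm512_rcp14_ps(a);
    return _mm512_mul_ps(x, _mm512_fnmadd_ps(a, x, _mm512_set1_ps(2.0f)));
}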

Intel C/C++ Compiler Intrinsic Equivalent


VRCP14PS __m512 _mm512_rcp14_ps( __m512 a);
VRCP14PS __m512 _mm512_mask_rcp14_ps(__m512 s, __mmask16 k, __m512 a);
VRCP14PS __m512 _mm512_maskz_rcp14_ps( __mmask16 k, __m512 a);
VRCP14PS __m256 _mm256_rcp14_ps( __m256 a);
VRCP14PS __m256 _mm256_mask_rcp14_ps(__m256 s, __mmask8 k, __m256 a);
VRCP14PS __m256 _mm256_maskz_rcp14_ps( __mmask8 k, __m256 a);
VRCP14PS __m128 _mm_rcp14_ps( __m128 a);
VRCP14PS __m128 _mm_mask_rcp14_ps(__m128 s, __mmask8 k, __m128 a);
VRCP14PS __m128 _mm_maskz_rcp14_ps( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VRCP14SD—Compute Approximate Reciprocal of Scalar Float64 Value
Opcode/Instruction: EVEX.LLIG.66.0F38.W1 4D /r
  VRCP14SD xmm1 {k1}{z}, xmm2, xmm3/m64
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
Description: Computes the approximate reciprocal of the scalar double precision floating-point value in xmm3/m64 and stores the result in xmm1 using writemask k1. Also, the upper double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD computation of the approximate reciprocal of the low double precision floating-
point value in the second source operand (the third operand) and stores the result in the low quadword element of
the destination operand (the first operand) according to the writemask k1. Bits (127:64) of the XMM register
destination are copied from the corresponding bits in the first source operand (the second operand). The maximum
relative error for this approximation is less than 2^(-14). The source operand can be an XMM register or a 64-bit
memory location. The destination operand is an XMM register.
The VRCP14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value is treated as zero only if the
DAZ bit is set in MXCSR; otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to zero
only if the FTZ bit is set in MXCSR; otherwise the correct underflow result is written, with the sign of the operand.
When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned. See
Table 5-24 for special-case input values.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRCP14xx can be found at:
https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2

Operation
VRCP14SD (EVEX version)
IF k1[0] OR *no writemask*
THEN DEST[63:0] := APPROXIMATE(1.0/SRC2[63:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VRCP14SD __m128d _mm_rcp14_sd( __m128d a, __m128d b);
VRCP14SD __m128d _mm_mask_rcp14_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VRCP14SD __m128d _mm_maskz_rcp14_sd( __mmask8 k, __m128d a, __m128d b);
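As a non-normative usage sketch (assuming an AVX-512F target; the function name is illustrative), the scalar intrinsic computes the approximation for the low element only, with the upper element supplied by the first source:

#include <immintrin.h>

/* Approximate 1/x for one double; |relative error| < 2^(-14). */
double approx_recip(double x) {
    __m128d v = _mm_set_sd(x);          /* low = x, high = 0.0 */
    __m128d r = _mm_rcp14_sd(v, v);     /* low ~= 1/x, high copied from first source */
    return _mm_cvtsd_f64(r);
}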

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”



VRCP14SS—Compute Approximate Reciprocal of Scalar Float32 Value
Opcode/Instruction: EVEX.LLIG.66.0F38.W0 4D /r
  VRCP14SS xmm1 {k1}{z}, xmm2, xmm3/m32
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
Description: Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm3/m32 and stores the result in xmm1 using writemask k1. Also, the upper single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD computation of the approximate reciprocal of the low single precision floating-
point value in the second source operand (the third operand) and stores the result in the low doubleword element
of the destination operand (the first operand) according to the writemask k1. Bits (127:32) of the XMM register
destination are copied from the corresponding bits in the first source operand (the second operand). The maximum
relative error for this approximation is less than 2^(-14). The source operand can be an XMM register or a 32-bit
memory location. The destination operand is an XMM register.
The VRCP14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value is treated as zero only if the
DAZ bit is set in MXCSR; otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to zero
only if the FTZ bit is set in MXCSR; otherwise the correct underflow result is written, with the sign of the operand.
When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned. See
Table 5-25 for special-case input values.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRCP14xx can be found at:
https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2

Operation
VRCP14SS (EVEX version)
IF k1[0] OR *no writemask*
THEN DEST[31:0] := APPROXIMATE(1.0/SRC2[31:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VRCP14SS __m128 _mm_rcp14_ss( __m128 a, __m128 b);
VRCP14SS __m128 _mm_mask_rcp14_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VRCP14SS __m128 _mm_maskz_rcp14_ss( __mmask8 k, __m128 a, __m128 b);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”



VRCPPH—Compute Reciprocals of Packed FP16 Values
Opcode/Instruction: EVEX.128.66.MAP6.W0 4C /r
  VRCPPH xmm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Compute the approximate reciprocals of packed FP16 values in xmm2/m128/m16bcst and store the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.MAP6.W0 4C /r
  VRCPPH ymm1{k1}{z}, ymm2/m256/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Compute the approximate reciprocals of packed FP16 values in ymm2/m256/m16bcst and store the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.MAP6.W0 4C /r
  VRCPPH zmm1{k1}{z}, zmm2/m512/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
Description: Compute the approximate reciprocals of packed FP16 values in zmm2/m512/m16bcst and store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction performs a SIMD computation of the approximate reciprocals of 8/16/32 packed FP16 values in the
source operand (the second operand) and stores the packed FP16 results in the destination operand. The maximum
relative error for this approximation is less than 2^(-11) + 2^(-14).
For special cases, see Table 5-26.

Table 5-26. VRCPPH/VRCPSH Special Cases

Input Value               Result Value    Comments
0 ≤ X ≤ 2^(-16)           INF             Very small denormal
-2^(-16) ≤ X ≤ -0         -INF            Very small denormal
X = +∞                    +0
X = -∞                    -0
X = 2^(-n)                2^n
X = -2^(-n)               -2^n



Operation
VRCPPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := APPROXIMATE(1.0 / tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRCPPH __m128h _mm_mask_rcp_ph (__m128h src, __mmask8 k, __m128h a);
VRCPPH __m128h _mm_maskz_rcp_ph (__mmask8 k, __m128h a);
VRCPPH __m128h _mm_rcp_ph (__m128h a);
VRCPPH __m256h _mm256_mask_rcp_ph (__m256h src, __mmask16 k, __m256h a);
VRCPPH __m256h _mm256_maskz_rcp_ph (__mmask16 k, __m256h a);
VRCPPH __m256h _mm256_rcp_ph (__m256h a);
VRCPPH __m512h _mm512_mask_rcp_ph (__m512h src, __mmask32 k, __m512h a);
VRCPPH __m512h _mm512_maskz_rcp_ph (__mmask32 k, __m512h a);
VRCPPH __m512h _mm512_rcp_ph (__m512h a);
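As a non-normative usage sketch, assuming a toolchain that provides the _Float16 type and AVX512-FP16 intrinsics (e.g., compiled with -mavx512fp16):

#include <immintrin.h>

/* Approximate reciprocals of 32 packed FP16 values;
   |relative error| < 2^(-11) + 2^(-14). */
void recip_fp16(const _Float16 *src, _Float16 *dst) {
    __m512h v = _mm512_loadu_ph(src);
    _mm512_storeu_ph(dst, _mm512_rcp_ph(v));
}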

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VRCPSH—Compute Reciprocal of Scalar FP16 Value
Opcode/Instruction: EVEX.LLIG.66.MAP6.W0 4D /r
  VRCPSH xmm1{k1}{z}, xmm2, xmm3/m16
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
Description: Compute the approximate reciprocal of the low FP16 value in xmm3/m16 and store the result in xmm1 subject to writemask k1. Bits 127:16 from xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a SIMD computation of the approximate reciprocal of the low FP16 value in the second
source operand (the third operand) and stores the result in the low word element of the destination operand (the
first operand) according to the writemask k1. The maximum relative error for this approximation is less than
2^(-11) + 2^(-14).
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
For special cases, see Table 5-26.

Operation
VRCPSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
DEST.fp16[0] := APPROXIMATE(1.0 / src2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRCPSH __m128h _mm_mask_rcp_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VRCPSH __m128h _mm_maskz_rcp_sh (__mmask8 k, __m128h a, __m128h b);
VRCPSH __m128h _mm_rcp_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”



VREDUCEPD—Perform Reduction Transformation on Packed Float64 Values
Opcode/Instruction: EVEX.128.66.0F3A.W1 56 /r ib
  VREDUCEPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1¹.
Description: Perform reduction transformation on packed double precision floating-point values in xmm2/m128/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W1 56 /r ib
  VREDUCEPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1¹.
Description: Perform reduction transformation on packed double precision floating-point values in ymm2/m256/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register under writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W1 56 /r ib
  VREDUCEPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
Description: Perform reduction transformation on packed double precision floating-point values in zmm2/m512/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Perform reduction transformation of the packed binary encoded double precision floating-point values in the source
operand (the second operand) and store the reduced results in binary floating-point format to the destination
operand (the first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the
reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary floating-point numbers with normalized significand
and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the
normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.



imm8[7:4]  Fixed point length: number of fraction bits to subtract.
imm8[3]    Suppress Precision Exception (SPE): 0b = use MXCSR exception mask; 1b = suppress.
imm8[2]    Round Select (RS): 0b = use imm8[1:0]; 1b = use MXCSR.RC.
imm8[1:0]  Round Control Override: 00b = round nearest even; 01b = round down; 10b = round up; 11b = truncate.

Figure 5-28. Imm8 Controls for VREDUCEPD/SD/PS/SS
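As a non-normative illustration, a small C helper (hypothetical; the names are not defined by this manual) can pack the imm8 fields of Figure 5-28:

/* Hypothetical helper that packs the VREDUCE* imm8 fields of Figure 5-28:
   M (fraction bits to subtract) in bits 7:4, SPE in bit 3, RS in bit 2,
   and the rounding override in bits 1:0. */
enum { RC_RNE = 0, RC_DOWN = 1, RC_UP = 2, RC_TRUNC = 3 };

static inline int reduce_imm8(int m, int spe, int use_mxcsr_rc, int rc) {
    return ((m & 0xF) << 4) | ((spe & 1) << 3) | ((use_mxcsr_rc & 1) << 2) | (rc & 3);
}

For example, reduce_imm8(4, 1, 0, RC_RNE) yields 0x48: subtract four fraction bits, suppress the precision exception, and take round-to-nearest-even from imm8[1:0].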

Handling of special cases of input values is listed in Table 5-27.

Table 5-27. VREDUCEPD/SD/PS/SS Special Cases

Input value                   Round Mode        Returned value
|Src1| < 2^(-M-1)             RNE               Src1
|Src1| < 2^(-M)               RPI, Src1 > 0     Round(Src1 - 2^(-M)) *
                              RPI, Src1 ≤ 0     Src1
                              RNI, Src1 ≥ 0     Src1
                              RNI, Src1 < 0     Round(Src1 + 2^(-M)) *
Src1 = ±0, or                 NOT RNI           +0.0
Dest = ±0 (Src1 ≠ INF)        RNI               -0.0
Src1 = ±INF                   any               +0.0
Src1 = ±NAN                   n/a               QNaN(Src1)

* Round control = imm8[2] ? MXCSR.RC : imm8[1:0]
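As a non-normative illustration, the arithmetic (round-to-nearest-even only, ignoring the special cases above and the SPE/RS controls) can be modeled for one double in plain C:

#include <math.h>
#include <stdio.h>

/* Model of dest = src - ROUND(2^M * src) * 2^(-M) for one double, using the
   default (round-to-nearest-even) environment; overflow is assumed not to occur. */
static double reduce_rne(double src, int m) {
    double rounded = nearbyint(ldexp(src, m));   /* ROUND(2^M * src) under RNE */
    return src - ldexp(rounded, -m);             /* drop integer part + M fraction bits */
}

int main(void) {
    /* M = 4: 1.3 * 16 = 20.8 rounds to 21; 21/16 = 1.3125; 1.3 - 1.3125 = -0.0125. */
    printf("%g\n", reduce_rne(1.3, 4));          /* prints -0.0125 */
    return 0;
}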

Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
// Check for NaN
IF (SRC[63:0] = NAN) THEN
RETURN (Convert SRC[63:0] to QNaN); FI;
M := imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
RC := imm8[1:0]; // Round Control for ROUND() operation
RC_source := imm8[2];
SPE := imm8[3]; // Suppress Precision Exception
TMP[63:0] := 2^(-M) * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
TMP[63:0] := SRC[63:0] - TMP[63:0]; // subtraction under the same RC, SPE controls
RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}



VREDUCEPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := ReduceArgumentDP(SRC[63:0], imm8[7:0]);
ELSE DEST[i+63:i] := ReduceArgumentDP(SRC[i+63:i], imm8[7:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VREDUCEPD __m512d _mm512_reduce_pd( __m512d a, int imm, int sae)
VREDUCEPD __m512d _mm512_mask_reduce_pd(__m512d s, __mmask8 k, __m512d a, int imm, int sae)
VREDUCEPD __m512d _mm512_maskz_reduce_pd(__mmask8 k, __m512d a, int imm, int sae)
VREDUCEPD __m256d _mm256_reduce_pd( __m256d a, int imm)
VREDUCEPD __m256d _mm256_mask_reduce_pd(__m256d s, __mmask8 k, __m256d a, int imm)
VREDUCEPD __m256d _mm256_maskz_reduce_pd(__mmask8 k, __m256d a, int imm)
VREDUCEPD __m128d _mm_reduce_pd( __m128d a, int imm)
VREDUCEPD __m128d _mm_mask_reduce_pd(__m128d s, __mmask8 k, __m128d a, int imm)
VREDUCEPD __m128d _mm_maskz_reduce_pd(__mmask8 k, __m128d a, int imm)
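As a non-normative usage sketch (assuming AVX512VL and AVX512DQ support and the unmasked _mm256_reduce_pd form listed above), imm8 = 0x40 selects M = 4 with round-to-nearest-even, so each element v is split into a multiple of 2^(-4) plus the residue the instruction returns:

#include <immintrin.h>

/* Returns r such that v - r is v rounded (RNE) to a multiple of 1/16;
   imm8 = 0x40 encodes M = 4 with round-to-nearest-even from imm8[1:0]. */
__m256d residue_sixteenths(__m256d v) {
    return _mm256_reduce_pd(v, 0x40);
}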

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.



VREDUCEPH—Perform Reduction Transformation on Packed FP16 Values
Opcode/Instruction: EVEX.128.NP.0F3A.W0 56 /r /ib
  VREDUCEPH xmm1{k1}{z}, xmm2/m128/m16bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Perform reduction transformation on packed FP16 values in xmm2/m128/m16bcst by subtracting a number of fraction bits specified by the imm8 field. Store the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.NP.0F3A.W0 56 /r /ib
  VREDUCEPH ymm1{k1}{z}, ymm2/m256/m16bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Perform reduction transformation on packed FP16 values in ymm2/m256/m16bcst by subtracting a number of fraction bits specified by the imm8 field. Store the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.NP.0F3A.W0 56 /r /ib
  VREDUCEPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
Description: Perform reduction transformation on packed FP16 values in zmm2/m512/m16bcst by subtracting a number of fraction bits specified by the imm8 field. Store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 (r) N/A

Description
This instruction performs a reduction transformation of the packed binary encoded FP16 values in the source
operand (the second operand) and store the reduced results in binary FP format to the destination operand (the
first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source
value, where M is an unsigned integer specified by imm8[7:4]. Specifically, the reduction transformation can be
expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M)
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significand and biased
exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the
normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |ReducedResult| ≤ 2^(-M-1).
Then if RC ≠ RNE: 0 ≤ |ReducedResult| < 2^(-M).
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
This instruction may generate a tiny non-zero result. If it does so, it does not report an underflow exception, even
if underflow exceptions are unmasked (UM flag in MXCSR register is 0).
For special cases, see Table 5-28.



Table 5-28. VREDUCEPH/VREDUCESH Special Cases

Input value                   Round Mode        Returned Value
|Src1| < 2^(-M-1)             RNE               Src1
|Src1| < 2^(-M)               RU, Src1 > 0      Round(Src1 - 2^(-M))¹
                              RU, Src1 ≤ 0      Src1
                              RD, Src1 ≥ 0      Src1
                              RD, Src1 < 0      Round(Src1 + 2^(-M))
Src1 = ±0, or                 NOT RD            +0.0
Dest = ±0 (Src1 ≠ ∞)          RD                -0.0
Src1 = ±∞                     Any               +0.0
Src1 = ±NAN                   Any               QNaN (Src1)
NOTES:
1. The Round(.) function uses rounding controls specified by (imm8[2] ? MXCSR.RC : imm8[1:0]).

Operation
def reduce_fp16(src, imm8):
nan := (src.exp = 0x1F) and (src.fraction != 0)
if nan:
return QNAN(src)
m := imm8[7:4]
rc := imm8[1:0]
rc_source := imm8[2]
spe := imm8[3] // suppress precision exception
tmp := 2^(-m) * ROUND(2^m * src, spe, rc_source, rc)
tmp := src - tmp // using same RC, SPE controls
return tmp

VREDUCEPH dest{k1}, src, imm8


VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := reduce_fp16(tsrc, imm8)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VREDUCEPH __m128h _mm_mask_reduce_ph (__m128h src, __mmask8 k, __m128h a, int imm8);
VREDUCEPH __m128h _mm_maskz_reduce_ph (__mmask8 k, __m128h a, int imm8);
VREDUCEPH __m128h _mm_reduce_ph (__m128h a, int imm8);
VREDUCEPH __m256h _mm256_mask_reduce_ph (__m256h src, __mmask16 k, __m256h a, int imm8);
VREDUCEPH __m256h _mm256_maskz_reduce_ph (__mmask16 k, __m256h a, int imm8);
VREDUCEPH __m256h _mm256_reduce_ph (__m256h a, int imm8);
VREDUCEPH __m512h _mm512_mask_reduce_ph (__m512h src, __mmask32 k, __m512h a, int imm8);
VREDUCEPH __m512h _mm512_maskz_reduce_ph (__mmask32 k, __m512h a, int imm8);
VREDUCEPH __m512h _mm512_reduce_ph (__m512h a, int imm8);
VREDUCEPH __m512h _mm512_mask_reduce_round_ph (__m512h src, __mmask32 k, __m512h a, int imm8, const int sae);
VREDUCEPH __m512h _mm512_maskz_reduce_round_ph (__mmask32 k, __m512h a, int imm8, const int sae);
VREDUCEPH __m512h _mm512_reduce_round_ph (__m512h a, int imm8, const int sae);

SIMD Floating-Point Exceptions


Invalid, Precision.

Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



VREDUCEPS—Perform Reduction Transformation on Packed Float32 Values
Opcode/Instruction: EVEX.128.66.0F3A.W0 56 /r ib
  VREDUCEPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1¹.
Description: Perform reduction transformation on packed single-precision floating-point values in xmm2/m128/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W0 56 /r ib
  VREDUCEPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1¹.
Description: Perform reduction transformation on packed single-precision floating-point values in ymm2/m256/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register under writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W0 56 /r ib
  VREDUCEPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
Description: Perform reduction transformation on packed single-precision floating-point values in zmm2/m512/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Perform reduction transformation of the packed binary encoded single precision floating-point values in the source
operand (the second operand) and store the reduced results in binary floating-point format to the destination
operand (the first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the
reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary floating-point numbers with normalized significand
and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the
normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Handling of special cases of input values is listed in Table 5-27.



Operation
ReduceArgumentSP(SRC[31:0], imm8[7:0])
{
// Check for NaN
IF (SRC[31:0] = NAN) THEN
RETURN (Convert SRC[31:0] to QNaN); FI
M := imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
RC := imm8[1:0]; // Round Control for ROUND() operation
RC_source := imm8[2];
SPE := imm8[3]; // Suppress Precision Exception
TMP[31:0] := 2^(-M) * {ROUND(2^M * SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
TMP[31:0] := SRC[31:0] - TMP[31:0]; // subtraction under the same RC, SPE controls
RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand
}

VREDUCEPS
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := ReduceArgumentSP(SRC[31:0], imm8[7:0]);
ELSE DEST[i+31:i] := ReduceArgumentSP(SRC[i+31:i], imm8[7:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VREDUCEPS __m512 _mm512_reduce_ps( __m512 a, int imm, int sae)
VREDUCEPS __m512 _mm512_mask_reduce_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae)
VREDUCEPS __m512 _mm512_maskz_reduce_ps(__mmask16 k, __m512 a, int imm, int sae)
VREDUCEPS __m256 _mm256_reduce_ps( __m256 a, int imm)
VREDUCEPS __m256 _mm256_mask_reduce_ps(__m256 s, __mmask8 k, __m256 a, int imm)
VREDUCEPS __m256 _mm256_maskz_reduce_ps(__mmask8 k, __m256 a, int imm)
VREDUCEPS __m128 _mm_reduce_ps( __m128 a, int imm)
VREDUCEPS __m128 _mm_mask_reduce_ps(__m128 s, __mmask8 k, __m128 a, int imm)
VREDUCEPS __m128 _mm_maskz_reduce_ps(__mmask8 k, __m128 a, int imm)
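As a non-normative usage sketch (assuming AVX512VL and AVX512DQ support and the unmasked _mm_reduce_ps form listed above), imm8 = 0 selects M = 0 with round-to-nearest-even, so the instruction returns src - ROUND(src), the signed distance from each element to its nearest integer:

#include <immintrin.h>

/* imm8 = 0: M = 0, RNE taken from imm8[1:0], precision exceptions not suppressed.
   Each lane becomes src - ROUND(src); e.g., 2.7f -> about -0.3f. */
__m128 distance_to_nearest_int(__m128 v) {
    return _mm_reduce_ps(v, 0);
}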

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions”; additionally:
#UD If EVEX.vvvv != 1111B.



VREDUCESD—Perform a Reduction Transformation on a Scalar Float64 Value
Opcode/Instruction: EVEX.LLIG.66.0F3A.W1 57 /r ib
  VREDUCESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
Description: Perform a reduction transformation on the scalar double precision floating-point value in xmm3/m64 by subtracting a number of fraction bits specified by the imm8 field. Also, the upper double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64]. Stores the result in xmm1 register.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Perform a reduction transformation of the binary encoded double precision floating-point value in the low qword
element of the second source operand (the third operand) and store the reduced result in binary floating-point
format to the low qword element of the destination operand (the first operand) under the writemask k1. Bits
127:64 of the destination operand are copied from respective qword elements of the first source operand (the
second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the
reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary floating-point numbers with normalized significand
and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the
normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
The operation is write masked.
Handling of special cases of input values is listed in Table 5-27.



Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
// Check for NaN
IF (SRC[63:0] = NAN) THEN
RETURN (Convert SRC[63:0] to QNaN); FI;
M := imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
RC := imm8[1:0]; // Round Control for ROUND() operation
RC_source := imm8[2];
SPE := imm8[3]; // Suppress Precision Exception
TMP[63:0] := 2^(-M) * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
TMP[63:0] := SRC[63:0] - TMP[63:0]; // subtraction under the same RC, SPE controls
RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}

VREDUCESD
IF k1[0] or *no writemask*
THEN DEST[63:0] := ReduceArgumentDP(SRC2[63:0], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VREDUCESD __m128d _mm_reduce_sd( __m128d a, __m128d b, int imm, int sae)
VREDUCESD __m128d _mm_mask_reduce_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm, int sae)
VREDUCESD __m128d _mm_maskz_reduce_sd(__mmask8 k, __m128d a, __m128d b, int imm, int sae)

SIMD Floating-Point Exceptions


Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VREDUCESH—Perform Reduction Transformation on Scalar FP16 Value
Opcode/Instruction: EVEX.LLIG.NP.0F3A.W0 57 /r /ib
  VREDUCESH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
Description: Perform a reduction transformation on the low binary encoded FP16 value in xmm3/m16 by subtracting a number of fraction bits specified by the imm8 field. Store the result in xmm1 subject to writemask k1. Bits 127:16 from xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) imm8 (r)

Description
This instruction performs a reduction transformation of the low binary encoded FP16 value in the source operand
(the second operand) and stores the reduced result in binary FP format to the low element of the destination
operand (the first operand) under the writemask k1. For further details see the description of VREDUCEPH.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
This instruction may generate a tiny non-zero result. If it does so, it does not report an underflow exception, even
if underflow exceptions are unmasked (UM flag in MXCSR register is 0).
For special cases, see Table 5-28.

Operation
VREDUCESH dest{k1}, src, imm8
IF k1[0] or *no writemask*:
dest.fp16[0] := reduce_fp16(src2.fp16[0], imm8) // see VREDUCEPH
ELSE IF *zeroing*:
dest.fp16[0] := 0
//else dest.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VREDUCESH __m128h _mm_mask_reduce_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int imm8, const int sae);
VREDUCESH __m128h _mm_maskz_reduce_round_sh (__mmask8 k, __m128h a, __m128h b, int imm8, const int sae);
VREDUCESH __m128h _mm_reduce_round_sh (__m128h a, __m128h b, int imm8, const int sae);
VREDUCESH __m128h _mm_mask_reduce_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int imm8);
VREDUCESH __m128h _mm_maskz_reduce_sh (__mmask8 k, __m128h a, __m128h b, int imm8);
VREDUCESH __m128h _mm_reduce_sh (__m128h a, __m128h b, int imm8);



SIMD Floating-Point Exceptions
Invalid, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VREDUCESS—Perform a Reduction Transformation on a Scalar Float32 Value
Opcode/Instruction: EVEX.LLIG.66.0F3A.W0 57 /r /ib
  VREDUCESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1¹.
Description: Perform a reduction transformation on the scalar single-precision floating-point value in xmm3/m32 by subtracting a number of fraction bits specified by the imm8 field. Also, the upper single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32]. Stores the result in xmm1 register.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Perform a reduction transformation of the binary encoded single precision floating-point value in the low dword
element of the second source operand (the third operand) and store the reduced result in binary floating-point
format to the low dword element of the destination operand (the first operand) under the writemask k1. Bits
127:32 of the destination operand are copied from respective dword elements of the first source operand (the
second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28. Specifically, the
reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^(-M);
where ROUND() treats src, 2^M, and their product as binary floating-point numbers with normalized significand
and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the
normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, if SPE is set (i.e., Suppress Precision
Exception, imm8[3] = 1), no precision exception is reported.
Handling of special cases of input values is listed in Table 5-27.



Operation
ReduceArgumentSP(SRC[31:0], imm8[7:0])
{
// Check for NaN
IF (SRC[31:0] = NAN) THEN
RETURN (Convert SRC[31:0] to QNaN); FI
M := imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
RC := imm8[1:0]; // Round Control for ROUND() operation
RC_source := imm8[2];
SPE := imm8[3]; // Suppress Precision Exception
TMP[31:0] := 2^(-M) * {ROUND(2^M * SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
TMP[31:0] := SRC[31:0] - TMP[31:0]; // subtraction under the same RC, SPE controls
RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand
}

VREDUCESS
IF k1[0] or *no writemask*
THEN DEST[31:0] := ReduceArgumentSP(SRC2[31:0], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VREDUCESS __m128 _mm_reduce_ss( __m128 a, __m128 b, int imm, int sae)
VREDUCESS __m128 _mm_mask_reduce_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm, int sae)
VREDUCESS __m128 _mm_maskz_reduce_ss(__mmask8 k, __m128 a, __m128 b, int imm, int sae)

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VRNDSCALEPD—Round Packed Float64 Values to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.128.66.0F3A.W1 09 /r ib
  VRNDSCALEPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
Description: Rounds packed double precision floating-point values in xmm2/m128/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W1 09 /r ib
  VRNDSCALEPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
Description: Rounds packed double precision floating-point values in ymm2/m256/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register under writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W1 09 /r ib
  VRNDSCALEPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
Description: Rounds packed double precision floating-point values in zmm2/m512/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Round the double precision floating-point values in the source operand by the rounding mode specified in the
immediate operand (see Figure 5-29) and places the result in the destination operand.
The destination operand (the first operand) is a ZMM/YMM/XMM register conditionally updated according to the
writemask. The source operand (the second operand) can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.
The rounding process rounds the input to an integral value, plus a number of fraction bits specified by imm8[7:4]
(to be included in the result), and returns the result as a double precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4]
value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined, as
shown in Figure 5-29 below. Bit 3 of the immediate byte controls the processor behavior for a precision exception,
bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (Figure
5-29 lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to 1, then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
The formula of the operation on each data element for VRNDSCALEPD is
ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];
The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).

VRNDSCALEPD is a more general form of the VEX-encoded VROUNDPD instruction. In VROUNDPD, the formula of
the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];

Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

imm8[7:4]  Fixed point length: number of fraction bits to preserve.
imm8[3]    Suppress Precision Exception (SPE): 0b = use MXCSR exception mask; 1b = suppress.
imm8[2]    Round Select (RS): 0b = use imm8[1:0]; 1b = use MXCSR.RC.
imm8[1:0]  Round Control Override: 00b = round nearest even; 01b = round down; 10b = round up; 11b = truncate.

Figure 5-29. Imm8 Controls for VRNDSCALEPD/SD/PS/SS

Handling of special cases of input values is listed in Table 5-29.

Table 5-29. VRNDSCALEPD/SD/PS/SS Special Cases

Input value     Returned value
Src1 = ±inf     Src1
Src1 = ±NAN     Src1 converted to QNAN
Src1 = ±0       Src1

Operation
RoundToIntegerDP(SRC[63:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR.RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor

case (rounding_direction)
00: TMP[63:0] := round_to_nearest_even_integer(2^M * SRC[63:0])
01: TMP[63:0] := round_to_equal_or_smaller_integer(2^M * SRC[63:0])
10: TMP[63:0] := round_to_equal_or_larger_integer(2^M * SRC[63:0])
11: TMP[63:0] := round_to_nearest_smallest_magnitude_integer(2^M * SRC[63:0])
ESAC

Dest[63:0] := 2^(-M) * TMP[63:0] ; scale back down by 2^(-M)

if (imm8[3] = 0) Then ; check SPE
if (SRC[63:0] != Dest[63:0]) Then ; check precision lost
set_precision() ; set #PE
FI;
FI;
return(Dest[63:0])
}

VRNDSCALEPD (EVEX encoded versions)


(KL, VL) = (2, 128), (4, 256), (8, 512)
IF *src is a memory operand*
THEN TMP_SRC := BROADCAST64(SRC, VL, k1)
ELSE TMP_SRC := SRC
FI;

FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := RoundToIntegerDP(TMP_SRC[i+63:i], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRNDSCALEPD __m512d _mm512_roundscale_pd( __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_roundscale_round_pd( __m512d a, int imm, int sae);
VRNDSCALEPD __m512d _mm512_mask_roundscale_pd(__m512d s, __mmask8 k, __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_mask_roundscale_round_pd(__m512d s, __mmask8 k, __m512d a, int imm, int sae);
VRNDSCALEPD __m512d _mm512_maskz_roundscale_pd( __mmask8 k, __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_maskz_roundscale_round_pd( __mmask8 k, __m512d a, int imm, int sae);
VRNDSCALEPD __m256d _mm256_roundscale_pd( __m256d a, int imm);
VRNDSCALEPD __m256d _mm256_mask_roundscale_pd(__m256d s, __mmask8 k, __m256d a, int imm);
VRNDSCALEPD __m256d _mm256_maskz_roundscale_pd( __mmask8 k, __m256d a, int imm);
VRNDSCALEPD __m128d _mm_roundscale_pd( __m128d a, int imm);
VRNDSCALEPD __m128d _mm_mask_roundscale_pd(__m128d s, __mmask8 k, __m128d a, int imm);
VRNDSCALEPD __m128d _mm_maskz_roundscale_pd( __mmask8 k, __m128d a, int imm);
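As a non-normative usage sketch (assuming an AVX-512F target; the function name is illustrative), imm8 = 0x23 selects M = 2 fraction bits with truncation (imm8[1:0] = 11b), so each element is rounded toward zero to a multiple of 0.25:

#include <immintrin.h>

/* imm8 = 0x23: keep 2 fraction bits (imm8[7:4] = 2) and truncate (imm8[1:0] = 11b).
   For example, 1.37 -> 1.25 and -1.37 -> -1.25. */
__m512d round_to_quarters(__m512d v) {
    return _mm512_roundscale_pd(v, 0x23);
}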

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”

VRNDSCALEPH—Round Packed FP16 Values to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.128.NP.0F3A.W0 08 /r /ib
  VRNDSCALEPH xmm1{k1}{z}, xmm2/m128/m16bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Round packed FP16 values in xmm2/m128/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.NP.0F3A.W0 08 /r /ib
  VRNDSCALEPH ymm1{k1}{z}, ymm2/m256/m16bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹.
Description: Round packed FP16 values in ymm2/m256/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.NP.0F3A.W0 08 /r /ib
  VRNDSCALEPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹.
Description: Round packed FP16 values in zmm2/m512/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 (r) N/A

Description
This instruction rounds the FP16 values in the source operand by the rounding mode specified in the immediate
operand (see Table 5-30) and places the result in the destination operand. The destination operand is conditionally
updated according to the writemask.
The rounding process rounds the input to an integral value, plus a number of fraction bits specified by imm8[7:4]
(to be included in the result), and returns the result as an FP16 value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4]
value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and
shown in Table 5-30, “Imm8 Controls for VRNDSCALEPH/VRNDSCALESH.” Bit 3 of the immediate byte controls the
processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify
a non-sticky rounding-mode value.
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN.
The sign of the result of this instruction is preserved, including the sign of zero. Special cases are described in Table
5-31.
The formula of the operation on each data element for VRNDSCALEPH is
ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];
The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
If this instruction encoding’s SPE bit (bit 3) in the immediate operand is 1, VRNDSCALEPH can set MXCSR.UE
without MXCSR.PE.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Table 5-30. Imm8 Controls for VRNDSCALEPH/VRNDSCALESH
Imm8 Bits Description
imm8[7:4] Number of fixed points to preserve.
imm8[3] Suppress Precision Exception (SPE)
0b0: Implies use of MXCSR exception mask.
0b1: Implies suppress.
imm8[2] Round Select (RS)
0b0: Implies use of imm8[1:0].
0b1: Implies use of MXCSR.
imm8[1:0] Round Control Override:
0b00: Round nearest even.
0b01: Round down.
0b10: Round up.
0b11: Truncate.

Table 5-31. VRNDSCALEPH/VRNDSCALESH Special Cases


Input Value Returned Value
Src1 = ±∞ Src1
Src1 = ±NaN Src1 converted to QNaN
Src1 = ±0 Src1

Operation
def round_fp16_to_integer(src, imm8):
if imm8[2] = 1:
rounding_direction := MXCSR.RC
else:
rounding_direction := imm8[1:0]
m := imm8[7:4] // scaling factor

tsrc1 := 2^m * src

if rounding_direction = 0b00:
tmp := round_to_nearest_even_integer(tsrc1)
else if rounding_direction = 0b01:
tmp := round_to_equal_or_smaller_integer(tsrc1)
else if rounding_direction = 0b10:
tmp := round_to_equal_or_larger_integer(tsrc1)
else if rounding_direction = 0b11:
tmp := round_to_smallest_magnitude_integer(tsrc1)

dst := 2^(-m) * tmp

if imm8[3] == 0: // check SPE
if src != dst:
MXCSR.PE := 1
return dst

VRNDSCALEPH dest{k1}, src, imm8
VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := round_fp16_to_integer(tsrc, imm8)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRNDSCALEPH __m128h _mm_mask_roundscale_ph (__m128h src, __mmask8 k, __m128h a, int imm8);
VRNDSCALEPH __m128h _mm_maskz_roundscale_ph (__mmask8 k, __m128h a, int imm8);
VRNDSCALEPH __m128h _mm_roundscale_ph (__m128h a, int imm8);
VRNDSCALEPH __m256h _mm256_mask_roundscale_ph (__m256h src, __mmask16 k, __m256h a, int imm8);
VRNDSCALEPH __m256h _mm256_maskz_roundscale_ph (__mmask16 k, __m256h a, int imm8);
VRNDSCALEPH __m256h _mm256_roundscale_ph (__m256h a, int imm8);
VRNDSCALEPH __m512h _mm512_mask_roundscale_ph (__m512h src, __mmask32 k, __m512h a, int imm8);
VRNDSCALEPH __m512h _mm512_maskz_roundscale_ph (__mmask32 k, __m512h a, int imm8);
VRNDSCALEPH __m512h _mm512_roundscale_ph (__m512h a, int imm8);
VRNDSCALEPH __m512h _mm512_mask_roundscale_round_ph (__m512h src, __mmask32 k, __m512h a, int imm8, const int sae);
VRNDSCALEPH __m512h _mm512_maskz_roundscale_round_ph (__mmask32 k, __m512h a, int imm8, const int sae);
VRNDSCALEPH __m512h _mm512_roundscale_round_ph (__m512h a, int imm8, const int sae);
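As a non-normative usage sketch (assuming a toolchain with AVX512-FP16 support; the function name is illustrative), imm8 = 0x30 selects M = 3 fraction bits with round-to-nearest-even, rounding each FP16 element to the nearest multiple of 0.125:

#include <immintrin.h>

/* imm8 = 0x30: keep 3 fraction bits (imm8[7:4] = 3), RNE taken from imm8[1:0]. */
__m512h round_to_eighths(__m512h v) {
    return _mm512_roundscale_ph(v, 0x30);
}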

SIMD Floating-Point Exceptions


Invalid, Underflow, Precision.

Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”

VRNDSCALEPS—Round Packed Float32 Values to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.128.66.0F3A.W0 08 /r ib
  VRNDSCALEPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
Description: Rounds packed single-precision floating-point values in xmm2/m128/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W0 08 /r ib
  VRNDSCALEPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹.
Description: Rounds packed single-precision floating-point values in ymm2/m256/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register under writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W0 08 /r ib
  VRNDSCALEPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1¹.
Description: Rounds packed single-precision floating-point values in zmm2/m512/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register using writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) imm8 N/A

Description
Round the single precision floating-point values in the source operand by the rounding mode specified in the imme-
diate operand (see Figure 5-29) and places the result in the destination operand.
The destination operand (the first operand) is a ZMM register conditionally updated according to the writemask.
The source operand (the second operand) can be a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32-bit memory location.
The rounding process rounds the input to an integral value, plus number bits of fraction that are specified by
imm8[7:4] (to be included in the result) and returns the result as a single precision floating-point value.
It should be noticed that no overflow is induced while executing this instruction (although the source is scaled by
the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation, three bit fields are defined and
shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor
behavior for a precision exception, bit 2 selects the source of rounding mode control. Bits 1:0 specify a non-sticky
rounding-mode value (immediate control table below lists the encoded values for rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to ‘1 then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.

The formula of the operation on each data element for VRNDSCALEPS is:

ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];

The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).

VRNDSCALEPS is a more general form of the VEX-encoded VROUNDPS instruction. In VROUNDPS, the formula of
the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Handling of special cases of input values is listed in Table 5-29.
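
For illustration, the following minimal sketch uses the intrinsic form below to round packed single-precision values to quarters (M = 2); the function name and input values are illustrative, not part of this specification:

#include <immintrin.h>

/* imm8 layout: bits 7:4 = M (fraction bits kept), bit 3 = suppress #PE,
   bit 2 = 0 selects imm8[1:0] as the rounding mode, bits 1:0 = 00b (RNE). */
__m512 round_to_quarters(__m512 a)
{
    return _mm512_roundscale_ps(a, 0x20); /* M = 2, round to nearest even */
}

With this encoding, an input element of 1.30f yields 1.25f, the nearest multiple of 2^(-2).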

Operation
RoundToIntegerSP(SRC[31:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor

case (rounding_direction)
00: TMP[31:0] := round_to_nearest_even_integer(2^M * SRC[31:0])
01: TMP[31:0] := round_to_equal_or_smaller_integer(2^M * SRC[31:0])
10: TMP[31:0] := round_to_equal_or_larger_integer(2^M * SRC[31:0])
11: TMP[31:0] := round_to_nearest_smallest_magnitude_integer(2^M * SRC[31:0])
ESAC;

Dest[31:0] := 2^(-M) * TMP[31:0] ; scale down back by 2^(-M)


if (imm8[3] = 0) Then ; check SPE
if (SRC[31:0] != Dest[31:0]) Then ; check precision lost
set_precision() ; set #PE
FI;
FI;
return(Dest[31:0])
}

VRNDSCALEPS (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF *src is a memory operand*
THEN TMP_SRC := BROADCAST32(SRC, VL, k1)
ELSE TMP_SRC := SRC
FI;

FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := RoundToIntegerSP(TMP_SRC[i+31:i], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VRNDSCALEPS __m512 _mm512_roundscale_ps( __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_roundscale_round_ps( __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_mask_roundscale_ps(__m512 s, __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_ps( __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_round_ps( __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m256 _mm256_roundscale_ps( __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_mask_roundscale_ps(__m256 s, __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_maskz_roundscale_ps( __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m128 _mm_roundscale_ps( __m128 a, int imm);
VRNDSCALEPS __m128 _mm_mask_roundscale_ps(__m128 s, __mmask8 k, __m128 a, int imm);
VRNDSCALEPS __m128 _mm_maskz_roundscale_ps( __mmask8 k, __m128 a, int imm);

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”

VRNDSCALESD—Round Scalar Float64 Value to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.LLIG.66.0F3A.W1 0B /r ib
  VRNDSCALESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Rounds the scalar double-precision floating-point value in xmm3/m64 to the number of fraction bits specified by the imm8 field. Stores the result in the xmm1 register.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Tuple1 Scalar. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: imm8.

Description
Rounds a double-precision floating-point value in the low quadword element of the second source operand (the third operand) by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in the corresponding element of the destination operand (the first operand) according to the writemask. The quadword element at bits 127:64 of the destination is copied from the first source operand (the second operand).
The destination and first source operands are XMM registers; the second source operand can be an XMM register or a memory location. Bits MAXVL-1:128 of the destination register are cleared.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by imm8[7:4] (to be included in the result), and returns the result as a double-precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it is converted to a QNaN. If MXCSR.DAZ is set to 1, denormals are converted to zero before rounding.
The sign of the result of this instruction is preserved, including the sign of zero.

The formula of the operation for VRNDSCALESD is:

ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];

The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
VRNDSCALESD is a more general form of the VEX-encoded VROUNDSD instruction. In VROUNDSD, the formula of
the operation is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];

EVEX encoded version: The source operand is an XMM register or a 64-bit memory location. The destination operand is an XMM register.

Handling of special cases of input values is listed in Table 5-29.
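
For illustration, with M = 0 the operation reduces to ordinary rounding to an integral value; a minimal sketch using the intrinsic form (the function name is illustrative):

#include <immintrin.h>

/* imm8 = 0x03: M = 0, rounding control from imm8[1:0] = 11b (toward zero).
   The low element of b is truncated; the upper element is taken from a. */
__m128d trunc_low_sd(__m128d a, __m128d b)
{
    return _mm_roundscale_sd(a, b, 0x03);
}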

Operation
RoundToIntegerDP(SRC[63:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor

case (rounding_direction)
00: TMP[63:0] := round_to_nearest_even_integer(2^M * SRC[63:0])
01: TMP[63:0] := round_to_equal_or_smaller_integer(2^M * SRC[63:0])
10: TMP[63:0] := round_to_equal_or_larger_integer(2^M * SRC[63:0])
11: TMP[63:0] := round_to_nearest_smallest_magnitude_integer(2^M * SRC[63:0])
ESAC

Dest[63:0] := 2^(-M) * TMP[63:0] ; scale down back by 2^(-M)

if (imm8[3] = 0) Then ; check SPE


if (SRC[63:0] != Dest[63:0]) Then ; check precision lost
set_precision() ; set #PE
FI;
FI;
return(Dest[63:0])
}

VRNDSCALESD (EVEX encoded version)


IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundToIntegerDP(SRC2[63:0], Zero_upper_imm[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRNDSCALESD __m128d _mm_roundscale_sd ( __m128d a, __m128d b, int imm);
VRNDSCALESD __m128d _mm_roundscale_round_sd ( __m128d a, __m128d b, int imm, int sae);
VRNDSCALESD __m128d _mm_mask_roundscale_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VRNDSCALESD __m128d _mm_mask_roundscale_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm, int sae);
VRNDSCALESD __m128d _mm_maskz_roundscale_sd ( __mmask8 k, __m128d a, __m128d b, int imm);
VRNDSCALESD __m128d _mm_maskz_roundscale_round_sd ( __mmask8 k, __m128d a, __m128d b, int imm, int sae);

SIMD Floating-Point Exceptions

Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VRNDSCALESH—Round Scalar FP16 Value to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.LLIG.NP.0F3A.W0 0A /r ib
  VRNDSCALESH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Description: Rounds the low FP16 value in xmm3/m16 to the number of fraction bits specified by the imm8 field. Stores the result in xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple: Scalar. Operand 1: ModRM:reg (w). Operand 2: VEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: imm8 (r).

Description
This instruction rounds the low FP16 value in the second source operand by the rounding mode specified in the
immediate operand (see Table 5-30) and places the result in the destination operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by imm8[7:4] (to be included in the result), and returns the result as an FP16 value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4]
value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and
shown in Table 5-30, “Imm8 Controls for VRNDSCALEPH/VRNDSCALESH.” Bit 3 of the immediate byte controls the
processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify
a non-sticky rounding-mode value.
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN.
The sign of the result of this instruction is preserved, including the sign of zero. Special cases are described in Table
5-31.
If this instruction encoding’s SPE bit (bit 3) in the immediate operand is 1, VRNDSCALESH can set MXCSR.UE
without MXCSR.PE.
The formula of the operation on each data element for VRNDSCALESH is:

ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];

The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
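
For illustration, a minimal sketch that sets the SPE bit (imm8 bit 3) so that no precision exception is reported (assumes compiler support for the AVX512-FP16 intrinsics; the function name and the choice of M are illustrative):

#include <immintrin.h>

/* imm8 = 0x18: M = 1 (halves), bit 3 = 1 suppresses #PE,
   bits 1:0 = 00b selects round-to-nearest-even from the immediate. */
__m128h round_sh_to_halves_quiet(__m128h a, __m128h b)
{
    return _mm_roundscale_sh(a, b, 0x18);
}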

Operation
VRNDSCALESH dest{k1}, src1, src2, imm8
IF k1[0] or *no writemask*:
DEST.fp16[0] := round_fp16_to_integer(src2.fp16[0], imm8) // see VRNDSCALEPH
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRNDSCALESH __m128h _mm_mask_roundscale_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int imm8, const int sae);
VRNDSCALESH __m128h _mm_maskz_roundscale_round_sh (__mmask8 k, __m128h a, __m128h b, int imm8, const int sae);
VRNDSCALESH __m128h _mm_roundscale_round_sh (__m128h a, __m128h b, int imm8, const int sae);
VRNDSCALESH __m128h _mm_mask_roundscale_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int imm8);
VRNDSCALESH __m128h _mm_maskz_roundscale_sh (__mmask8 k, __m128h a, __m128h b, int imm8);
VRNDSCALESH __m128h _mm_roundscale_sh (__m128h a, __m128h b, int imm8);

SIMD Floating-Point Exceptions


Invalid, Underflow, Precision.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”

VRNDSCALESS—Round Scalar Float32 Value to Include a Given Number of Fraction Bits
Opcode/Instruction: EVEX.LLIG.66.0F3A.W0 0A /r ib
  VRNDSCALESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Rounds the scalar single-precision floating-point value in xmm3/m32 to the number of fraction bits specified by the imm8 field. Stores the result in the xmm1 register under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Tuple1 Scalar. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: imm8.

Description
Rounds the single-precision floating-point value in the low doubleword element of the second source operand (the third operand) by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in the corresponding element of the destination operand (the first operand) according to the writemask. The doubleword elements at bits 127:32 of the destination are copied from the first source operand (the second operand).
The destination and first source operands are XMM registers; the second source operand can be an XMM register or a memory location. Bits MAXVL-1:128 of the destination register are cleared.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by imm8[7:4] (to be included in the result), and returns the result as a single-precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it is converted to a QNaN. If MXCSR.DAZ is set to 1, denormals are converted to zero before rounding.
The sign of the result of this instruction is preserved, including the sign of zero.

The formula of the operation for VRNDSCALESS is:

ROUND(x) = 2^(-M) * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];

The operation x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
VRNDSCALESS is a more general form of the VEX-encoded VROUNDSS instruction. In VROUNDSS, the formula of
the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];

EVEX encoded version: The source operand is an XMM register or a 32-bit memory location. The destination operand is an XMM register.
Handling of special cases of input values is listed in Table 5-29.
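
For illustration, a minimal sketch of the merging-masked form (the function name is illustrative): when k[0] is clear, the low element of the result comes from s rather than being computed:

#include <immintrin.h>

/* Rounds b[0] to an integral value (M = 0, RNE) only when the low mask
   bit is set; otherwise the low element is merged from s. */
__m128 roundscale_low_masked(__m128 s, __mmask8 k, __m128 a, __m128 b)
{
    return _mm_mask_roundscale_ss(s, k, a, b, 0x00);
}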

Operation
RoundToIntegerSP(SRC[31:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor

case (rounding_direction)
00: TMP[31:0] := round_to_nearest_even_integer(2^M * SRC[31:0])
01: TMP[31:0] := round_to_equal_or_smaller_integer(2^M * SRC[31:0])
10: TMP[31:0] := round_to_equal_or_larger_integer(2^M * SRC[31:0])
11: TMP[31:0] := round_to_nearest_smallest_magnitude_integer(2^M * SRC[31:0])
ESAC;

Dest[31:0] := 2^(-M) * TMP[31:0] ; scale down back by 2^(-M)


if (imm8[3] = 0) Then ; check SPE
if (SRC[31:0] != Dest[31:0]) Then ; check precision lost
set_precision() ; set #PE
FI;
FI;
return(Dest[31:0])
}

VRNDSCALESS (EVEX encoded version)


IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundToIntegerSP(SRC2[31:0], Zero_upper_imm[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRNDSCALESS __m128 _mm_roundscale_ss ( __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_roundscale_round_ss ( __m128 a, __m128 b, int imm, int sae);
VRNDSCALESS __m128 _mm_mask_roundscale_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_mask_roundscale_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm, int sae);
VRNDSCALESS __m128 _mm_maskz_roundscale_ss ( __mmask8 k, __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_maskz_roundscale_round_ss ( __mmask8 k, __m128 a, __m128 b, int imm, int sae);

SIMD Floating-Point Exceptions


Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”

VRSQRT14PD—Compute Approximate Reciprocals of Square Roots of Packed Float64 Values
Opcode/Instruction: EVEX.128.66.0F38.W1 4E /r
  VRSQRT14PD xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed double-precision floating-point values in xmm2/m128/m64bcst and stores the results in xmm1, under writemask.

Opcode/Instruction: EVEX.256.66.0F38.W1 4E /r
  VRSQRT14PD ymm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed double-precision floating-point values in ymm2/m256/m64bcst and stores the results in ymm1, under writemask.

Opcode/Instruction: EVEX.512.66.0F38.W1 4E /r
  VRSQRT14PD zmm1 {k1}{z}, zmm2/m512/m64bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed double-precision floating-point values in zmm2/m512/m64bcst and stores the results in zmm1, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Full. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: N/A. Operand 4: N/A.

Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of the eight
packed double precision floating-point values in the source operand (the second operand) and stores the packed
double precision floating-point results in the destination operand (the first operand) according to the writemask.
The maximum relative error for this approximation is less than 2^(-14).
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit
vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally
updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 64-bit memory location. The destination operand is a YMM register, conditionally updated using
writemask k1.
EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a XMM register, conditionally updated using
writemask k1.
The VRSQRT14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +0 is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly, and the approximation is performed with the specified masked response. When a source value is negative (other than -0.0), a floating-point QNaN_Indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
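
Because the approximation carries about 14 bits of accuracy, one Newton-Raphson step, y' = y*(1.5 - 0.5*x*y*y), roughly doubles that accuracy when a more precise reciprocal square root is needed. A minimal sketch (illustrative, not part of this specification):

#include <immintrin.h>

/* One Newton-Raphson refinement of the ~2^(-14) VRSQRT14PD estimate. */
__m512d rsqrt_refined_pd(__m512d x)
{
    __m512d y  = _mm512_rsqrt14_pd(x);
    __m512d hx = _mm512_mul_pd(_mm512_set1_pd(0.5), x);
    /* e = 1.5 - 0.5*x*y*y, computed with a fused multiply-add */
    __m512d e  = _mm512_fnmadd_pd(_mm512_mul_pd(hx, y), y, _mm512_set1_pd(1.5));
    return _mm512_mul_pd(y, e);
}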

Operation
VRSQRT14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := APPROXIMATE(1.0/ SQRT(SRC[63:0]));
ELSE DEST[i+63:i] := APPROXIMATE(1.0/ SQRT(SRC[i+63:i]));
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Table 5-32. VRSQRT14PD Special Cases

Input value      Result value       Comments
Any denormal     Normal             Cannot generate overflow
X = 2^(-2n)      2^n
X < 0            QNaN_Indefinite    Including -INF
X = -0           -INF
X = +0           +INF
X = +INF         +0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRT14PD __m512d _mm512_rsqrt14_pd( __m512d a);
VRSQRT14PD __m512d _mm512_mask_rsqrt14_pd(__m512d s, __mmask8 k, __m512d a);
VRSQRT14PD __m512d _mm512_maskz_rsqrt14_pd( __mmask8 k, __m512d a);
VRSQRT14PD __m256d _mm256_rsqrt14_pd( __m256d a);
VRSQRT14PD __m256d _mm256_mask_rsqrt14_pd(__m256d s, __mmask8 k, __m256d a);
VRSQRT14PD __m256d _mm256_maskz_rsqrt14_pd( __mmask8 k, __m256d a);
VRSQRT14PD __m128d _mm_rsqrt14_pd( __m128d a);
VRSQRT14PD __m128d _mm_mask_rsqrt14_pd(__m128d s, __mmask8 k, __m128d a);
VRSQRT14PD __m128d _mm_maskz_rsqrt14_pd( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VRSQRT14PS—Compute Approximate Reciprocals of Square Roots of Packed Float32 Values
Opcode/Instruction: EVEX.128.66.0F38.W0 4E /r
  VRSQRT14PS xmm1 {k1}{z}, xmm2/m128/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in xmm2/m128/m32bcst and stores the results in xmm1, under writemask.

Opcode/Instruction: EVEX.256.66.0F38.W0 4E /r
  VRSQRT14PS ymm1 {k1}{z}, ymm2/m256/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in ymm2/m256/m32bcst and stores the results in ymm1, under writemask.

Opcode/Instruction: EVEX.512.66.0F38.W0 4E /r
  VRSQRT14PS zmm1 {k1}{z}, zmm2/m512/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in zmm2/m512/m32bcst and stores the results in zmm1, under writemask.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Full. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: N/A. Operand 4: N/A.

Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of 16 packed
single precision floating-point values in the source operand (the second operand) and stores the packed single
precision floating-point results in the destination operand (the first operand) according to the writemask. The
maximum relative error for this approximation is less than 2^(-14).
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally
updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using
writemask k1.
EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a XMM register, conditionally updated using
writemask k1.
The VRSQRT14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +0 is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly, and the approximation is performed with the specified masked response. When a source value is negative (other than -0.0), a floating-point QNaN_Indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
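
A typical use is approximate normalization where 14 bits of accuracy suffice. A minimal sketch (the function name and inputs are illustrative):

#include <immintrin.h>

/* Scales 16 packed floats in v by approximately 1/sqrt of the
   corresponding squared lengths in len2. */
__m512 normalize_approx_ps(__m512 v, __m512 len2)
{
    return _mm512_mul_ps(v, _mm512_rsqrt14_ps(len2));
}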

Operation
VRSQRT14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := APPROXIMATE(1.0/ SQRT(SRC[31:0]));
ELSE DEST[i+31:i] := APPROXIMATE(1.0/ SQRT(SRC[i+31:i]));
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0

Table 5-33. VRSQRT14PS Special Cases

Input value      Result value       Comments
Any denormal     Normal             Cannot generate overflow
X = 2^(-2n)      2^n
X < 0            QNaN_Indefinite    Including -INF
X = -0           -INF
X = +0           +INF
X = +INF         +0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRT14PS __m512 _mm512_rsqrt14_ps( __m512 a);
VRSQRT14PS __m512 _mm512_mask_rsqrt14_ps(__m512 s, __mmask16 k, __m512 a);
VRSQRT14PS __m512 _mm512_maskz_rsqrt14_ps( __mmask16 k, __m512 a);
VRSQRT14PS __m256 _mm256_rsqrt14_ps( __m256 a);
VRSQRT14PS __m256 _mm256_mask_rsqrt14_ps(__m256 s, __mmask8 k, __m256 a);
VRSQRT14PS __m256 _mm256_maskz_rsqrt14_ps( __mmask8 k, __m256 a);
VRSQRT14PS __m128 _mm_rsqrt14_ps( __m128 a);
VRSQRT14PS __m128 _mm_mask_rsqrt14_ps(__m128 s, __mmask8 k, __m128 a);
VRSQRT14PS __m128 _mm_maskz_rsqrt14_ps( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”

VRSQRT14SD—Compute Approximate Reciprocal of Square Root of Scalar Float64 Value
Opcode/Instruction: EVEX.LLIG.66.0F38.W1 4F /r
  VRSQRT14SD xmm1 {k1}{z}, xmm2, xmm3/m64
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square root of the scalar double-precision floating-point value in xmm3/m64 and stores the result in the low quadword element of xmm1 using writemask k1. Bits 127:64 of xmm2 are copied to xmm1[127:64].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Tuple1 Scalar. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
Computes the approximate reciprocal of the square root of the scalar double-precision floating-point value in the
low quadword element of the source operand (the second operand) and stores the result in the low quadword
element of the destination operand (the first operand) according to the writemask. The maximum relative error for
this approximation is less than 2^(-14). The source operand can be an XMM register or a 64-bit memory location. The
destination operand is an XMM register.
Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
The VRSQRT14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +0 is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly, and the approximation is performed with the specified masked response. When a source value is negative (other than -0.0), a floating-point QNaN_Indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
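
A minimal sketch of the scalar form (illustrative); note that the upper quadword of the result is copied from the first source:

#include <immintrin.h>

/* Approximate 1/sqrt(x) to about 14 bits via VRSQRT14SD. */
double rsqrt14(double x)
{
    __m128d vx = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_rsqrt14_sd(vx, vx));
}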

Operation
VRSQRT14SD (EVEX version)
IF k1[0] or *no writemask*
THEN DEST[63:0] := APPROXIMATE(1.0/ SQRT(SRC2[63:0]))
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Table 5-34. VRSQRT14SD Special Cases

Input value      Result value       Comments
Any denormal     Normal             Cannot generate overflow
X = 2^(-2n)      2^n
X < 0            QNaN_Indefinite    Including -INF
X = -0           -INF
X = +0           +INF
X = +INF         +0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRT14SD __m128d _mm_rsqrt14_sd( __m128d a, __m128d b);
VRSQRT14SD __m128d _mm_mask_rsqrt14_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VRSQRT14SD __m128d _mm_maskz_rsqrt14_sd( __mmask8 k, __m128d a, __m128d b);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”

VRSQRT14SS—Compute Approximate Reciprocal of Square Root of Scalar Float32 Value
Opcode/Instruction: EVEX.LLIG.66.0F38.W0 4F /r
  VRSQRT14SS xmm1 {k1}{z}, xmm2, xmm3/m32
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square root of the scalar single-precision floating-point value in xmm3/m32 and stores the result in the low doubleword element of xmm1 using writemask k1. Bits 127:32 of xmm2 are copied to xmm1[127:32].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Tuple1 Scalar. Operand 1: ModRM:reg (w). Operand 2: VEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
Computes the approximate reciprocal of the square root of the scalar single-precision floating-point value in the
low doubleword element of the source operand (the second operand) and stores the result in the low doubleword
element of the destination operand (the first operand) according to the writemask. The maximum relative error for
this approximation is less than 2^(-14). The source operand can be an XMM register or a 32-bit memory location. The
destination operand is an XMM register.
Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
The VRSQRT14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +0 is returned (see Table 5-35). A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly, and the approximation is performed with the specified masked response. When a source value is negative (other than -0.0, including -∞), a floating-point QNaN_Indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN, or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT14SS (EVEX version)
IF k1[0] or *no writemask*
THEN DEST[31:0] := APPROXIMATE(1.0/ SQRT(SRC2[31:0]))
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Table 5-35. VRSQRT14SS Special Cases

Input value      Result value       Comments
Any denormal     Normal             Cannot generate overflow
X = 2^(-2n)      2^n
X < 0            QNaN_Indefinite    Including -INF
X = -0           -INF
X = +0           +INF
X = +INF         +0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRT14SS __m128 _mm_rsqrt14_ss( __m128 a, __m128 b);
VRSQRT14SS __m128 _mm_mask_rsqrt14_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VRSQRT14SS __m128 _mm_maskz_rsqrt14_ss( __mmask8 k, __m128 a, __m128 b);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”

VRSQRTPH—Compute Reciprocals of Square Roots of Packed FP16 Values
Opcode/Instruction: EVEX.128.66.MAP6.W0 4E /r
  VRSQRTPH xmm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocals of the square roots of packed FP16 values in xmm2/m128/m16bcst and stores the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.MAP6.W0 4E /r
  VRSQRTPH ymm1{k1}{z}, ymm2/m256/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocals of the square roots of packed FP16 values in ymm2/m256/m16bcst and stores the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.MAP6.W0 4E /r
  VRSQRTPH zmm1{k1}{z}, zmm2/m512/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocals of the square roots of packed FP16 values in zmm2/m512/m16bcst and stores the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple: Full. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: N/A. Operand 4: N/A.

Description
This instruction performs a SIMD computation of the approximate reciprocals square-root of 8/16/32 packed FP16
floating-point values in the source operand (the second operand) and stores the packed FP16 floating-point results
in the destination operand.
The maximum relative error for this approximation is less than 2−11 + 2−14. For special cases, see Table 5-36.
The destination elements are updated according to the writemask.

Table 5-36. VRSQRTPH/VRSQRTSH Special Cases

Input value      Result value       Comments
Any denormal     Normal             Cannot generate overflow
X = 2^(-2n)      2^n
X < 0            QNaN_Indefinite    Including −∞
X = −0           −∞
X = +0           +∞
X = +∞           +0
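
For illustration, a minimal sketch of the zeroing-masked form (assumes compiler support for the AVX512-FP16 intrinsics; the function name is illustrative):

#include <immintrin.h>

/* Approximate reciprocal square roots of 32 packed FP16 values;
   lanes whose mask bit is clear are zeroed. */
__m512h rsqrt_ph_masked(__mmask32 k, __m512h x)
{
    return _mm512_maskz_rsqrt_ph(k, x);
}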



Operation
VRSQRTPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := APPROXIMATE(1.0 / SQRT(tsrc) )
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRTPH __m128h _mm_mask_rsqrt_ph (__m128h src, __mmask8 k, __m128h a);
VRSQRTPH __m128h _mm_maskz_rsqrt_ph (__mmask8 k, __m128h a);
VRSQRTPH __m128h _mm_rsqrt_ph (__m128h a);
VRSQRTPH __m256h _mm256_mask_rsqrt_ph (__m256h src, __mmask16 k, __m256h a);
VRSQRTPH __m256h _mm256_maskz_rsqrt_ph (__mmask16 k, __m256h a);
VRSQRTPH __m256h _mm256_rsqrt_ph (__m256h a);
VRSQRTPH __m512h _mm512_mask_rsqrt_ph (__m512h src, __mmask32 k, __m512h a);
VRSQRTPH __m512h _mm512_maskz_rsqrt_ph (__mmask32 k, __m512h a);
VRSQRTPH __m512h _mm512_rsqrt_ph (__m512h a);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”



VRSQRTSH—Compute Approximate Reciprocal of Square Root of Scalar FP16 Value
Opcode/Instruction: EVEX.LLIG.66.MAP6.W0 4F /r
  VRSQRTSH xmm1{k1}{z}, xmm2, xmm3/m16
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Description: Computes the approximate reciprocal square root of the FP16 value in xmm3/m16 and stores the result in the low word element of xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple: Scalar. Operand 1: ModRM:reg (w). Operand 2: VEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
This instruction performs the computation of the approximate reciprocal square-root of the low FP16 value in the
second source operand (the third operand) and stores the result in the low word element of the destination operand
(the first operand) according to the writemask k1.
The maximum relative error for this approximation is less than 2^(-11) + 2^(-14).
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL−1:128 of the destination operand are zeroed.
For special cases, see Table 5-36.

Operation
VRSQRTSH dest{k1}, src1, src2
VL = 128, 256 or 512
KL := VL/16

IF k1[0] or *no writemask*:


DEST.fp16[0] := APPROXIMATE(1.0 / SQRT(src2.fp16[0]))
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VRSQRTSH __m128h _mm_mask_rsqrt_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VRSQRTSH __m128h _mm_maskz_rsqrt_sh (__mmask8 k, __m128h a, __m128h b);
VRSQRTSH __m128h _mm_rsqrt_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


None.

Other Exceptions
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”

VSCALEFPD—Scale Packed Float64 Values With Float64 Values
Opcode/Instruction: EVEX.128.66.0F38.W1 2C /r
  VSCALEFPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Scales the packed double-precision floating-point values in xmm2 using values from xmm3/m128/m64bcst, under writemask k1.

Opcode/Instruction: EVEX.256.66.0F38.W1 2C /r
  VSCALEFPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Scales the packed double-precision floating-point values in ymm2 using values from ymm3/m256/m64bcst, under writemask k1.

Opcode/Instruction: EVEX.512.66.0F38.W1 2C /r
  VSCALEFPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Scales the packed double-precision floating-point values in zmm2 using values from zmm3/m512/m64bcst, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Full. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
Performs a floating-point scale of the packed double-precision floating-point values in the first source operand by multiplying them by 2 to the power of the double-precision floating-point values in the second source operand.
The equation of this operation is given by:

zmm1 := zmm2 * 2^floor(zmm3)

Floor(zmm3) means the maximum integer value ≤ zmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a
512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The
destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-38.
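
When the scaling operand holds integral values, VSCALEFPD behaves as a vector ldexp(). A minimal sketch (illustrative, not part of this specification) that scales eight doubles by 2^n for 32-bit integer exponents:

#include <immintrin.h>

/* Vector ldexp: a[i] * 2^n[i]; overflow and underflow are handled by
   VSCALEFPD itself, as described above. */
__m512d ldexp_pd(__m512d a, __m256i n)
{
    return _mm512_scalef_pd(a, _mm512_cvtepi32_pd(n));
}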



Table 5-37. VSCALEFPD/SD/PS/SS Special Cases

Src1 \ Src2      ±NaN         +Inf              -Inf              0/Denorm/Norm    Set IE
±QNaN            QNaN(Src1)   +INF              +0                QNaN(Src1)       IF either source is SNaN
±SNaN            QNaN(Src1)   QNaN(Src1)        QNaN(Src1)        QNaN(Src1)       YES
±Inf             QNaN(Src2)   Src1              QNaN_Indefinite   Src1             IF Src2 is SNaN or -INF
±0               QNaN(Src2)   QNaN_Indefinite   Src1              Src1             IF Src2 is SNaN or +INF
Denorm/Norm      QNaN(Src2)   ±INF (Src1 sign)  ±0 (Src1 sign)    Compute Result   IF Src2 is SNaN

Table 5-38. Additional VSCALEFPD/SD Special Cases

Special Case           Returned Value                                 Faults
|result| < 2^(-1074)   ±0 or ±Min-Denormal (Src1 sign)                Underflow
|result| ≥ 2^1024      ±INF (Src1 sign) or ±Max-normal (Src1 sign)    Overflow

Operation
SCALE(SRC1, SRC2)
{
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 := 0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 := 0
/* SRC2 is a 64-bit floating-point value */
DEST[63:0] := TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}
VSCALEFPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SCALE(SRC1[i+63:i], SRC2[63:0]);
ELSE DEST[i+63:i] := SCALE(SRC1[i+63:i], SRC2[i+63:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFPD __m512d _mm512_scalef_round_pd(__m512d a, __m512d b, int rounding);
VSCALEFPD __m512d _mm512_mask_scalef_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int rounding);
VSCALEFPD __m512d _mm512_maskz_scalef_round_pd(__mmask8 k, __m512d a, __m512d b, int rounding);
VSCALEFPD __m512d _mm512_scalef_pd(__m512d a, __m512d b);
VSCALEFPD __m512d _mm512_mask_scalef_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VSCALEFPD __m512d _mm512_maskz_scalef_pd(__mmask8 k, __m512d a, __m512d b);
VSCALEFPD __m256d _mm256_scalef_pd(__m256d a, __m256d b);
VSCALEFPD __m256d _mm256_mask_scalef_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VSCALEFPD __m256d _mm256_maskz_scalef_pd(__mmask8 k, __m256d a, __m256d b);
VSCALEFPD __m128d _mm_scalef_pd(__m128d a, __m128d b);
VSCALEFPD __m128d _mm_mask_scalef_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VSCALEFPD __m128d _mm_maskz_scalef_pd(__mmask8 k, __m128d a, __m128d b);

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal (for Src1).
Denormal is not reported for Src2.

Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”



VSCALEFPH—Scale Packed FP16 Values with FP16 Values
Opcode/Instruction: EVEX.128.66.MAP6.W0 2C /r
  VSCALEFPH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Description: Scales the packed FP16 values in xmm2 using values from xmm3/m128/m16bcst, and stores the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.MAP6.W0 2C /r
  VSCALEFPH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see Note 1).
Description: Scales the packed FP16 values in ymm2 using values from ymm3/m256/m16bcst, and stores the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.MAP6.W0 2C /r
  VSCALEFPH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1).
Description: Scales the packed FP16 values in zmm2 using values from zmm3/m512/m16bcst, and stores the result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple: Full. Operand 1: ModRM:reg (w). Operand 2: VEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
This instruction performs a floating-point scale of the packed FP16 values in the first source operand by multiplying them by 2 to the power of the FP16 values in the second source operand. The destination elements are updated according to the writemask.
The equation of this operation is given by:

zmm1 := zmm2 * 2^floor(zmm3)

Floor(zmm3) means the maximum integer value ≤ zmm3.
If the result cannot be represented in FP16, then the proper overflow response (for positive scaling operand), or
the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are
dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception
mask bits), and on the SAE bit.
Handling of special-case input values is listed in Table 5-39 and Table 5-40.



Table 5-39. VSCALEFPH/VSCALEFSH Special Cases

Src1 \ Src2      ±NaN         +INF              −INF              0/Denorm/Norm    Set IE
±QNaN            QNaN(Src1)   +INF              +0                QNaN(Src1)       IF either source is SNaN
±SNaN            QNaN(Src1)   QNaN(Src1)        QNaN(Src1)        QNaN(Src1)       YES
±INF             QNaN(Src2)   Src1              QNaN_Indefinite   Src1             IF Src2 is SNaN or −INF
±0               QNaN(Src2)   QNaN_Indefinite   Src1              Src1             IF Src2 is SNaN or +INF
Denorm/Norm      QNaN(Src2)   ±INF (Src1 sign)  ±0 (Src1 sign)    Compute Result   IF Src2 is SNaN

Table 5-40. Additional VSCALEFPH/VSCALEFSH Special Cases

Special Case         Returned Value                                 Faults
|result| < 2^(-24)   ±0 or ±Min-Denormal (Src1 sign)                Underflow
|result| ≥ 2^16      ±INF (Src1 sign) or ±Max-normal (Src1 sign)    Overflow

Operation
def scale_fp16(src1,src2):
tmp1 := src1
tmp2 := src2
return tmp1 * POW(2, FLOOR(tmp2))

VSCALEFPH dest{k1}, src1, src2


VL = 128, 256, or 512
KL := VL / 16

IF (VL = 512) AND (EVEX.b = 1) and no memory operand:


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and (EVEX.b = 1):
tsrc := src2.fp16[0]
ELSE:
tsrc := src2.fp16[i]
dest.fp16[i] := scale_fp16(src1.fp16[i],tsrc)
ELSE IF *zeroing*:
dest.fp16[i] := 0
//else dest.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFPH __m128h _mm_mask_scalef_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VSCALEFPH __m128h _mm_maskz_scalef_ph (__mmask8 k, __m128h a, __m128h b);
VSCALEFPH __m128h _mm_scalef_ph (__m128h a, __m128h b);
VSCALEFPH __m256h _mm256_mask_scalef_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VSCALEFPH __m256h _mm256_maskz_scalef_ph (__mmask16 k, __m256h a, __m256h b);
VSCALEFPH __m256h _mm256_scalef_ph (__m256h a, __m256h b);
VSCALEFPH __m512h _mm512_mask_scalef_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VSCALEFPH __m512h _mm512_maskz_scalef_ph (__mmask32 k, __m512h a, __m512h b);
VSCALEFPH __m512h _mm512_scalef_ph (__m512h a, __m512h b);
VSCALEFPH __m512h _mm512_mask_scalef_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, const int rounding);
VSCALEFPH __m512h _mm512_maskz_scalef_round_ph (__mmask32 k, __m512h a, __m512h b, const int rounding);
VSCALEFPH __m512h _mm512_scalef_round_ph (__m512h a, __m512h b, const int rounding);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions”.
Denormal-operand exception (#D) is checked and signaled for the src1 operand, but not for the src2 operand. The denormal-operand exception is checked for the src1 operand only if the src2 operand is not NaN. If the src2 operand is NaN, the processor generates a NaN and does not signal the denormal-operand exception, even if the src1 operand is denormal.



VSCALEFPS—Scale Packed Float32 Values With Float32 Values
Opcode/Instruction: EVEX.128.66.0F38.W0 2C /r
  VSCALEFPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Scales the packed single-precision floating-point values in xmm2 using values from xmm3/m128/m32bcst, under writemask k1.

Opcode/Instruction: EVEX.256.66.0F38.W0 2C /r
  VSCALEFPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see Note 1).
Description: Scales the packed single-precision floating-point values in ymm2 using values from ymm3/m256/m32bcst, under writemask k1.

Opcode/Instruction: EVEX.512.66.0F38.W0 2C /r
  VSCALEFPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Scales the packed single-precision floating-point values in zmm2 using values from zmm3/m512/m32bcst, under writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En: A. Tuple Type: Full. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: N/A.

Description
Performs a floating-point scale of the packed single-precision floating-point values in the first source operand by multiplying them by 2 to the power of the float32 values in the second source operand.
The equation of this operation is given by:

zmm1 := zmm2 * 2^floor(zmm3)

Floor(zmm3) means the maximum integer value ≤ zmm3.

If the result cannot be represented in single precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register conditionally updated with writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-41.

Table 5-41. Additional VSCALEFPS/SS Special Cases

Special Case          Returned Value                                 Faults
|result| < 2^(-149)   ±0 or ±Min-Denormal (Src1 sign)                Underflow
|result| ≥ 2^128      ±INF (Src1 sign) or ±Max-normal (Src1 sign)    Overflow
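
A common use is the final scaling step of a vectorized exp() kernel, where a polynomial result p ≈ 2^f is combined with the integral part k of the reduced argument; using VSCALEFPS rather than inserting the exponent with integer operations provides the overflow and underflow responses above for free. A minimal sketch (p and k are assumed to come from earlier range reduction, which is not shown):

#include <immintrin.h>

/* result[i] = p[i] * 2^floor(k[i]); k already holds integral values. */
__m512 exp_scale_ps(__m512 p, __m512 k)
{
    return _mm512_scalef_ps(p, k);
}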



Operation
SCALE(SRC1, SRC2)
{ ; Check for denormal operands
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 := 0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 := 0
/* SRC2 is a 32-bit floating-point value */
DEST[31:0] := TMP_SRC1[31:0] * POW(2, Floor(TMP_SRC2[31:0]))
}

VSCALEFPS (EVEX encoded versions)


(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SCALE(SRC1[i+31:i], SRC2[31:0]);
ELSE DEST[i+31:i] := SCALE(SRC1[i+31:i], SRC2[i+31:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;

Intel C/C++ Compiler Intrinsic Equivalent


VSCALEFPS __m512 _mm512_scalef_round_ps(__m512 a, __m512 b, int rounding);
VSCALEFPS __m512 _mm512_mask_scalef_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int rounding);
VSCALEFPS __m512 _mm512_maskz_scalef_round_ps(__mmask16 k, __m512 a, __m512 b, int rounding);
VSCALEFPS __m512 _mm512_scalef_ps(__m512 a, __m512 b);
VSCALEFPS __m512 _mm512_mask_scalef_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VSCALEFPS __m512 _mm512_maskz_scalef_ps(__mmask16 k, __m512 a, __m512 b);
VSCALEFPS __m256 _mm256_scalef_ps(__m256 a, __m256 b);
VSCALEFPS __m256 _mm256_mask_scalef_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VSCALEFPS __m256 _mm256_maskz_scalef_ps(__mmask8 k, __m256 a, __m256 b);
VSCALEFPS __m128 _mm_scalef_ps(__m128 a, __m128 b);
VSCALEFPS __m128 _mm_mask_scalef_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VSCALEFPS __m128 _mm_maskz_scalef_ps(__mmask8 k, __m128 a, __m128 b);
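
The following C fragment is an illustrative sketch, not part of the normative text; it assumes a processor and toolchain with AVX512F support (e.g., GCC/Clang with -mavx512f):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512 a = _mm512_set1_ps(3.0f);
    __m512 b = _mm512_set1_ps(2.7f);     /* floor(2.7) = 2 */
    __m512 r = _mm512_scalef_ps(a, b);   /* each element: 3.0 * 2^2 = 12.0 */
    float out[16];
    _mm512_storeu_ps(out, r);
    printf("%f\n", out[0]);              /* prints 12.000000 */
    return 0;
}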

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal (for Src1).
Denormal is not reported for Src2.



Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”



VSCALEFSD—Scale Scalar Float64 Values With Float64 Values
Opcode/ Op / 64/32 CPUID Description
Instruction En bit Mode Feature Flag
Support
EVEX.LLIG.66.0F38.W1 2D /r A V/V AVX512F Scale the scalar double precision floating-point values
VSCALEFSD xmm1 {k1}{z}, xmm2, OR AVX10.1¹ in xmm2 using the value from xmm3/m64. Under
xmm3/m64{er} writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a floating-point scale of the scalar double precision floating-point value in the first source operand by
multiplying it by 2 to the power of the double precision floating-point value in the second source operand.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means the maximum integer value ≤ xmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register
or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-38.

Operation
SCALE(SRC1, SRC2)
{
; Check for denormal operands
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 := 0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 := 0
/* SRC2 is a 64-bit floating-point value */
DEST[63:0] := TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}



VSCALEFSD (EVEX encoded version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] OR *no writemask*
THEN DEST[63:0] := SCALE(SRC1[63:0], SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VSCALEFSD __m128d _mm_scalef_round_sd(__m128d a, __m128d b, int);
VSCALEFSD __m128d _mm_mask_scalef_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VSCALEFSD __m128d _mm_maskz_scalef_round_sd(__mmask8 k, __m128d a, __m128d b, int);
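
As an illustrative sketch (assuming AVX512F toolchain support), the rounding argument of the _round_ form supplies the embedded rounding control; _MM_FROUND_CUR_DIRECTION selects the current MXCSR rounding mode:

#include <immintrin.h>

/* Scale x by 2^floor(y) with the scalar double precision form. */
double scalef_sd(double x, double y)
{
    __m128d a = _mm_set_sd(x);
    __m128d b = _mm_set_sd(y);
    __m128d r = _mm_scalef_round_sd(a, b, _MM_FROUND_CUR_DIRECTION);
    return _mm_cvtsd_f64(r);   /* e.g., scalef_sd(1.5, -1.0) == 0.75 */
}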

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal (for Src1).
Denormal is not reported for Src2.

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VSCALEFSH—Scale Scalar FP16 Values with FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.66.MAP6.W0 2D /r A V/V AVX512-FP16 Scale the FP16 values in xmm2 using the value
VSCALEFSH xmm1{k1}{z}, xmm2, OR AVX10.1¹ from xmm3/m16 and store the result in xmm1
xmm3/m16 {er} subject to writemask k1. Bits 127:16 from xmm2
are copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a floating-point scale of the low FP16 element in the first source operand by multiplying it
by 2 to the power of the low FP16 element in the second source operand, storing the result in the low element of the
destination operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means the maximum integer value ≤ xmm3.
If the result cannot be represented in FP16, then the proper overflow response (for positive scaling operand), or
the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are
dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception
mask bits, FTZ bit), and on the SAE bit.
Handling of special-case input values is listed in Table 5-39 and Table 5-40.

Operation
VSCALEFSH dest{k1}, src1, src2
IF (EVEX.b = 1) and no memory operand:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] or *no writemask*:


dest.fp16[0] := scale_fp16(src1.fp16[0], src2.fp16[0]) // see VSCALEFPH
ELSE IF *zeroing*:
dest.fp16[0] := 0
//else DEST.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFSH __m128h _mm_mask_scalef_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VSCALEFSH __m128h _mm_maskz_scalef_round_sh (__mmask8 k, __m128h a, __m128h b, const int rounding);
VSCALEFSH __m128h _mm_scalef_round_sh (__m128h a, __m128h b, const int rounding);
VSCALEFSH __m128h _mm_mask_scalef_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VSCALEFSH __m128h _mm_maskz_scalef_sh (__mmask8 k, __m128h a, __m128h b);
VSCALEFSH __m128h _mm_scalef_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Denormal-operand exception (#D) is checked and signaled for src1 operand, but not for src2 operand. The
denormal-operand exception is checked for src1 operand only if the src2 operand is not NaN. If the src2 operand is
NaN, the processor generates NaN and does not signal denormal-operand exception, even if src1 operand is
denormal.



VSCALEFSS—Scale Scalar Float32 Value With Float32 Value
Opcode/ Op / 64/32 CPUID Description
Instruction En bit Mode Feature Flag
Support
EVEX.LLIG.66.0F38.W0 2D /r A V/V AVX512F Scale the scalar single-precision floating-point value in
VSCALEFSS xmm1 {k1}{z}, xmm2, OR AVX10.1¹ xmm2 using floating-point value from xmm3/m32. Under
xmm3/m32{er} writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a floating-point scale of the scalar single precision floating-point value in the first source operand by
multiplying it by 2 to the power of the float32 value in the second source operand.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means the maximum integer value ≤ xmm3.

If the result cannot be represented in single precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register
or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-41.

Operation
SCALE(SRC1, SRC2)
{
; Check for denormal operands
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 := 0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 := 0
/* SRC2 is a 32-bit floating-point value */
DEST[31:0] := TMP_SRC1[31:0] * POW(2, Floor(TMP_SRC2[31:0]))
}



VSCALEFSS (EVEX encoded version)
IF (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] OR *no writemask*
THEN DEST[31:0] := SCALE(SRC1[31:0], SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VSCALEFSS __m128 _mm_scalef_round_ss(__m128 a, __m128 b, int);
VSCALEFSS __m128 _mm_mask_scalef_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VSCALEFSS __m128 _mm_maskz_scalef_round_ss(__mmask8 k, __m128 a, __m128 b, int);
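
A minimal sketch of the zeroing-masked form (illustrative only, assuming AVX512F toolchain support); when k[0] is 0 the low result element is zeroed rather than merged:

#include <immintrin.h>

float scalef_ss_maskz(float x, float y, __mmask8 k)
{
    __m128 a = _mm_set_ss(x);
    __m128 b = _mm_set_ss(y);
    /* low element := k[0] ? x * 2^floor(y) : 0.0f (zeroing-masking) */
    __m128 r = _mm_maskz_scalef_round_ss(k, a, b, _MM_FROUND_CUR_DIRECTION);
    return _mm_cvtss_f32(r);
}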

SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal (for Src1).
Denormal is not reported for Src2.

Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”



VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD—Scatter Packed Single Precision,
Packed Double Precision Floating-Point Values with Signed Dword and Qword Indices
Opcode/ Op/E 64/32 CPUID Feature Description
Instruction n bit Mode Flag
Support
EVEX.128.66.0F38.W0 A2 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter single-
VSCATTERDPS vm32x {k1}, xmm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.256.66.0F38.W0 A2 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter single-
VSCATTERDPS vm32y {k1}, ymm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.512.66.0F38.W0 A2 /vsib A V/V AVX512F Using signed dword indices, scatter single-
VSCATTERDPS vm32z {k1}, zmm1 OR AVX10.1¹ precision floating-point values to memory using
writemask k1.
EVEX.128.66.0F38.W1 A2 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter double
VSCATTERDPD vm32x {k1}, xmm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.256.66.0F38.W1 A2 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter double
VSCATTERDPD vm32y {k1}, ymm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.512.66.0F38.W1 A2 /vsib A V/V AVX512F Using signed dword indices, scatter double
VSCATTERDPD vm32z {k1}, zmm1 OR AVX10.1¹ precision floating-point values to memory using
writemask k1.
EVEX.128.66.0F38.W0 A3 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter single-
VSCATTERQPS vm64x {k1}, xmm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.256.66.0F38.W0 A3 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter single-
VSCATTERQPS vm64y {k1}, xmm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.512.66.0F38.W0 A3 /vsib A V/V AVX512F Using signed qword indices, scatter single-
VSCATTERQPS vm64z {k1}, ymm1 OR AVX10.1¹ precision floating-point values to memory using
writemask k1.
EVEX.128.66.0F38.W1 A3 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter double
VSCATTERQPD vm64x {k1}, xmm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.256.66.0F38.W1 A3 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter double
VSCATTERQPD vm64y {k1}, ymm1 AVX512F) OR precision floating-point values to memory using
AVX10.1¹ writemask k1.
EVEX.512.66.0F38.W1 A3 /vsib A V/V AVX512F Using signed qword indices, scatter double
VSCATTERQPD vm64z {k1}, zmm1 OR AVX10.1¹ precision floating-point values to memory using
writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar BaseReg (R): VSIB:base, VectorReg (R): VSIB:index ModRM:reg (r) N/A N/A

Description
Stores up to four, eight, or 16 single precision elements (or two, four, or eight double precision elements) in double-
word/quadword vector xmm1, ymm1, or zmm1, to the memory locations pointed to by base address BASE_ADDR
and index vector VINDEX, with scale SCALE. The elements are specified via the VSIB (i.e., the index register is a
vector register, holding packed indices). Elements will only be stored if their corresponding mask bit is one. The
entire mask register will be set to zero by this instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated. If any traps or interrupts are pending from already scat-
tered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction
breakpoint is not re-triggered when the instruction is continued.
Note that:
• Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to
MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not
overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory
ordering model. Note that this does not account for non-overlapping indices that map into the same physical
address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the source register xmm, ymm, or zmm will be completed (and non-faulting).
Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults,
they are delivered in the conventional order.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be scattered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before
the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this
instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different from 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the k0 mask register is specified.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement

VSCATTERDPS (EVEX encoded versions)
(KL, VL)= (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR +SignExtend(VINDEX[i+31:i]) * SCALE + DISP] :=
SRC[i+31:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

VSCATTERDPD (EVEX encoded versions)


(KL, VL)= (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR +SignExtend(VINDEX[k+31:k]) * SCALE + DISP] :=
SRC[i+63:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

VSCATTERQPS (EVEX encoded versions)


(KL, VL)= (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP] :=
SRC[i+31:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

VSCATTERQPD (EVEX encoded versions)


(KL, VL)= (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP] :=
SRC[i+63:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0

Intel C/C++ Compiler Intrinsic Equivalent
VSCATTERDPD void _mm512_i32scatter_pd(void * base, __m512i vdx, __m512d a, int scale);
VSCATTERDPD void _mm512_mask_i32scatter_pd(void * base, __mmask8 k, __m512i vdx, __m512d a, int scale);
VSCATTERDPS void _mm512_i32scatter_ps(void * base, __m512i vdx, __m512 a, int scale);
VSCATTERDPS void _mm512_mask_i32scatter_ps(void * base, __mmask16 k, __m512i vdx, __m512 a, int scale);
VSCATTERQPD void _mm512_i64scatter_pd(void * base, __m512i vdx, __m512d a, int scale);
VSCATTERQPD void _mm512_mask_i64scatter_pd(void * base, __mmask8 k, __m512i vdx, __m512d a, int scale);
VSCATTERQPS void _mm512_i64scatter_ps(void * base, __m512i vdx, __m512 a, int scale);
VSCATTERQPS void _mm512_mask_i64scatter_ps(void * base, __mmask8 k, __m512i vdx, __m512 a, int scale);
VSCATTERDPD void _mm256_i32scatter_pd(void * base, __m256i vdx, __m256d a, int scale);
VSCATTERDPD void _mm256_mask_i32scatter_pd(void * base, __mmask8 k, __m256i vdx, __m256d a, int scale);
VSCATTERDPS void _mm256_i32scatter_ps(void * base, __m256i vdx, __m256 a, int scale);
VSCATTERDPS void _mm256_mask_i32scatter_ps(void * base, __mmask8 k, __m256i vdx, __m256 a, int scale);
VSCATTERQPD void _mm256_i64scatter_pd(void * base, __m256i vdx, __m256d a, int scale);
VSCATTERQPD void _mm256_mask_i64scatter_pd(void * base, __mmask8 k, __m256i vdx, __m256d a, int scale);
VSCATTERQPS void _mm256_i64scatter_ps(void * base, __m256i vdx, __m256 a, int scale);
VSCATTERQPS void _mm256_mask_i64scatter_ps(void * base, __mmask8 k, __m256i vdx, __m256 a, int scale);
VSCATTERDPD void _mm_i32scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERDPD void _mm_mask_i32scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERDPS void _mm_i32scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERDPS void _mm_mask_i32scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
VSCATTERQPD void _mm_i64scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERQPD void _mm_mask_i64scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERQPS void _mm_i64scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERQPS void _mm_mask_i64scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
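
For illustration only (assuming an AVX512F toolchain), the following C sketch scatters 16 single precision values through dword indices; the scale argument of 4 converts each index into a byte offset:

#include <immintrin.h>

/* Write 1.0f to every other element of base[0..30]. The caller must
   guarantee that all addressed elements are writable. */
void scatter_example(float *base)
{
    const int idx[16] = { 0, 2, 4, 6, 8, 10, 12, 14,
                          16, 18, 20, 22, 24, 26, 28, 30 };
    __m512i vindex = _mm512_loadu_si512(idx);
    __m512 vals = _mm512_set1_ps(1.0f);
    _mm512_i32scatter_ps(base, vindex, vals, 4);   /* base[idx[j]] = 1.0f */
}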

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”

VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-Bit
Granularity
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.256.66.0F3A.W0 23 /r ib A V/V (AVX512VL AND Shuffle 128-bit packed single-precision floating-
VSHUFF32X4 ymm1{k1}{z}, ymm2, AVX512F) OR point values selected by imm8 from ymm2 and
ymm3/m256/m32bcst, imm8 AVX10.1¹ ymm3/m256/m32bcst and place results in ymm1
subject to writemask k1.
EVEX.512.66.0F3A.W0 23 /r ib A V/V AVX512F Shuffle 128-bit packed single-precision floating-
VSHUFF32x4 zmm1{k1}{z}, zmm2, OR AVX10.1¹ point values selected by imm8 from zmm2 and
zmm3/m512/m32bcst, imm8 zmm3/m512/m32bcst and place results in zmm1
subject to writemask k1.
EVEX.256.66.0F3A.W1 23 /r ib A V/V (AVX512VL AND Shuffle 128-bit packed double precision floating-
VSHUFF64X2 ymm1{k1}{z}, ymm2, AVX512F) OR point values selected by imm8 from ymm2 and
ymm3/m256/m64bcst, imm8 AVX10.1¹ ymm3/m256/m64bcst and place results in ymm1
subject to writemask k1.
EVEX.512.66.0F3A.W1 23 /r ib A V/V AVX512F Shuffle 128-bit packed double precision floating-
VSHUFF64x2 zmm1{k1}{z}, zmm2, OR AVX10.1¹ point values selected by imm8 from zmm2 and
zmm3/m512/m64bcst, imm8 zmm3/m512/m64bcst and place results in zmm1
subject to writemask k1.
EVEX.256.66.0F3A.W0 43 /r ib A V/V (AVX512VL AND Shuffle 128-bit packed double-word values
VSHUFI32X4 ymm1{k1}{z}, ymm2, AVX512F) OR selected by imm8 from ymm2 and
ymm3/m256/m32bcst, imm8 AVX10.1¹ ymm3/m256/m32bcst and place results in ymm1
subject to writemask k1.
EVEX.512.66.0F3A.W0 43 /r ib A V/V AVX512F Shuffle 128-bit packed double-word values
VSHUFI32x4 zmm1{k1}{z}, zmm2, OR AVX10.1¹ selected by imm8 from zmm2 and
zmm3/m512/m32bcst, imm8 zmm3/m512/m32bcst and place results in zmm1
subject to writemask k1.
EVEX.256.66.0F3A.W1 43 /r ib A V/V (AVX512VL AND Shuffle 128-bit packed quad-word values selected
VSHUFI64X2 ymm1{k1}{z}, ymm2, AVX512F) OR by imm8 from ymm2 and ymm3/m256/m64bcst
ymm3/m256/m64bcst, imm8 AVX10.1¹ and place results in ymm1 subject to writemask k1.
EVEX.512.66.0F3A.W1 43 /r ib A V/V AVX512F Shuffle 128-bit packed quad-word values selected
VSHUFI64x2 zmm1{k1}{z}, zmm2, OR AVX10.1¹ by imm8 from zmm2 and zmm3/m512/m64bcst
zmm3/m512/m64bcst, imm8 and place results in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
256-bit Version: Moves one of the two 128-bit packed single precision floating-point values from the first source
operand (second operand) into the low 128-bit of the destination operand (first operand); moves one of the two
packed 128-bit floating-point values from the second source operand (third operand) into the high 128-bit of the
destination operand. The selector operand (imm8) determines which values are moved to the destination
operand.



512-bit Version: Moves two of the four 128-bit packed single precision floating-point values from the first source
operand (second operand) into the low 256-bit of the destination operand (first operand);
moves two of the four packed 128-bit floating-point values from the second source operand (third operand) into
the high 256-bit of the destination operand. The selector operand (imm8) determines which values are
moved to the destination operand.
The first source operand is a vector register. The second source operand can be a ZMM register, a 512-bit memory
location or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a vector
register.
The writemask updates the destination operand with the granularity of 32/64-bit data elements.

Operation
Select2(SRC, control) {
CASE (control[0]) OF
0: TMP := SRC[127:0];
1: TMP := SRC[255:128];
ESAC;
RETURN TMP
}

Select4(SRC, control) {
CASE (control[1:0]) OF
0: TMP := SRC[127:0];
1: TMP := SRC[255:128];
2: TMP := SRC[383:256];
3: TMP := SRC[511:384];
ESAC;
RETURN TMP
}

VSHUFF32x4 (EVEX encoded versions)


(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL = 256
TMP_DEST[127:0] := Select2(SRC1[255:0], imm8[0]);
TMP_DEST[255:128] := Select2(TMP_SRC2[255:0], imm8[1]);
FI;
IF VL = 512
TMP_DEST[127:0] := Select4(SRC1[511:0], imm8[1:0]);
TMP_DEST[255:128] := Select4(SRC1[511:0], imm8[3:2]);
TMP_DEST[383:256] := Select4(TMP_SRC2[511:0], imm8[5:4]);
TMP_DEST[511:384] := Select4(TMP_SRC2[511:0], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking



THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
THEN DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFF64x2 (EVEX encoded versions)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL = 256
TMP_DEST[127:0] := Select2(SRC1[255:0], imm8[0]);
TMP_DEST[255:128] := Select2(TMP_SRC2[255:0], imm8[1]);
FI;
IF VL = 512
TMP_DEST[127:0] := Select4(SRC1[511:0], imm8[1:0]);
TMP_DEST[255:128] := Select4(SRC1[511:0], imm8[3:2]);
TMP_DEST[383:256] := Select4(TMP_SRC2[511:0], imm8[5:4]);
TMP_DEST[511:384] := Select4(TMP_SRC2[511:0], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
THEN DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFI32x4 (EVEX encoded versions)


(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0]
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i]
FI;
ENDFOR;
IF VL = 256
TMP_DEST[127:0] := Select2(SRC1[255:0], imm8[0]);
TMP_DEST[255:128] := Select2(TMP_SRC2[255:0], imm8[1]);
FI;



IF VL = 512
TMP_DEST[127:0] := Select4(SRC1[511:0], imm8[1:0]);
TMP_DEST[255:128] := Select4(SRC1[511:0], imm8[3:2]);
TMP_DEST[383:256] := Select4(TMP_SRC2[511:0], imm8[5:4]);
TMP_DEST[511:384] := Select4(TMP_SRC2[511:0], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
THEN DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VSHUFI64x2 (EVEX encoded versions)


(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL = 256
TMP_DEST[127:0] := Select2(SRC1[255:0], imm8[0]);
TMP_DEST[255:128] := Select2(TMP_SRC2[255:0], imm8[1]);
FI;
IF VL = 512
TMP_DEST[127:0] := Select4(SRC1[511:0], imm8[1:0]);
TMP_DEST[255:128] := Select4(SRC1[511:0], imm8[3:2]);
TMP_DEST[383:256] := Select4(TMP_SRC2[511:0], imm8[5:4]);
TMP_DEST[511:384] := Select4(TMP_SRC2[511:0], imm8[7:6]);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
THEN DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VSHUFI32x4 __m512i _mm512_shuffle_i32x4(__m512i a, __m512i b, int imm);
VSHUFI32x4 __m512i _mm512_mask_shuffle_i32x4(__m512i s, __mmask16 k, __m512i a, __m512i b, int imm);
VSHUFI32x4 __m512i _mm512_maskz_shuffle_i32x4( __mmask16 k, __m512i a, __m512i b, int imm);
VSHUFI32x4 __m256i _mm256_shuffle_i32x4(__m256i a, __m256i b, int imm);
VSHUFI32x4 __m256i _mm256_mask_shuffle_i32x4(__m256i s, __mmask8 k, __m256i a, __m256i b, int imm);
VSHUFI32x4 __m256i _mm256_maskz_shuffle_i32x4( __mmask8 k, __m256i a, __m256i b, int imm);
VSHUFF32x4 __m512 _mm512_shuffle_f32x4(__m512 a, __m512 b, int imm);
VSHUFF32x4 __m512 _mm512_mask_shuffle_f32x4(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);
VSHUFF32x4 __m512 _mm512_maskz_shuffle_f32x4( __mmask16 k, __m512 a, __m512 b, int imm);
VSHUFI64x2 __m512i _mm512_shuffle_i64x2(__m512i a, __m512i b, int imm);
VSHUFI64x2 __m512i _mm512_mask_shuffle_i64x2(__m512i s, __mmask8 k, __m512i a, __m512i b, int imm);
VSHUFI64x2 __m512i _mm512_maskz_shuffle_i64x2( __mmask8 k, __m512i a, __m512i b, int imm);
VSHUFF64x2 __m512d _mm512_shuffle_f64x2(__m512d a, __m512d b, int imm);
VSHUFF64x2 __m512d _mm512_mask_shuffle_f64x2(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFF64x2 __m512d _mm512_maskz_shuffle_f64x2( __mmask8 k, __m512d a, __m512d b, int imm);
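
As a non-normative illustration of the imm8 encoding (assuming an AVX512F toolchain), selector 0x4E (binary 01 00 11 10) places the upper 256 bits of the first source in the low half of the result and the lower 256 bits of the second source in the high half:

#include <immintrin.h>

__m512 cross_halves(__m512 a, __m512 b)
{
    /* imm8[1:0] = 2 and imm8[3:2] = 3 pick 128-bit lanes 2 and 3 of a;
       imm8[5:4] = 0 and imm8[7:6] = 1 pick 128-bit lanes 0 and 1 of b. */
    return _mm512_shuffle_f32x4(a, b, 0x4E);
}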

SIMD Floating-Point Exceptions


None.

Other Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.L’L = 0 for VSHUFF32x4/VSHUFF64x2.



VSQRTPH—Compute Square Root of Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 51 /r A V/V (AVX512-FP16 Compute square roots of the packed FP16 values
VSQRTPH xmm1{k1}{z}, AND AVX512VL) in xmm2/m128/m16bcst, and store the result in
xmm2/m128/m16bcst OR AVX10.1¹ xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 51 /r A V/V (AVX512-FP16 Compute square roots of the packed FP16 values
VSQRTPH ymm1{k1}{z}, AND AVX512VL) in ymm2/m256/m16bcst, and store the result in
ymm2/m256/m16bcst OR AVX10.1¹ ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 51 /r A V/V AVX512-FP16 Compute square roots of the packed FP16 values
VSQRTPH zmm1{k1}{z}, OR AVX10.1¹ in zmm2/m512/m16bcst, and store the result in
zmm2/m512/m16bcst {er} zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) ModRM:r/m (r) N/A N/A

Description
This instruction performs a packed FP16 square-root computation on the values from the source operand and stores
the packed FP16 result in the destination operand. The destination elements are updated according to the write-
mask.

Operation
VSQRTPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16

FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := SQRT(tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged

DEST[MAXVL-1:VL] := 0



Intel C/C++ Compiler Intrinsic Equivalent
VSQRTPH __m128h _mm_mask_sqrt_ph (__m128h src, __mmask8 k, __m128h a);
VSQRTPH __m128h _mm_maskz_sqrt_ph (__mmask8 k, __m128h a);
VSQRTPH __m128h _mm_sqrt_ph (__m128h a);
VSQRTPH __m256h _mm256_mask_sqrt_ph (__m256h src, __mmask16 k, __m256h a);
VSQRTPH __m256h _mm256_maskz_sqrt_ph (__mmask16 k, __m256h a);
VSQRTPH __m256h _mm256_sqrt_ph (__m256h a);
VSQRTPH __m512h _mm512_mask_sqrt_ph (__m512h src, __mmask32 k, __m512h a);
VSQRTPH __m512h _mm512_maskz_sqrt_ph (__mmask32 k, __m512h a);
VSQRTPH __m512h _mm512_sqrt_ph (__m512h a);
VSQRTPH __m512h _mm512_mask_sqrt_round_ph (__m512h src, __mmask32 k, __m512h a, const int rounding);
VSQRTPH __m512h _mm512_maskz_sqrt_round_ph (__mmask32 k, __m512h a, const int rounding);
VSQRTPH __m512h _mm512_sqrt_round_ph (__m512h a, const int rounding);
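
An illustrative sketch only (assuming a toolchain with AVX512-FP16 support, e.g., -mavx512fp16, which provides the _Float16 type):

#include <immintrin.h>

__m512h sqrt_ph_example(void)
{
    __m512h x = _mm512_set1_ph((_Float16)9.0);
    return _mm512_sqrt_ph(x);   /* each of the 32 FP16 elements becomes 3.0 */
}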

SIMD Floating-Point Exceptions


Invalid, Precision, Denormal.

Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



VSQRTSH—Compute Square Root of Scalar FP16 Value
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 51 /r A V/V AVX512-FP16 Compute square root of the low FP16 value in
VSQRTSH xmm1{k1}{z}, xmm2, OR AVX10.1¹ xmm3/m16 and store the result in xmm1 subject
xmm3/m16 {er} to writemask k1. Bits 127:16 from xmm2 are
copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction performs a scalar FP16 square-root computation on the source operand and stores the FP16 result
in the destination operand. Bits 127:16 of the destination operand are copied from the corresponding bits of the
first source operand. Bits MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the desti-
nation is updated according to the writemask.

Operation
VSQRTSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
DEST.fp16[0] := SQRT(src2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged

DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VSQRTSH __m128h _mm_mask_sqrt_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, const int rounding);
VSQRTSH __m128h _mm_maskz_sqrt_round_sh (__mmask8 k, __m128h a, __m128h b, const int rounding);
VSQRTSH __m128h _mm_sqrt_round_sh (__m128h a, __m128h b, const int rounding);
VSQRTSH __m128h _mm_mask_sqrt_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VSQRTSH __m128h _mm_maskz_sqrt_sh (__mmask8 k, __m128h a, __m128h b);
VSQRTSH __m128h _mm_sqrt_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VSUBPH—Subtract Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 5C /r A V/V (AVX512-FP16 Subtract packed FP16 values from
VSUBPH xmm1{k1}{z}, xmm2, AND AVX512VL) xmm3/m128/m16bcst to xmm2, and store the
xmm3/m128/m16bcst OR AVX10.1¹ result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 5C /r A V/V (AVX512-FP16 Subtract packed FP16 values from
VSUBPH ymm1{k1}{z}, ymm2, AND AVX512VL) ymm3/m256/m16bcst to ymm2, and store the
ymm3/m256/m16bcst OR AVX10.1¹ result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 5C /r A V/V AVX512-FP16 Subtract packed FP16 values from
VSUBPH zmm1{k1}{z}, zmm2, OR AVX10.1¹ zmm3/m512/m16bcst to zmm2, and store the
zmm3/m512/m16bcst {er} result in zmm1 subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction subtracts the packed FP16 values of the second source operand from the corresponding elements in the
first source operand, storing the packed FP16 result in the destination operand. The destination elements are
updated according to the writemask.

Operation
VSUBPH (EVEX encoded versions) when src2 operand is a register
VL = 128, 256 or 512
KL := VL/16

IF (VL = 512) AND (EVEX.b = 1):


SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0



VSUBPH (EVEX encoded versions) when src2 operand is a memory source
VL = 128, 256 or 512
KL := VL/16

FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged

DEST[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VSUBPH __m128h _mm_mask_sub_ph (__m128h src, __mmask8 k, __m128h a, __m128h b);
VSUBPH __m128h _mm_maskz_sub_ph (__mmask8 k, __m128h a, __m128h b);
VSUBPH __m128h _mm_sub_ph (__m128h a, __m128h b);
VSUBPH __m256h _mm256_mask_sub_ph (__m256h src, __mmask16 k, __m256h a, __m256h b);
VSUBPH __m256h _mm256_maskz_sub_ph (__mmask16 k, __m256h a, __m256h b);
VSUBPH __m256h _mm256_sub_ph (__m256h a, __m256h b);
VSUBPH __m512h _mm512_mask_sub_ph (__m512h src, __mmask32 k, __m512h a, __m512h b);
VSUBPH __m512h _mm512_maskz_sub_ph (__mmask32 k, __m512h a, __m512h b);
VSUBPH __m512h _mm512_sub_ph (__m512h a, __m512h b);
VSUBPH __m512h _mm512_mask_sub_round_ph (__m512h src, __mmask32 k, __m512h a, __m512h b, int rounding);
VSUBPH __m512h _mm512_maskz_sub_round_ph (__mmask32 k, __m512h a, __m512h b, int rounding);
VSUBPH __m512h _mm512_sub_round_ph (__m512h a, __m512h b, int rounding);
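
A short, non-normative sketch of merging-masking with this instruction (assuming AVX512-FP16 toolchain support); elements whose mask bit is 0 are taken from src rather than recomputed:

#include <immintrin.h>

__m512h sub_ph_masked(__m512h src, __mmask32 k, __m512h a, __m512h b)
{
    /* dest[j] := k[j] ? a[j] - b[j] : src[j] (merging-masking) */
    return _mm512_mask_sub_ph(src, k, a, b);
}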

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”



VSUBSH—Subtract Scalar FP16 Value
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 5C /r A V/V AVX512-FP16 Subtract the low FP16 value in xmm3/m16 from
VSUBSH xmm1{k1}{z}, xmm2, OR AVX10.1¹ xmm2 and store the result in xmm1 subject to
xmm3/m16 {er} writemask k1. Bits 127:16 from xmm2 are
copied to xmm1[127:16].

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
This instruction subtracts the low FP16 value of the second source operand from the corresponding value in the
first source operand, storing the FP16 result in the destination operand. Bits 127:16 of the destination operand are
copied from the corresponding bits of the first source operand. Bits MAXVL-1:128 of the destination operand are
zeroed. The low FP16 element of the destination is updated according to the writemask.

Operation
VSUBSH (EVEX encoded versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)

IF k1[0] OR *no writemask*:


DEST.fp16[0] := SRC1.fp16[0] - SRC2.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0

Intel C/C++ Compiler Intrinsic Equivalent


VSUBSH __m128h _mm_mask_sub_round_sh (__m128h src, __mmask8 k, __m128h a, __m128h b, int rounding);
VSUBSH __m128h _mm_maskz_sub_round_sh (__mmask8 k, __m128h a, __m128h b, int rounding);
VSUBSH __m128h _mm_sub_round_sh (__m128h a, __m128h b, int rounding);
VSUBSH __m128h _mm_mask_sub_sh (__m128h src, __mmask8 k, __m128h a, __m128h b);
VSUBSH __m128h _mm_maskz_sub_sh (__mmask8 k, __m128h a, __m128h b);
VSUBSH __m128h _mm_sub_sh (__m128h a, __m128h b);

SIMD Floating-Point Exceptions


Invalid, Underflow, Overflow, Precision, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”



VUCOMISH—Unordered Compare Scalar FP16 Values and Set EFLAGS
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.NP.MAP5.W0 2E /r A V/V AVX512-FP16 Compare low FP16 values in xmm1 and
VUCOMISH xmm1, xmm2/m16 {sae} OR AVX10.1¹ xmm2/m16 and set the EFLAGS flags accordingly.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Scalar ModRM:reg (r) ModRM:r/m (r) N/A N/A

Description
This instruction compares the FP16 values in the low word of operand 1 (first operand) and operand 2 (second
operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than,
less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned
if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 16-bit memory location.
The VUCOMISH instruction differs from the VCOMISH instruction in that it signals a SIMD floating-point invalid oper-
ation exception (#I) only if a source operand is an SNaN. The VCOMISH instruction signals an invalid numeric excep-
tion when a source operand is either a QNaN or an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. EVEX.vvvv is
reserved and must be 1111b; otherwise, instructions will #UD.

Operation
VUCOMISH
RESULT := UnorderedCompare(SRC1.fp16[0],SRC2.fp16[0])
if RESULT is UNORDERED:
ZF, PF, CF := 1, 1, 1
else if RESULT is GREATER_THAN:
ZF, PF, CF := 0, 0, 0
else if RESULT is LESS_THAN:
ZF, PF, CF := 0, 0, 1
else: // RESULT is EQUAL
ZF, PF, CF := 1, 0, 0

OF, AF, SF := 0, 0, 0

Intel C/C++ Compiler Intrinsic Equivalent


VUCOMISH int _mm_ucomieq_sh (__m128h a, __m128h b);
VUCOMISH int _mm_ucomige_sh (__m128h a, __m128h b);
VUCOMISH int _mm_ucomigt_sh (__m128h a, __m128h b);
VUCOMISH int _mm_ucomile_sh (__m128h a, __m128h b);
VUCOMISH int _mm_ucomilt_sh (__m128h a, __m128h b);
VUCOMISH int _mm_ucomineq_sh (__m128h a, __m128h b);
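
A non-normative usage sketch (assuming AVX512-FP16 toolchain support); each intrinsic reduces the EFLAGS-based comparison to a 0/1 int:

#include <immintrin.h>

int fp16_less_than(_Float16 x, _Float16 y)
{
    __m128h a = _mm_set_sh(x);
    __m128h b = _mm_set_sh(y);
    return _mm_ucomilt_sh(a, b);   /* 1 if x < y, else 0 */
}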

SIMD Floating-Point Exceptions
Invalid, Denormal.

Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”

9. Updates to Chapter 6, Volume 2D
Change bars and violet text show changes to Chapter 6 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2D: Instruction Set Reference, W-Z.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added the following instructions: WRMSRLIST and WRMSRNS.
• Added Intel® AVX10.1 information to the following instructions:
— XORPD
— XORPS



WRMSRLIST—Write List of Model Specific Registers
Opcode / Op/ 64/32 bit CPUID Feature Flag Description
Instruction En Mode
Support
F3 0F 01 C6 ZO V/N.E. MSRLIST Write requested list of MSRs with the values
WRMSRLIST specified in memory.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A

Description
This instruction writes a software-provided list of up to 64 MSRs with values loaded from memory.
WRMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address).¹
• RDI: Linear address of a table from which MSR data is loaded (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, WRMSRLIST will write the MSR specified at entry [n] in the RSI-
based table with the value read from memory at the entry [n] in the RDI-based table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, WRMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the saved RIP will point to the WRMSRLIST instruction
while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR-specific checks in the same manner as WRMSR.
Like WRMSRNS (and unlike WRMSR), WRMSRLIST is not defined as a serializing instruction (see “Serializing
Instructions” in Chapter 9 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This
means that software should not rely on WRMSRLIST to drain all buffered writes to memory before the next instruc-
tion is fetched and executed. For implementation reasons, some processors may serialize when writing certain
MSRs, even though that is not guaranteed.
Like WRMSR and WRMSRNS, WRMSRLIST ensures that all operations before WRMSRLIST do not use any new MSR
value and that all operations after WRMSRLIST do use the new values. An exception to this rule is certain store
related performance-monitor events that only count stores when they are drained to memory. Since WRMSRLIST
is not a serializing instruction, if software uses WRMSRLIST to change the controls for such performance-monitor
events, stores issued before WRMSRLIST may be counted based on the controls established by WRMSRLIST. Soft-
ware can insert the SERIALIZE instruction before the WRMSRLIST if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRLIST.
In places where WRMSR is being used as a proxy for a serializing instruction, a different serializing instruction can
be used (e.g., SERIALIZE).
WRMSRLIST writes MSRs in order, which means the processor will ensure that an MSR in iteration “n” will be
written only after previous iterations (“n-1”). If the older MSR writes had a side effect that affects the behavior of
the next MSR, the processor will ensure that side effect is honored.
The processor is allowed (but not required) to “load ahead” in the list. The following are examples of things the
processor may do:
• Use an old memory type or TLB entry for loads or stores to memory containing the tables despite an MSR
written by a previous iteration changing MTRR or invalidating TLBs.

1. Since MSR addresses are only 32 bits wide, bits 63:32 of each MSR address table entry are reserved.



• Cause a page fault for access to a table entry after the nth, despite the processor having written only n MSRs.¹

Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
Load MSR_data from 8 bytes at the linear address RDI + (MSR_index * 8);
IF WRMSR of MSR_data to the MSR with address MSR_address would #GP THEN #GP(0); FI;
Load the MSR with address MSR_address with MSR_data;
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;
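
The following ring-0 C sketch is illustrative only: it assumes GNU inline assembly and encodes the instruction with .byte in case the assembler lacks the mnemonic. Both tables must be 8-byte aligned, and on return the mask reflects any partial completion:

#include <stdint.h>

/* Write the MSRs selected by mask; addrs[n] holds the MSR address and
   vals[n] the value for bit n. Requires CPL 0 and MSRLIST enumeration. */
static inline uint64_t wrmsrlist(const uint64_t *addrs, const uint64_t *vals,
                                 uint64_t mask)
{
    __asm__ volatile(".byte 0xf3, 0x0f, 0x01, 0xc6"   /* WRMSRLIST */
                     : "+c"(mask)                     /* RCX: valid-bit mask */
                     : "S"(addrs), "D"(vals)          /* RSI and RDI: tables */
                     : "memory");
    return mask;   /* 0 on full completion */
}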

Flags Affected
None.

Protected Mode Exceptions


#UD The WRMSRLIST instruction is not recognized in protected mode.

Real-Address Mode Exceptions


#UD The WRMSRLIST instruction is not recognized in real-address mode.

Virtual-8086 Mode Exceptions


#UD The WRMSRLIST instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions


#UD The WRMSRLIST instruction is not recognized in compatibility mode.

64-Bit Mode Exceptions


#GP(0) If the current privilege level is not 0.
If RSI[2:0] ≠ 0, RDI[2:0] ≠ 0, or bits 63:32 of an MSR-address table entry are not all zero.
If an execution of WRMSR to a specified MSR with a specified value would generate a general-
protection exception (#GP(0)).
#UD If the LOCK prefix is used.
If CPUID.(EAX=07H, ECX=01H):EAX.MSRLIST[bit 27] = 0.

1. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only
having completed the MSR writes up to entry 5.



WRMSRNS—Non-Serializing Write to Model Specific Register
Opcode/ Op/ 64/32 Bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 01 C6 ZO V/V WRMSRNS Write the value in EDX:EAX to MSR specified by
WRMSRNS ECX.

Instruction Operand Encoding


Op/En Operand 1 Operand 2 Operand 3 Operand 4
ZO N/A N/A N/A N/A

Description
WRMSRNS is an instruction that behaves like WRMSR except that it is not a serializing instruction by default. It can
be executed only at privilege level 0 or in real-address mode; otherwise, a general protection exception #GP(0) is
generated.
The instruction writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in
the ECX register. The contents of the EDX register are copied to the high-order 32 bits of the selected MSR and the
contents of the EAX register are copied to the low-order 32 bits of the MSR. The high-order 32 bits of RAX, RCX,
and RDX are ignored.
Unlike WRMSR, WRMSRNS is not defined as a serializing instruction (see “Serializing Instructions” in Chapter 9 of
the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This means that software should
not rely on it to drain all buffered writes to memory before the next instruction is fetched and executed. For imple-
mentation reasons, some processors may serialize when writing certain MSRs, even though that is not guaranteed.
Like WRMSR, WRMSRNS will ensure that all operations before it do not use the new MSR value and that all opera-
tions after the WRMSRNS do use the new value. An exception to this rule is certain store related performance-
monitor events that only count stores when they are drained to memory. Since WRMSRNS is not a serializing
instruction, if software uses WRMSRNS to change the controls for such performance-monitor events, stores issued
before WRMSRNS may be counted based on the controls established by WRMSRNS. Software can insert the
SERIALIZE instruction before the WRMSRNS if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRNS.
In order to improve performance, software may replace WRMSR with WRMSRNS. In places where WRMSR is being
used as a proxy for a serializing instruction, a different serializing instruction can be used (e.g., SERIALIZE).

Operation
MSR[ECX] := EDX:EAX;
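
A non-normative ring-0 sketch using GNU inline assembly; the .byte sequence encodes WRMSRNS for assemblers that do not yet know the mnemonic:

#include <stdint.h>

static inline void wrmsrns(uint32_t msr, uint64_t value)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);
    __asm__ volatile(".byte 0x0f, 0x01, 0xc6"   /* WRMSRNS */
                     :
                     : "c"(msr), "a"(lo), "d"(hi)
                     : "memory");
}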

Flags Affected
None.



Protected Mode Exceptions
#GP(0) If the current privilege level is not 0.
If the specified MSR address is reserved or unimplemented.
If the source data sets bits that are reserved in the specified MSR.
If the source data contains a non-canonical address and the specified MSR is one of the
following: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR,
IA32_PL0_SSP, IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A,
IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A,
IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_S_CET,
IA32_SYSENTER_EIP, IA32_SYSENTER_ESP, IA32_UINTR_HANDLER, IA32_UINTR_PD,
IA32_UINTR_STACKADJUST, IA32_U_CET, and IA32_UINTR_TT.
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions


#GP(0) If the specified MSR address is reserved or unimplemented.
If the source data sets bits that are reserved in the specified MSR.
If the source data contains a non-canonical address and the specified MSR is one of the
following: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR,
IA32_PL0_SSP, IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A,
IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A,
IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_S_CET,
IA32_SYSENTER_EIP, IA32_SYSENTER_ESP, IA32_UINTR_HANDLER, IA32_UINTR_PD,
IA32_UINTR_STACKADJUST, IA32_U_CET, and IA32_UINTR_TT.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions


#GP(0) The WRMSRNS instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions


Same exceptions as in protected mode.

64-Bit Mode Exceptions


#GP(0) If the current privilege level is not 0.
If the specified MSR address is reserved or unimplemented.
If the source data sets bits that are reserved in the specified MSR.
If the source data contains a non-canonical address and the specified MSR is one of the
following: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR,
IA32_PL0_SSP, IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A,
IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A,
IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_S_CET,
IA32_SYSENTER_EIP, IA32_SYSENTER_ESP, IA32_UINTR_HANDLER, IA32_UINTR_PD,
IA32_UINTR_STACKADJUST, IA32_U_CET, and IA32_UINTR_TT.
#UD If the LOCK prefix is used.

XORPD—Bitwise Logical XOR of Packed Double Precision Floating-Point Values
Each entry below lists the opcode/instruction, operand encoding (Op/En), 64/32-bit mode support, CPUID feature flag, and description.

66 0F 57 /r
XORPD xmm1, xmm2/m128
    Op/En: A. 64/32-bit mode: V/V. CPUID: SSE2.
    Return the bitwise logical XOR of packed double precision floating-point values in xmm1 and xmm2/mem.

VEX.128.66.0F.WIG 57 /r
VXORPD xmm1, xmm2, xmm3/m128
    Op/En: B. 64/32-bit mode: V/V. CPUID: AVX.
    Return the bitwise logical XOR of packed double precision floating-point values in xmm2 and xmm3/mem.

VEX.256.66.0F.WIG 57 /r
VXORPD ymm1, ymm2, ymm3/m256
    Op/En: B. 64/32-bit mode: V/V. CPUID: AVX.
    Return the bitwise logical XOR of packed double precision floating-point values in ymm2 and ymm3/mem.

EVEX.128.66.0F.W1 57 /r
VXORPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed double precision floating-point values in xmm2 and xmm3/m128/m64bcst subject to writemask k1.

EVEX.256.66.0F.W1 57 /r
VXORPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed double precision floating-point values in ymm2 and ymm3/m256/m64bcst subject to writemask k1.

EVEX.512.66.0F.W1 57 /r
VXORPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: AVX512DQ OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed double precision floating-point values in zmm2 and zmm3/m512/m64bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical XOR of the two, four, or eight packed double precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM
register or a vector memory location. The destination operand is a ZMM register conditionally updated with write-
mask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:256) of the corresponding ZMM register destination
are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:128) of the corresponding ZMM register destination
are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation
VXORPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
    i := j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[63:0];
            ELSE DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[i+63:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] := 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VXORPD (VEX.256 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[191:128] := SRC1[191:128] BITWISE XOR SRC2[191:128]
DEST[255:192] := SRC1[255:192] BITWISE XOR SRC2[255:192]
DEST[MAXVL-1:256] := 0

VXORPD (VEX.128 Encoded Version)


DEST[63:0] := SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] := SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[MAXVL-1:128] := 0

XORPD (128-bit Legacy SSE Version)


DEST[63:0] := DEST[63:0] BITWISE XOR SRC[63:0]
DEST[127:64] := DEST[127:64] BITWISE XOR SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent


VXORPD __m512d _mm512_xor_pd (__m512d a, __m512d b);
VXORPD __m512d _mm512_mask_xor_pd (__m512d src, __mmask8 k, __m512d a, __m512d b);
VXORPD __m512d _mm512_maskz_xor_pd (__mmask8 k, __m512d a, __m512d b);
VXORPD __m256d _mm256_xor_pd (__m256d a, __m256d b);
VXORPD __m256d _mm256_mask_xor_pd (__m256d src, __mmask8 k, __m256d a, __m256d b);
VXORPD __m256d _mm256_maskz_xor_pd (__mmask8 k, __m256d a, __m256d b);
XORPD __m128d _mm_xor_pd (__m128d a, __m128d b);
VXORPD __m128d _mm_mask_xor_pd (__m128d src, __mmask8 k, __m128d a, __m128d b);
VXORPD __m128d _mm_maskz_xor_pd (__mmask8 k, __m128d a, __m128d b);
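As an illustrative use of the legacy intrinsic (an example, not part of this reference), XORing with a mask whose only set bits are the sign bits negates both doubles in a vector at once:

#include <emmintrin.h>  /* SSE2 */
#include <stdio.h>

int main(void)
{
    __m128d v    = _mm_set_pd(2.5, -1.0);  /* elements {-1.0, 2.5} */
    __m128d mask = _mm_set1_pd(-0.0);      /* only the sign bit of each element */
    __m128d neg  = _mm_xor_pd(v, mask);    /* compiles to XORPD: flips both signs */
    double out[2];
    _mm_storeu_pd(out, neg);
    printf("%f %f\n", out[0], out[1]);     /* prints 1.000000 -2.500000 */
    return 0;
}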

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E4 Class Exception Conditions.”

XORPS—Bitwise Logical XOR of Packed Single Precision Floating-Point Values
Each entry below lists the opcode/instruction, operand encoding (Op/En), 64/32-bit mode support, CPUID feature flag, and description.

NP 0F 57 /r
XORPS xmm1, xmm2/m128
    Op/En: A. 64/32-bit mode: V/V. CPUID: SSE.
    Return the bitwise logical XOR of packed single precision floating-point values in xmm1 and xmm2/mem.

VEX.128.0F.WIG 57 /r
VXORPS xmm1, xmm2, xmm3/m128
    Op/En: B. 64/32-bit mode: V/V. CPUID: AVX.
    Return the bitwise logical XOR of packed single precision floating-point values in xmm2 and xmm3/mem.

VEX.256.0F.WIG 57 /r
VXORPS ymm1, ymm2, ymm3/m256
    Op/En: B. 64/32-bit mode: V/V. CPUID: AVX.
    Return the bitwise logical XOR of packed single precision floating-point values in ymm2 and ymm3/mem.

EVEX.128.0F.W0 57 /r
VXORPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed single precision floating-point values in xmm2 and xmm3/m128/m32bcst subject to writemask k1.

EVEX.256.0F.W0 57 /r
VXORPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed single precision floating-point values in ymm2 and ymm3/m256/m32bcst subject to writemask k1.

EVEX.512.0F.W0 57 /r
VXORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
    Op/En: C. 64/32-bit mode: V/V. CPUID: AVX512DQ OR AVX10.1 (see Note 1).
    Return the bitwise logical XOR of packed single precision floating-point values in zmm2 and zmm3/m512/m32bcst subject to writemask k1.

NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.

Instruction Operand Encoding


Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A N/A ModRM:reg (r, w) ModRM:r/m (r) N/A N/A
B N/A ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) N/A
C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) N/A

Description
Performs a bitwise logical XOR of the four, eight, or sixteen packed single precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM
register or a vector memory location. The destination operand is a ZMM register conditionally updated with write-
mask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:256) of the corresponding ZMM register destination
are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:128) of the corresponding ZMM register destination
are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation
VXORPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
    i := j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[31:0];
            ELSE DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[i+31:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] := 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] := 0

VXORPS (VEX.256 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[159:128] := SRC1[159:128] BITWISE XOR SRC2[159:128]
DEST[191:160] := SRC1[191:160] BITWISE XOR SRC2[191:160]
DEST[223:192] := SRC1[223:192] BITWISE XOR SRC2[223:192]
DEST[255:224] := SRC1[255:224] BITWISE XOR SRC2[255:224]
DEST[MAXVL-1:256] := 0

VXORPS (VEX.128 Encoded Version)


DEST[31:0] := SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[MAXVL-1:128] := 0

XORPS (128-bit Legacy SSE Version)


DEST[31:0] := SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] := SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] := SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] := SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VXORPS __m512 _mm512_xor_ps (__m512 a, __m512 b);
VXORPS __m512 _mm512_mask_xor_ps (__m512 src, __mmask16 k, __m512 a, __m512 b);
VXORPS __m512 _mm512_maskz_xor_ps (__mmask16 k, __m512 a, __m512 b);
VXORPS __m256 _mm256_xor_ps (__m256 a, __m256 b);
VXORPS __m256 _mm256_mask_xor_ps (__m256 src, __mmask8 k, __m256 a, __m256 b);
VXORPS __m256 _mm256_maskz_xor_ps (__mmask8 k, __m256 a, __m256 b);
XORPS __m128 _mm_xor_ps (__m128 a, __m128 b);
VXORPS __m128 _mm_mask_xor_ps (__m128 src, __mmask8 k, __m128 a, __m128 b);
VXORPS __m128 _mm_maskz_xor_ps (__mmask8 k, __m128 a, __m128 b);
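As an illustration (an example, not part of this reference), XORPS of a register with itself is the standard zeroing idiom, and a sign-bit mask negates all four floats at once:

#include <xmmintrin.h>  /* SSE */
#include <stdio.h>

int main(void)
{
    __m128 v    = _mm_set_ps(4.0f, -3.0f, 2.0f, -1.0f);
    __m128 zero = _mm_xor_ps(v, v);                   /* XORPS x,x: all elements 0.0f */
    __m128 neg  = _mm_xor_ps(v, _mm_set1_ps(-0.0f));  /* flip all four signs */
    float z[4], n[4];
    _mm_storeu_ps(z, zero);
    _mm_storeu_ps(n, neg);
    printf("%f %f %f %f (zero lane: %f)\n", n[0], n[1], n[2], n[3], z[0]);
    return 0;
}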

SIMD Floating-Point Exceptions


None.

Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E4 Class Exception Conditions.”

10. Updates to Chapter 1, Volume 3A
Change bars and violet text show changes to Chapter 1 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 1.1, “Overview of the System Programming Guide,” with the newly added Chapter 4, “Linear-
Address Pre-Processing.”

CHAPTER 1
ABOUT THIS MANUAL

The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part
1 (order number 253668), the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System
Programming Guide, Part 2 (order number 253669), the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3C: System Programming Guide, Part 3 (order number 326019), and the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3D: System Programming Guide, Part 4 (order number
332831) are part of a set that describes the architecture and programming environment of Intel 64 and IA-32
Architecture processors. The other volumes in this set are:
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665).
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C, & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018, and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers
(order number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C, & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C, & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, address the programming
environment for classes of software that host operating systems. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 4, describes the model-specific registers of Intel 64 and IA-32 processors.

1.1 OVERVIEW OF THE SYSTEM PROGRAMMING GUIDE


A description of this manual’s content follows:
Chapter 1 — About This Manual. Gives an overview of all volumes of the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, with chapter-specific details for the current volume.
Chapter 2 — System Architecture Overview. Describes the modes of operation used by Intel 64 and IA-32
processors and the mechanisms provided by the architectures to support operating systems and executives,
including the system-oriented registers and data structures and the system-oriented instructions. The steps neces-
sary for switching between real-address and protected modes are also identified.
Chapter 3 — Protected-Mode Memory Management. Describes the data structures, registers, and instructions
that support segmentation and paging. The chapter explains how they can be used to implement a “flat” (unseg-
mented) memory model or a segmented memory model.
Chapter 4 — Linear-Address Pre-Processing. Describes the processing to which linear addresses are subject prior to translation. This includes fault checking for linear-address-space separation (LASS) and canonicality, as well as linear-address masking (LAM).
Chapter 5 — Paging. Describes the paging modes supported by Intel 64 and IA-32 processors.
Chapter 6 — Protection. Describes the support for page and segment protection provided in the Intel 64 and IA-
32 architectures. This chapter also explains the implementation of privilege rules, stack switching, pointer valida-
tion, user mode, and supervisor mode.
Chapter 7 — Interrupt and Exception Handling. Describes the basic interrupt mechanisms defined in the Intel
64 and IA-32 architectures, shows how interrupts and exceptions relate to protection, and describes how the archi-
tecture handles each exception type. Reference information for each exception is given in this chapter. This chapter also covers programming the LINT0 and LINT1 inputs and gives an example of how to program the LINT0 and LINT1 pins for specific interrupt vectors.
Chapter 8 — User Interrupts. Describes user interrupts supported by Intel 64 and IA-32 processors.
Chapter 9 — Task Management. Describes mechanisms the Intel 64 and IA-32 architectures provide to support
multitasking and inter-task protection.
Chapter 10 — Multiple-Processor Management. Describes the instructions and flags that support multiple
processors with shared memory, memory ordering, and Intel® Hyper-Threading Technology. Includes MP initializa-
tion for P6 family processors and gives an example of how to use the MP protocol to boot P6 family processors in an
MP system.
Chapter 11 — Processor Management and Initialization. Defines the state of an Intel 64 or IA-32 processor
after reset initialization. This chapter also explains how to set up an Intel 64 or IA-32 processor for real-address
mode operation and protected-mode operation, and how to switch between modes.
Chapter 12 — Advanced Programmable Interrupt Controller (APIC). Describes the programming interface
to the local APIC and gives an overview of the interface between the local APIC and the I/O APIC. Includes APIC bus
message formats and describes the message formats for messages transmitted on the APIC bus for P6 family and
Pentium processors.
Chapter 13 — Memory Cache Control. Describes the general concept of caching and the caching mechanisms
supported by the Intel 64 or IA-32 architectures. This chapter also describes the memory type range registers
(MTRRs) and how they can be used to map memory types of physical memory. Information on using the new cache
control and memory streaming instructions introduced with the Pentium III, Pentium 4, and Intel Xeon processors is
also given.
Chapter 14 — Intel® MMX™ Technology System Programming. Describes those aspects of the Intel® MMX™
technology that must be handled and considered at the system programming level, including: task switching,
exception handling, and compatibility with existing system environments.
Chapter 15 — System Programming For Instruction Set Extensions And Processor Extended States.
Describes the operating system requirements to support SSE/SSE2/SSE3/SSSE3/SSE4 extensions, including task
switching, exception handling, and compatibility with existing system environments. The latter part of this chapter
describes the extensible framework of operating system requirements to support processor extended states.
Processor extended state may be required by instruction set extensions beyond those of
SSE/SSE2/SSE3/SSSE3/SSE4 extensions.
Chapter 16 — Power and Thermal Management. Describes facilities of Intel 64 and IA-32 architecture used for
power management and thermal monitoring.
Chapter 17 — Machine-Check Architecture. Describes the machine-check architecture and machine-check
exception mechanism found in the Pentium 4, Intel Xeon, and P6 family processors. Additionally, a signaling mech-
anism for software to respond to hardware-corrected machine-check errors is covered.
Chapter 18 — Interpreting Machine-Check Error Codes. Gives an example of how to interpret the error codes
for a machine-check error that occurred on a P6 family processor.
Chapter 19 — Debug, Branch Profile, TSC, and Resource Monitoring Features. Describes the debugging
registers and other debug mechanisms provided in Intel 64 or IA-32 processors. This chapter also describes the
time-stamp counter.
Chapter 20 — Last Branch Records. Describes the Last Branch Records (architectural feature).
Chapter 21 — Performance Monitoring. Describes the Intel 64 and IA-32 architectures’ facilities for monitoring
performance.
Chapter 22 — 8086 Emulation. Describes the real-address and virtual-8086 modes of the IA-32 architecture.
Chapter 23 — Mixing 16-Bit and 32-Bit Code. Describes how to mix 16-bit and 32-bit code modules within the
same program or task.
Chapter 24 — IA-32 Architecture Compatibility. Describes architectural compatibility among IA-32 proces-
sors.
Chapter 25 — Introduction to Virtual Machine Extensions. Describes the basic elements of virtual machine
architecture and the virtual machine extensions for Intel 64 and IA-32 Architectures.

Chapter 26 — Virtual Machine Control Structures. Describes components that manage VMX operation. These
include the working-VMCS pointer and the controlling-VMCS pointer.
Chapter 27 — VMX Non-Root Operation. Describes VMX non-root operation. Processor operation in VMX non-root mode can be restricted programmatically such that certain operations, events, or conditions can cause the processor to transfer control from the guest (running in VMX non-root mode) to the monitor software (running in VMX root mode).
Chapter 28 — VM Entries. Describes VM entries. A VM entry transitions the processor from the VMM running in VMX root mode to a VM running in VMX non-root mode. VM entry is performed by executing the VMLAUNCH or VMRESUME instruction.
Chapter 29 — VM Exits. Describes VM exits. Certain events, operations or situations while the processor is in VMX
non-root operation may cause VM-exit transitions. In addition, VM exits can also occur on failed VM entries.
Chapter 30 — VMX Support for Address Translation. Describes virtual-machine extensions that support
address translation and the virtualization of physical memory.
Chapter 31 — APIC Virtualization and Virtual Interrupts. Describes the VMCS including controls that enable
the virtualization of interrupts and the Advanced Programmable Interrupt Controller (APIC).
Chapter 32 — VMX Instruction Reference. Describes the virtual-machine extensions (VMX). VMX is intended
for a system executive to support virtualization of processor hardware and a system software layer acting as a host
to multiple guest software environments.
Chapter 33 — System Management Mode. Describes Intel 64 and IA-32 architectures’ system management
mode (SMM) facilities.
Chapter 34 — Intel® Processor Trace. Describes details of Intel® Processor Trace.
Chapter 35 — Introduction to Intel® Software Guard Extensions. Provides an overview of the Intel® Soft-
ware Guard Extensions (Intel® SGX) set of instructions.
Chapter 36 — Enclave Access Control and Data Structures. Describes Enclave Access Control procedures and
defines various Intel SGX data structures.
Chapter 37 — Enclave Operation. Describes enclave creation and initialization, adding pages and measuring an
enclave, and enclave entry and exit.
Chapter 38 — Enclave Exiting Events. Describes enclave-exiting events (EEE) and asynchronous enclave exit
(AEX).
Chapter 39 — SGX Instruction References. Describes the supervisor and user level instructions provided by
Intel SGX.
Chapter 40 — Intel® SGX Interactions with IA32 and Intel® 64 Architecture. Describes the Intel SGX
collection of enclave instructions for creating protected execution environments on processors supporting IA32 and
Intel 64 architectures.
Chapter 41 — Enclave Code Debug and Profiling. Describes enclave code debug processes and options.
Appendix A — VMX Capability Reporting Facility. Describes the VMX capability MSRs. Support for specific VMX
features is determined by reading capability MSRs.
Appendix B — Field Encoding in VMCS. Enumerates all fields in the VMCS and their encodings. Fields are
grouped by width (16-bit, 32-bit, etc.) and type (guest-state, host-state, etc.).
Appendix C — VM Basic Exit Reasons. Describes the 32-bit fields that encode reasons for a VM exit. Examples
of exit reasons include, but are not limited to: software interrupts, processor exceptions, software traps, NMIs,
external interrupts, and triple faults.

11. Updates to Chapter 2, Volume 3A
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new LAM bits for CR3 and CR4. Bit 61 of CR3 is “LAM_U57.” Bit 62 of CR3 is “LAM_U48.” Bit 28 of CR4
is “LAM_SUP.” Figure 2-7 was updated to include these bits as well.

CHAPTER 2
SYSTEM ARCHITECTURE OVERVIEW

IA-32 architecture (beginning with the Intel386 processor family) provides extensive support for operating-system
and system-development software. This support offers multiple modes of operation, which include:
• Real mode, protected mode, virtual 8086 mode, and system management mode. These are sometimes
referred to as legacy modes.
Intel 64 architecture supports almost all the system programming facilities available in IA-32 architecture and
extends them to a new operating mode (IA-32e mode) that supports a 64-bit programming environment. IA-32e
mode allows software to operate in one of two sub-modes:
• 64-bit mode supports 64-bit OS and 64-bit applications.
• Compatibility mode allows most legacy software to run; it co-exists with 64-bit applications under a 64-bit OS.
The IA-32 system-level architecture includes features to assist in the following operations:
• Memory management.
• Protection of software modules.
• Multitasking.
• Exception and interrupt handling.
• Multiprocessing.
• Cache management.
• Hardware resource and power management.
• Debugging and performance monitoring.
This chapter provides a description of each part of this architecture. It also describes the system registers that are
used to set up and control the processor at the system level and gives a brief overview of the processor’s system-
level (operating system) instructions.
Many features of the system-level architecture are used only by system programmers. However, application
programmers may need to read this chapter and the following chapters in order to create a reliable and secure
environment for application programs.
This overview and most subsequent chapters of this book focus on protected-mode operation of the IA-32 architec-
ture. IA-32e mode operation of the Intel 64 architecture, as it differs from protected mode operation, is also
described.
All Intel 64 and IA-32 processors enter real-address mode following a power-up or reset (see Chapter 11,
“Processor Management and Initialization”). Software then initiates the switch from real-address mode to
protected mode. If IA-32e mode operation is desired, software also initiates a switch from protected mode to IA-
32e mode.

2.1 OVERVIEW OF THE SYSTEM-LEVEL ARCHITECTURE


System-level architecture consists of a set of registers, data structures, and instructions designed to support basic
system-level operations such as memory management, interrupt and exception handling, task management, and
control of multiple processors.
Figure 2-1 provides a summary of system registers and data structures that applies to 32-bit modes. System regis-
ters and data structures that apply to IA-32e mode are shown in Figure 2-2.

[Figure 2-1 appears here. The diagram shows the IA-32 system-level registers (EFLAGS, control registers CR0–CR4, XCR0, the task register, GDTR, IDTR, and LDTR) together with the data structures they reference: the GDT, LDT, and IDT with their segment descriptors and gates, task-state segments, and interrupt, exception, and protected-procedure handlers. It also shows a page-mapping example, for 4-KByte pages and 32-bit paging, in which CR3 (a physical address) locates a page directory whose entries locate page tables whose entries locate pages.]

Figure 2-1. IA-32 System-Level Registers and Data Structures

[Figure 2-2 appears here. The diagram parallels Figure 2-1 for IA-32e mode: RFLAGS; control registers CR0–CR4 and CR8; XCR0 and PKRU; the task register, GDTR, IDTR, and LDTR; the GDT, LDT, and IDT with 16-byte descriptors and gates (including the interrupt stack table, IST); code, data, and stack segments with base 0; and a page-mapping example, for 4-KByte pages and 4-level paging, in which CR3 (a physical address) locates a PML4 table, which in turn locates a page-directory-pointer table, page directory, page table, and page.]

Figure 2-2. System-Level Registers and Data Structures in IA-32e Mode and 4-Level Paging

2.1.1 Global and Local Descriptor Tables


When operating in protected mode, all memory accesses pass through either the global descriptor table (GDT) or
an optional local descriptor table (LDT) as shown in Figure 2-1. These tables contain entries called segment
descriptors. Segment descriptors provide the base address of segments as well as access rights, type, and usage information.

Each segment descriptor has an associated segment selector. A segment selector provides the software that uses
it with an index into the GDT or LDT (the offset of its associated segment descriptor), a global/local flag (deter-
mines whether the selector points to the GDT or the LDT), and access rights information.
To access a byte in a segment, a segment selector and an offset must be supplied. The segment selector provides
access to the segment descriptor for the segment (in the GDT or LDT). From the segment descriptor, the processor
obtains the base address of the segment in the linear address space. The offset then provides the location of the
byte relative to the base address. This mechanism can be used to access any valid code, data, or stack segment,
provided the segment is accessible from the current privilege level (CPL) at which the processor is operating. The
CPL is defined as the protection level of the currently executing code segment.
See Figure 2-1. The solid arrows in the figure indicate a linear address, dashed lines indicate a segment selector,
and the dotted arrows indicate a physical address. For simplicity, many of the segment selectors are shown as
direct pointers to a segment. However, the actual path from a segment selector to its associated segment is always
through a GDT or LDT.
The linear address of the base of the GDT is contained in the GDT register (GDTR); the linear address of the LDT is
contained in the LDT register (LDTR).
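The selector fields described above occupy fixed bit positions (RPL in bits 1:0, the table indicator in bit 2, and the descriptor index in bits 15:3; the exact layout is detailed in Chapter 3). As an illustration only, they can be decoded mechanically:

#include <stdint.h>

/* Decode the architectural fields of a segment selector. */
typedef struct {
    unsigned rpl;    /* requested privilege level (bits 1:0) */
    unsigned ti;     /* table indicator: 0 = GDT, 1 = LDT (bit 2) */
    unsigned index;  /* index into the GDT or LDT (bits 15:3) */
} selector_fields;

static selector_fields decode_selector(uint16_t selector)
{
    selector_fields f;
    f.rpl   = selector & 0x3u;
    f.ti    = (selector >> 2) & 0x1u;
    f.index = selector >> 3;
    return f;
}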

2.1.1.1 Global and Local Descriptor Tables in IA-32e Mode


GDTR and LDTR registers are expanded to 64 bits wide in both IA-32e sub-modes (64-bit mode and compatibility mode). For more information, see Section 3.5.2, “Segment Descriptor Tables in IA-32e Mode.”
Global and local descriptor tables are expanded in 64-bit mode to support 64-bit base addresses (16-byte LDT descriptors hold a 64-bit base address and various attributes). In compatibility mode, descriptors are not expanded.

2.1.2 System Segments, Segment Descriptors, and Gates


Besides code, data, and stack segments that make up the execution environment of a program or procedure, the
architecture defines two system segments: the task-state segment (TSS) and the LDT. The GDT is not considered
a segment because it is not accessed by means of a segment selector and segment descriptor. TSSs and LDTs have
segment descriptors defined for them.
The architecture also defines a set of special descriptors called gates (call gates, interrupt gates, trap gates, and
task gates). These provide protected gateways to system procedures and handlers that may operate at a different
privilege level than application programs and most procedures. For example, a CALL to a call gate can provide
access to a procedure in a code segment that is at the same or a numerically lower privilege level (more privileged)
than the current code segment. To access a procedure through a call gate, the calling procedure1 supplies the
selector for the call gate. The processor then performs an access rights check on the call gate, comparing the CPL
with the privilege level of the call gate and the destination code segment pointed to by the call gate.
If access to the destination code segment is allowed, the processor gets the segment selector for the destination
code segment and an offset into that code segment from the call gate. If the call requires a change in privilege
level, the processor also switches to the stack for the targeted privilege level. The segment selector for the new
stack is obtained from the TSS for the currently running task. Gates also facilitate transitions between 16-bit and
32-bit code segments, and vice versa.

2.1.2.1 Gates in IA-32e Mode


In IA-32e mode, the following descriptors are 16-byte descriptors (expanded to allow a 64-bit base): LDT descrip-
tors, 64-bit TSSs, call gates, interrupt gates, and trap gates.
Call gates facilitate transitions between 64-bit mode and compatibility mode. Task gates are not supported in IA-
32e mode. On privilege level changes, stack segment selectors are not read from the TSS. Instead, they are set to
NULL.

1. The word “procedure” is commonly used in this document as a general term for a logical unit or block of code (such as a program, pro-
cedure, function, or routine).

2.1.3 Task-State Segments and Task Gates


The TSS (see Figure 2-1) defines the state of the execution environment for a task. It includes the state of general-
purpose registers, segment registers, the EFLAGS register, the EIP register, and segment selectors with stack
pointers for three stack segments (one stack for each privilege level). The TSS also includes the segment selector
for the LDT associated with the task and the base address of the paging-structure hierarchy.
All program execution in protected mode happens within the context of a task (called the current task). The
segment selector for the TSS for the current task is stored in the task register. The simplest method for switching
to a task is to make a call or jump to the new task. Here, the segment selector for the TSS of the new task is given
in the CALL or JMP instruction. In switching tasks, the processor performs the following actions:
1. Stores the state of the current task in the current TSS.
2. Loads the task register with the segment selector for the new task.
3. Accesses the new TSS through a segment descriptor in the GDT.
4. Loads the state of the new task from the new TSS into the general-purpose registers, the segment registers,
the LDTR, control register CR3 (base address of the paging-structure hierarchy), the EFLAGS register, and the
EIP register.
5. Begins execution of the new task.
A task can also be accessed through a task gate. A task gate is similar to a call gate, except that it provides access
(through a segment selector) to a TSS rather than a code segment.

2.1.3.1 Task-State Segments in IA-32e Mode


Hardware task switches are not supported in IA-32e mode. However, TSSs continue to exist. The base address of
a TSS is specified by its descriptor.
A 64-bit TSS holds the following information that is important to 64-bit operation:
• Stack pointer addresses for each privilege level.
• Pointer addresses for the interrupt stack table.
• Offset address of the IO-permission bitmap (from the TSS base).
The task register is expanded to hold 64-bit base addresses in IA-32e mode. See also: Section 9.7, “Task Manage-
ment in 64-bit Mode.”

2.1.4 Interrupt and Exception Handling


External interrupts, software interrupts, and exceptions are handled through the interrupt descriptor table (IDT).
The IDT stores a collection of gate descriptors that provide access to interrupt and exception handlers. Like the
GDT, the IDT is not a segment. The linear address for the base of the IDT is contained in the IDT register (IDTR).
Gate descriptors in the IDT can be interrupt, trap, or task gate descriptors. To access an interrupt or exception
handler, the processor first receives an interrupt vector from internal hardware, an external interrupt controller, or
from software by means of an INT n, INTO, INT3, INT1, or BOUND instruction. The interrupt vector provides an
index into the IDT. If the selected gate descriptor is an interrupt gate or a trap gate, the associated handler proce-
dure is accessed in a manner similar to calling a procedure through a call gate. If the descriptor is a task gate, the
handler is accessed through a task switch.

2.1.4.1 Interrupt and Exception Handling in IA-32e Mode


In IA-32e mode, interrupt gate descriptors are expanded to 16 bytes to support 64-bit base addresses. This is true
for 64-bit mode and compatibility mode.
The IDTR register is expanded to hold a 64-bit base address. Task gates are not supported.

2.1.5 Memory Management


System architecture supports either direct physical addressing of memory or virtual memory (through paging). When physical addressing is used, a linear address is treated as a physical address. When paging is used, all code, data, stack, and system segments (including the GDT and IDT) can be paged, with only the most recently accessed pages being held in physical memory.
The location of pages (sometimes called page frames) in physical memory is contained in the paging structures.
These structures reside in physical memory (see Figure 2-1 for the case of 32-bit paging).
The base physical address of the paging-structure hierarchy is contained in control register CR3. The entries in the
paging structures determine the physical address of the base of a page frame, access rights and memory manage-
ment information.
To use this paging mechanism, a linear address is broken into parts. The parts provide separate offsets into the
paging structures and the page frame. A system can have a single hierarchy of paging structures or several. For
example, each task can have its own hierarchy.
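To make the "broken into parts" concrete for the 4-KByte-page, 32-bit paging example of Figure 2-1 (page-directory index in bits 31:22, page-table index in bits 21:12, page offset in bits 11:0), here is an illustrative C sketch:

#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit linear address for 32-bit paging with 4-KByte pages. */
static void split_linear(uint32_t la, uint32_t *dir, uint32_t *tbl,
                         uint32_t *off)
{
    *dir = la >> 22;            /* index into the page directory */
    *tbl = (la >> 12) & 0x3FF;  /* index into the page table */
    *off = la & 0xFFF;          /* offset within the 4-KByte page */
}

int main(void)
{
    uint32_t dir, tbl, off;
    split_linear(0x12345678, &dir, &tbl, &off);
    printf("dir=%u tbl=%u off=0x%03X\n", dir, tbl, off);  /* dir=72 tbl=837 off=0x678 */
    return 0;
}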

2.1.5.1 Memory Management in IA-32e Mode


In IA-32e mode, physical memory pages are managed by a set of system data structures. In both compatibility
mode and 64-bit mode, four or five levels of system data structures are used (see Chapter 5, “Paging”). These
include the following:
• The page map level 5 (PML5) — An entry in the PML5 table contains the physical address of the base of a
PML4 table, access rights, and memory management information. The base physical address of the PML5 table
is stored in CR3. The PML5 table is used only with 5-level paging.
• A page map level 4 (PML4) — An entry in a PML4 table contains the physical address of the base of a page
directory pointer table, access rights, and memory management information. With 4-level paging, there is only
one PML4 table and its base physical address is stored in CR3.
• A set of page directory pointer tables — An entry in a page directory pointer table contains the physical
address of the base of a page directory table, access rights, and memory management information.
• Sets of page directories — An entry in a page directory table contains the physical address of the base of a
page table, access rights, and memory management information.
• Sets of page tables — An entry in a page table contains the physical address of a page frame, access rights,
and memory management information.

2.1.6 System Registers


To assist in initializing the processor and controlling system operations, the system architecture provides system
flags in the EFLAGS register and several system registers:
• The system flags and IOPL field in the EFLAGS register control task and mode switching, interrupt handling,
instruction tracing, and access rights. See also: Section 2.3, “System Flags and Fields in the EFLAGS Register.”
• The control registers (CR0, CR2, CR3, and CR4) contain a variety of flags and data fields for controlling system-
level operations. Other flags in these registers are used to indicate support for specific processor capabilities
within the operating system or executive. See also: Section 2.5, “Control Registers,” and Section 2.6, “Extended Control Registers (Including XCR0).”
• The debug registers (not shown in Figure 2-1) allow the setting of breakpoints for use in debugging programs
and systems software. See also: Chapter 19, “Debug, Branch Profile, TSC, and Intel® Resource Director
Technology (Intel® RDT) Features.”
• The GDTR, LDTR, and IDTR registers contain the linear addresses and sizes (limits) of their respective tables.
See also: Section 2.4, “Memory-Management Registers.”
• The task register contains the linear address and size of the TSS for the current task. See also: Section 2.4,
“Memory-Management Registers.”
• Model-specific registers (not shown in Figure 2-1).

The model-specific registers (MSRs) are a group of registers available primarily to operating-system or executive
procedures (that is, code running at privilege level 0). These registers control items such as the debug extensions,
the performance-monitoring counters, the machine-check architecture, and the memory type ranges (MTRRs).
The number and function of these registers varies among different members of the Intel 64 and IA-32 processor
families. See also: Section 11.4, “Model-Specific Registers (MSRs),” and Chapter 2, “Model-Specific Registers
(MSRs),” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4.
Most systems restrict access to system registers (other than the EFLAGS register) by application programs.
Systems can be designed, however, where all programs and procedures run at the most privileged level (privilege
level 0). In such a case, application programs would be allowed to modify the system registers.

2.1.6.1 System Registers in IA-32e Mode


In IA-32e mode, the four system-descriptor-table registers (GDTR, IDTR, LDTR, and TR) are expanded in hardware
to hold 64-bit base addresses. EFLAGS becomes the 64-bit RFLAGS register. CR0–CR4 are expanded to 64 bits.
CR8 becomes available. CR8 provides read-write access to the task priority register (TPR) so that the operating
system can control the priority classes of external interrupts.
In 64-bit mode, debug registers DR0–DR7 are 64 bits. In compatibility mode, address-matching in DR0–DR3 is
also done at 64-bit granularity.
On systems that support IA-32e mode, the extended feature enable register (IA32_EFER) is available. This model-
specific register controls activation of IA-32e mode and other IA-32e mode operations. In addition, there are
several model-specific registers that govern IA-32e mode instructions:
• IA32_KERNEL_GS_BASE — Used by SWAPGS instruction.
• IA32_LSTAR — Used by SYSCALL instruction.
• IA32_FMASK — Used by SYSCALL instruction.
• IA32_STAR — Used by SYSCALL and SYSRET instructions.
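An illustrative sketch of how a 64-bit OS might program these SYSCALL-related MSRs at initialization follows. The MSR addresses are architectural; wrmsr() is an assumed wrapper around the WRMSR instruction, syscall_entry is a hypothetical kernel entry stub, and the selector values are placeholders. IA32_EFER.SCE must also be set (see Section 2.2.1).

#include <stdint.h>

#define MSR_IA32_STAR   0xC0000081u
#define MSR_IA32_LSTAR  0xC0000082u
#define MSR_IA32_FMASK  0xC0000084u

extern void wrmsr(uint32_t msr, uint64_t value);  /* assumed WRMSR wrapper */
extern void syscall_entry(void);                  /* hypothetical entry stub */

void setup_syscall_msrs(uint16_t kernel_cs, uint16_t sysret_cs_base)
{
    /* STAR[47:32]: CS/SS selector base loaded by SYSCALL;
     * STAR[63:48]: selector base used by SYSRET. */
    wrmsr(MSR_IA32_STAR, ((uint64_t)sysret_cs_base << 48) |
                         ((uint64_t)kernel_cs << 32));
    /* LSTAR: 64-bit RIP of the SYSCALL entry point. */
    wrmsr(MSR_IA32_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
    /* FMASK: RFLAGS bits cleared on SYSCALL; mask TF (bit 8) and IF (bit 9). */
    wrmsr(MSR_IA32_FMASK, (1ull << 8) | (1ull << 9));
}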

2.1.7 Other System Resources


Besides the system registers and data structures described in the previous sections, system architecture provides
the following additional resources:
• Operating system instructions (see also: Section 2.8, “System Instruction Summary”).
• Performance-monitoring counters (not shown in Figure 2-1).
• Internal caches and buffers (not shown in Figure 2-1).
Performance-monitoring counters are event counters that can be programmed to count processor events such as
the number of instructions decoded, the number of interrupts received, or the number of cache loads.
The processor provides several internal caches and buffers. The caches are used to store both data and instruc-
tions. The buffers are used to store things like decoded addresses to system and application segments and write
operations waiting to be performed. See also: Chapter 13, “Memory Cache Control.”

2.2 MODES OF OPERATION


The IA-32 architecture supports three operating modes and one quasi-operating mode:
• Protected mode — This is the native operating mode of the processor. It provides a rich set of architectural
features, flexibility, high performance and backward compatibility to existing software base.
• Real-address mode — This operating mode provides the programming environment of the Intel 8086
processor, with a few extensions (such as the ability to switch to protected or system management mode).
• System management mode (SMM) — SMM is a standard architectural feature in all IA-32 processors,
beginning with the Intel386 SL processor. This mode provides an operating system or executive with a
transparent mechanism for implementing power management and OEM differentiation features. SMM is
entered through activation of an external system interrupt pin (SMI#), which generates a system management interrupt (SMI). In SMM, the processor switches to a separate address space while saving the context of the
currently running program or task. SMM-specific code may then be executed transparently. Upon returning
from SMM, the processor is placed back into its state prior to the SMI.
• Virtual-8086 mode — In protected mode, the processor supports a quasi-operating mode known as virtual-8086 mode. This mode allows the processor to execute 8086 software in a protected, multitasking environment.
Intel 64 architecture supports all operating modes of IA-32 architecture and IA-32e modes:
• IA-32e mode — In IA-32e mode, the processor supports two sub-modes: compatibility mode and 64-bit
mode. 64-bit mode provides 64-bit linear addressing and support for physical address space larger than 64
GBytes. Compatibility mode allows most legacy protected-mode applications to run unchanged.
Figure 2-3 shows how the processor moves between operating modes.

[Figure 2-3 appears here. The state diagram shows: reset or CR0.PE = 0 placing the processor in real-address mode; CR0.PE = 1 entering protected mode; EFLAGS.VM = 1/0 moving between protected mode and virtual-8086 mode; IA32_EFER.LME = 1 with CR0.PG = 1 entering IA-32e mode from protected mode (see Section 10.8.5 and Section 10.8.5.4); and SMI#/RSM moving between each of these modes and system management mode.]

Figure 2-3. Transitions Among the Processor’s Operating Modes

The processor is placed in real-address mode following power-up or a reset. The PE flag in control register CR0 then
controls whether the processor is operating in real-address or protected mode. See also: Section 11.9, “Mode
Switching,” and Section 5.1.2, “Paging-Mode Enabling.”
The VM flag in the EFLAGS register determines whether the processor is operating in protected mode or virtual-
8086 mode. Transitions between protected mode and virtual-8086 mode are generally carried out as part of a task
switch or a return from an interrupt or exception handler. See also: Section 22.2.5, “Entering Virtual-8086 Mode.”
The LMA bit (IA32_EFER.LMA[bit 10]) determines whether the processor is operating in IA-32e mode. When
running in IA-32e mode, 64-bit or compatibility sub-mode operation is determined by CS.L bit of the code segment.
The processor enters into IA-32e mode from protected mode by enabling paging and setting the LME bit
(IA32_EFER.LME[bit 8]). See also: Chapter 11, “Processor Management and Initialization.”
The processor switches to SMM whenever it receives an SMI while the processor is in real-address, protected,
virtual-8086, or IA-32e modes. Upon execution of the RSM instruction, the processor always returns to the mode
it was in when the SMI occurred.

2.2.1 Extended Feature Enable Register


The IA32_EFER MSR provides several fields related to IA-32e mode enabling and operation. It also provides one
field that relates to page-access right modification (see Section 5.6, “Access Rights”). The layout of the IA32_EFER
MSR is shown in Figure 2-4.

[Figure 2-4 appears here. The diagram shows the IA32_EFER MSR layout: SYSCALL enable (bit 0), IA-32e mode enable (bit 8), IA-32e mode active (bit 10), and execute disable bit enable (bit 11); bits 7:1, bit 9, and bits 63:12 are reserved.]

Figure 2-4. IA32_EFER MSR Layout

Table 2-1. IA32_EFER MSR Information


Bit Description
0 SYSCALL Enable: IA32_EFER.SCE (R/W)
Enables SYSCALL/SYSRET instructions in 64-bit mode.
7:1 Reserved.
8 IA-32e Mode Enable: IA32_EFER.LME (R/W)
Enables IA-32e mode operation.
9 Reserved.
10 IA-32e Mode Active: IA32_EFER.LMA (R)
Indicates IA-32e mode is active when set.
11 Execute Disable Bit Enable: IA32_EFER.NXE (R/W)
Enables page access restriction by preventing instruction fetches from PAE pages with the XD bit set (See Section 5.6).
63:12 Reserved.
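An illustrative sketch (not from the manual) of reading and updating this MSR: rdmsr()/wrmsr() are assumed CPL-0 wrappers around the RDMSR and WRMSR instructions, and the bit positions come from Table 2-1.

#include <stdint.h>

#define MSR_IA32_EFER 0xC0000080u
#define EFER_SCE (1ull << 0)   /* SYSCALL enable */
#define EFER_LME (1ull << 8)   /* IA-32e mode enable */
#define EFER_LMA (1ull << 10)  /* IA-32e mode active (read-only) */
#define EFER_NXE (1ull << 11)  /* execute disable bit enable */

extern uint64_t rdmsr(uint32_t msr);              /* assumed RDMSR wrapper */
extern void wrmsr(uint32_t msr, uint64_t value);  /* assumed WRMSR wrapper */

/* Enable SYSCALL/SYSRET and the execute-disable bit while preserving the
 * other fields (setting reserved bits would cause #GP). */
void enable_sce_and_nxe(void)
{
    uint64_t efer = rdmsr(MSR_IA32_EFER);
    wrmsr(MSR_IA32_EFER, efer | EFER_SCE | EFER_NXE);
}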

2.3 SYSTEM FLAGS AND FIELDS IN THE EFLAGS REGISTER


The system flags and IOPL field of the EFLAGS register control I/O, maskable hardware interrupts, debugging, task
switching, and the virtual-8086 mode (see Figure 2-5). Only privileged code (typically operating system or execu-
tive code) should be allowed to modify these bits.
The system flags and IOPL are:
TF Trap (bit 8) — Set to enable single-step mode for debugging; clear to disable single-step mode. In single-
step mode, the processor generates a debug exception after each instruction. This allows the execution
state of a program to be inspected after each instruction. If an application program sets the TF flag using a POPF, POPFD, or IRET instruction, a debug exception is generated after the instruction that follows the
POPF, POPFD, or IRET.

[Figure 2-5 appears here. The diagram shows the EFLAGS register with its system flags marked: TF (bit 8), IF (bit 9), IOPL (bits 13:12), NT (bit 14), RF (bit 16), VM (bit 17), AC (bit 18), VIF (bit 19), VIP (bit 20), and ID (bit 21). Bits 31:22 are reserved (set to 0); the remaining bits are the status and control flags.]

Figure 2-5. System Flags in the EFLAGS Register

IF Interrupt enable (bit 9) — Controls the response of the processor to maskable hardware interrupt
requests (see also: Section 7.3.2, “Maskable Hardware Interrupts”). The flag is set to respond to maskable
hardware interrupts; cleared to inhibit maskable hardware interrupts. The IF flag does not affect the gener-
ation of exceptions or nonmaskable interrupts (NMI interrupts). The CPL, IOPL, and the state of the VME
flag in control register CR4 determine whether the IF flag can be modified by the CLI, STI, POPF, POPFD,
and IRET instructions.
IOPL I/O privilege level field (bits 12 and 13) — Indicates the I/O privilege level (IOPL) of the currently
running program or task. The CPL of the currently running program or task must be less than or equal to
the IOPL to access the I/O address space. The POPF and IRET instructions can modify this field only when
operating at a CPL of 0.
The IOPL is also one of the mechanisms that controls the modification of the IF flag and the handling of
interrupts in virtual-8086 mode when virtual mode extensions are in effect (when CR4.VME = 1). See also:
Chapter 20, “Input/Output,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
1.
NT Nested task (bit 14) — Controls the chaining of interrupted and called tasks. The processor sets this flag
on calls to a task initiated with a CALL instruction, an interrupt, or an exception. It examines and modifies
this flag on returns from a task initiated with the IRET instruction. The flag can be explicitly set or cleared
with the POPF/POPFD instructions; however, changing the state of this flag can generate unexpected
exceptions in application programs.
See also: Section 9.4, “Task Linking.”
RF Resume (bit 16) — Controls the processor’s response to instruction-breakpoint conditions. When set, this
flag temporarily disables debug exceptions (#DB) from being generated for instruction breakpoints
(although other exception conditions can cause an exception to be generated). When clear, instruction
breakpoints will generate debug exceptions.
The primary function of the RF flag is to allow the restarting of an instruction following a debug exception
that was caused by an instruction breakpoint condition. Here, debug software must set this flag in the
EFLAGS image on the stack just prior to returning to the interrupted program with IRETD (to prevent the
instruction breakpoint from causing another debug exception). The processor then automatically clears
this flag after the instruction returned to has been successfully executed, enabling instruction breakpoint
faults again.
See also: Section 19.3.1.1, “Instruction-Breakpoint Exception Condition.”
VM Virtual-8086 mode (bit 17) — Set to enable virtual-8086 mode; clear to return to protected mode.

See also: Section 22.2.1, “Enabling Virtual-8086 Mode.”


AC Alignment check or access control (bit 18) — If the AM bit is set in the CR0 register, alignment
checking of user-mode data accesses is enabled if and only if this flag is 1. An alignment-check exception
is generated when reference is made to an unaligned operand, such as a word at an odd byte address or a
doubleword at an address which is not an integral multiple of four. Alignment-check exceptions are gener-
ated only in user mode (privilege level 3). Memory references that default to privilege level 0, such as
segment descriptor loads, do not generate this exception even when caused by instructions executed in
user mode.
The alignment-check exception can be used to check alignment of data. This is useful when exchanging
data with processors which require all data to be aligned. The alignment-check exception can also be used
by interpreters to flag some pointers as special by misaligning the pointer. This eliminates overhead of
checking each pointer and only handles the special pointer when used.
If the SMAP bit is set in the CR4 register, explicit supervisor-mode data accesses to user-mode pages are
allowed if and only if this bit is 1. See Section 5.6, “Access Rights.”
VIF Virtual Interrupt (bit 19) — Contains a virtual image of the IF flag. This flag is used in conjunction with
the VIP flag. The processor only recognizes the VIF flag when either the VME flag or the PVI flag in control
register CR4 is set and the IOPL is less than 3. (The VME flag enables the virtual-8086 mode extensions;
the PVI flag enables the protected-mode virtual interrupts.)
See also: Section 22.3.3.5, “Method 6: Software Interrupt Handling,” and Section 22.4, “Protected-Mode
Virtual Interrupts.”
VIP Virtual interrupt pending (bit 20) — Set by software to indicate that an interrupt is pending; cleared to
indicate that no interrupt is pending. This flag is used in conjunction with the VIF flag. The processor reads
this flag but never modifies it. The processor only recognizes the VIP flag when either the VME flag or the
PVI flag in control register CR4 is set and the IOPL is less than 3. The VME flag enables the virtual-8086
mode extensions; the PVI flag enables the protected-mode virtual interrupts.
See Section 22.3.3.5, “Method 6: Software Interrupt Handling,” and Section 22.4, “Protected-Mode Virtual
Interrupts.”
ID Identification (bit 21) — The ability of a program or procedure to set or clear this flag indicates support
for the CPUID instruction.
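The ID flag’s definition suggests the classic CPUID-support probe: flip bit 21 and check whether the change sticks. The sketch below is illustrative only, written for 64-bit GCC/Clang inline assembly; on any processor that supports 64-bit mode the probe trivially succeeds, as it mattered historically on 32-bit parts.

#include <stdint.h>

/* Return nonzero if EFLAGS.ID (bit 21) can be toggled, i.e., CPUID exists. */
static int cpuid_supported(void)
{
    uint64_t before, after;
    __asm__ __volatile__(
        "pushfq\n\t"
        "popq %0\n\t"              /* before = RFLAGS */
        "movq %0, %1\n\t"
        "xorq $0x200000, %1\n\t"   /* flip ID (bit 21) */
        "pushq %1\n\t"
        "popfq\n\t"                /* attempt to write it back */
        "pushfq\n\t"
        "popq %1\n\t"              /* after = resulting RFLAGS */
        : "=&r"(before), "=&r"(after));
    return ((before ^ after) & 0x200000) != 0;
}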

2.3.1 System Flags and Fields in IA-32e Mode


In 64-bit mode, the RFLAGS register expands to 64 bits with the upper 32 bits reserved. System flags in RFLAGS
(64-bit mode) or EFLAGS (compatibility mode) are shown in Figure 2-5.
In IA-32e mode, the processor does not allow the VM bit to be set because virtual-8086 mode is not supported
(attempts to set the bit are ignored). Also, the processor will not set the NT bit. The processor does, however, allow
software to set the NT bit (note that an IRET causes a general protection fault in IA-32e mode if the NT bit is set).
In IA-32e mode, the SYSCALL/SYSRET instructions have a programmable method of specifying which bits are
cleared in RFLAGS/EFLAGS. These instructions save/restore EFLAGS/RFLAGS.

2.4 MEMORY-MANAGEMENT REGISTERS


The processor provides four memory-management registers (GDTR, LDTR, IDTR, and TR) that specify the locations
of the data structures which control segmented memory management (see Figure 2-6). Special instructions are
provided for loading and storing these registers.


[Figure: System table registers: GDTR and IDTR each hold a 32(64)-bit linear base address (bits 47(79):16) and a
16-bit table limit (bits 15:0). System segment descriptor registers (automatically loaded): the task register and
LDTR each hold a 16-bit segment selector plus hidden fields for a 32(64)-bit linear base address, a segment limit,
and descriptor attributes.]

Figure 2-6. Memory Management Registers

2.4.1 Global Descriptor Table Register (GDTR)


The GDTR register holds the base address (32 bits in protected mode; 64 bits in IA-32e mode) and the 16-bit table
limit for the GDT. The base address specifies the linear address of byte 0 of the GDT; the table limit specifies the
number of bytes in the table.
The LGDT and SGDT instructions load and store the GDTR register, respectively. On power up or reset of the
processor, the base address is set to the default value of 0 and the limit is set to 0FFFFH. A new base address must
be loaded into the GDTR as part of the processor initialization process for protected-mode operation.
See also: Section 3.5.1, “Segment Descriptor Tables.”
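
As a sketch (GCC inline assembly assumed), software can capture the GDTR contents with SGDT into the memory
image the instruction defines; the structure and function names are illustrative.

#include <stdint.h>

/* Memory image stored by SGDT (and SIDT): a 16-bit limit followed by
 * the base address (64 bits in IA-32e mode, 32 bits otherwise). */
struct __attribute__((packed)) descriptor_table_reg {
    uint16_t limit;
    uint64_t base;
};

static inline struct descriptor_table_reg read_gdtr(void)
{
    struct descriptor_table_reg gdtr;
    /* SGDT faults with #GP if CR4.UMIP = 1 and CPL > 0. */
    asm volatile("sgdt %0" : "=m"(gdtr));
    return gdtr;
}
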

2.4.2 Local Descriptor Table Register (LDTR)


The LDTR register holds the 16-bit segment selector, base address (32 bits in protected mode; 64 bits in IA-32e
mode), segment limit, and descriptor attributes for the LDT. The base address specifies the linear address of byte
0 of the LDT segment; the segment limit specifies the number of bytes in the segment. See also: Section 3.5.1,
“Segment Descriptor Tables.”
The LLDT and SLDT instructions load and store the segment selector part of the LDTR register, respectively. The
segment that contains the LDT must have a segment descriptor in the GDT. When the LLDT instruction loads a
segment selector into the LDTR, the base address, limit, and descriptor attributes from the LDT descriptor are
automatically loaded into the LDTR.
When a task switch occurs, the LDTR is automatically loaded with the segment selector and descriptor for the LDT
for the new task. The contents of the LDTR are not automatically saved prior to writing the new LDT information
into the register.
On power up or reset of the processor, the segment selector and base address are set to the default value of 0 and
the limit is set to 0FFFFH.

2.4.3 Interrupt Descriptor Table Register (IDTR)


The IDTR register holds the base address (32 bits in protected mode; 64 bits in IA-32e mode) and 16-bit table limit
for the IDT. The base address specifies the linear address of byte 0 of the IDT; the table limit specifies the number
of bytes in the table. The LIDT and SIDT instructions load and store the IDTR register, respectively. On power up or
reset of the processor, the base address is set to the default value of 0 and the limit is set to 0FFFFH. The base
address and limit in the register can then be changed as part of the processor initialization process.
See also: Section 7.10, “Interrupt Descriptor Table (IDT).”


2.4.4 Task Register (TR)


The task register holds the 16-bit segment selector, base address (32 bits in protected mode; 64 bits in IA-32e
mode), segment limit, and descriptor attributes for the TSS of the current task. The selector references the TSS
descriptor in the GDT. The base address specifies the linear address of byte 0 of the TSS; the segment limit speci-
fies the number of bytes in the TSS. See also: Section 9.2.4, “Task Register.”
The LTR and STR instructions load and store the segment selector part of the task register, respectively. When the
LTR instruction loads a segment selector in the task register, the base address, limit, and descriptor attributes from
the TSS descriptor are automatically loaded into the task register. On power up or reset of the processor, the base
address is set to the default value of 0 and the limit is set to 0FFFFH.
When a task switch occurs, the task register is automatically loaded with the segment selector and descriptor for
the TSS for the new task. The contents of the task register are not automatically saved prior to writing the new TSS
information into the register.

2.5 CONTROL REGISTERS


Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-7) determine the operating mode of the processor
and the characteristics of the currently executing task. These registers are 32 bits wide in all 32-bit modes and in
compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions are used to manipulate the
register bits. Operand-size prefixes for these instructions are ignored. The following is also true:
• The control registers can be read and loaded (or modified) using the move-to-or-from-control-registers forms
of the MOV instruction. In protected mode, the MOV instructions allow the control registers to be read or loaded
(at privilege level 0 only). This restriction means that application programs or operating-system procedures
(running at privilege levels 1, 2, or 3) are prevented from reading or loading the control registers.
• Some of the bits in CR0 and CR4 are reserved and must be written with zeros. Attempts to set reserved bits
in CR0[31:0] are ignored. Attempts to set reserved bits in CR0[63:32] or in CR4 result in a general-protection
exception, #GP(0).
• All 64 bits of CR2 are writable by software.
• Bits in CR3 in the range 63:MAXPHYADDR that are reserved (see Figure 2-7) must be zero. Attempting to set
any of them results in #GP(0).
• The MOV CR2 instruction does not check that the address written to CR2 is canonical.
• A 64-bit capable processor will retain the upper 32 bits of each control register when transitioning out of IA-32e
mode.
• On a 64-bit capable processor, an execution of MOV to CR outside of 64-bit mode zeros the upper 32 bits of the
control register.
• Register CR8 is available in 64-bit mode only.
The control registers are summarized below, and each architecturally defined control field in these registers is
described individually. In Figure 2-7, the width of each register in 64-bit mode is indicated in parentheses (except
for CR0).
• CR0 — Contains system control flags that control operating mode and states of the processor.
• CR1 — Reserved.
• CR2 — Contains the page-fault linear address (the linear address that caused a page fault).
• CR3 — Contains the physical address of the base of the paging-structure hierarchy and two flags (PCD and
PWT). Only the most-significant bits (less the lower 12 bits) of the base address are specified; the lower 12 bits
of the address are assumed to be 0. The first paging structure must thus be aligned to a page (4-KByte)
boundary. The PCD and PWT flags control caching of that paging structure in the processor’s internal data
caches (they do not control TLB caching of page-directory information).
When using the physical address extension, the CR3 register contains the base address of the page-directory-
pointer table. With 4-level paging and 5-level paging, the CR3 register contains the base address of the PML4


table and PML5 table, respectively. If PCIDs are enabled, CR3 has a format different from that illustrated in
Figure 2-7. See Section 5.5, “4-Level Paging and 5-Level Paging.”
When linear-address masking is supported, CR3 includes two bits that control the masking of user pointers
(see Section 4.4, “Linear-Address Masking”).
See also: Chapter 5, “Paging.”
• CR4 — Contains a group of flags that enable several architectural extensions and indicate operating system or
executive support for specific processor capabilities. Bits CR4[63:32] can be used only for features available
exclusively in IA-32e mode that are enabled after entering 64-bit mode; these bits have no effect outside of
IA-32e mode.
• CR8 — Provides read and write access to the Task Priority Register (TPR). It specifies the priority threshold
value that operating systems use to control the priority class of external interrupts allowed to interrupt the
processor. This register is available only in 64-bit mode. However, interrupt filtering continues to apply in
compatibility mode.

[Figure: CR0: PG (bit 31), CD (30), NW (29), AM (18), WP (16), NE (5), ET (4), TS (3), EM (2), MP (1), PE (0);
other bits reserved. CR2: page-fault linear address (bits 63:0). CR3: page-directory base (bits 63:12), LAM_U48
(bit 62), LAM_U57 (bit 61), PCD (bit 4), PWT (bit 3); other bits reserved. CR4: VME (bit 0), PVI (1), TSD (2),
DE (3), PSE (4), PAE (5), MCE (6), PGE (7), PCE (8), OSFXSR (9), OSXMMEXCPT (10), UMIP (11), LA57 (12),
VMXE (13), SMXE (14), FSGSBASE (16), PCIDE (17), OSXSAVE (18), KL (19), SMEP (20), SMAP (21), PKE (22),
CET (23), PKS (24), UINTR (25), LAM_SUP (28); other bits reserved.]

Figure 2-7. Control Registers


The flags in control registers are:


CR0.PG
Paging (bit 31 of CR0) — Enables paging when set; disables paging when clear. When paging is
disabled, all linear addresses are treated as physical addresses. The PG flag has no effect if the PE flag (bit
0 of register CR0) is not also set; setting the PG flag when the PE flag is clear causes a general-protection
exception (#GP). See also: Chapter 5, “Paging.”
On Intel 64 processors, enabling and disabling IA-32e mode operation also requires modifying CR0.PG.
CR0.CD
Cache Disable (bit 30 of CR0) — When the CD and NW flags are clear, caching of memory locations for
the whole of physical memory in the processor’s internal (and external) caches is enabled. When the CD
flag is set, caching is restricted as described in Table 13-5. To prevent the processor from accessing and
updating its caches, the CD flag must be set and the caches must be invalidated so that no cache hits can
occur.
See also: Section 13.5.3, “Preventing Caching,” and Section 13.5, “Cache Control.”
CR0.NW
Not Write-through (bit 29 of CR0) — When the NW and CD flags are clear, write-back (for Pentium 4,
Intel Xeon, P6 family, and Pentium processors) or write-through (for Intel486 processors) is enabled for
writes that hit the cache and invalidation cycles are enabled. See Table 13-5 for detailed information about
the effect of the NW flag on caching for other settings of the CD and NW flags.
CR0.AM
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; disables alignment
checking when clear. Alignment checking is performed only when the AM flag is set, the AC flag in the
EFLAGS register is set, CPL is 3, and the processor is operating in either protected or virtual-8086 mode.
CR0.WP
Write Protect (bit 16 of CR0) — When set, inhibits supervisor-level procedures from writing into read-
only pages; when clear, allows supervisor-level procedures to write into read-only pages (regardless of the
U/S bit setting; see Section 5.1.3 and Section 5.6). This flag facilitates implementation of the copy-on-
write method of creating a new process (forking) used by operating systems such as UNIX. This flag must
be set before software can set CR4.CET, and it cannot be cleared as long as CR4.CET = 1 (see below).
CR0.NE
Numeric Error (bit 5 of CR0) — Enables the native (internal) mechanism for reporting x87 FPU errors
when set; enables the PC-style x87 FPU error reporting mechanism when clear. When the NE flag is clear
and the IGNNE# input is asserted, x87 FPU errors are ignored. When the NE flag is clear and the IGNNE#
input is deasserted, an unmasked x87 FPU error causes the processor to assert the FERR# pin to generate
an external interrupt and to stop instruction execution immediately before executing the next waiting
floating-point instruction or WAIT/FWAIT instruction.
The FERR# pin is intended to drive an input to an external interrupt controller (the FERR# pin emulates the
ERROR# pin of the Intel 287 and Intel 387 DX math coprocessors). The NE flag, IGNNE# pin, and FERR#
pin are used with external logic to implement PC-style error reporting. Using FERR# and IGNNE# to handle
floating-point exceptions is deprecated by modern operating systems; this non-native approach also limits
newer processors to operate with one logical processor active.
See also: Section 8.7, “Handling x87 FPU Exceptions in Software,” in Chapter 8, “Programming with the
x87 FPU,” and Appendix A, “EFLAGS Cross-Reference,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
CR0.ET
Extension Type (bit 4 of CR0) — Reserved in the Pentium 4, Intel Xeon, P6 family, and Pentium proces-
sors. In the Pentium 4, Intel Xeon, and P6 family processors, this flag is hardcoded to 1. In the Intel386
and Intel486 processors, this flag indicates support of Intel 387 DX math coprocessor instructions when
set.
CR0.TS
Task Switched (bit 3 of CR0) — Allows the saving of the x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4
context on a task switch to be delayed until an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is


actually executed by the new task. The processor sets this flag on every task switch and tests it when
executing x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
• If the TS flag is set and the EM flag (bit 2 of CR0) is clear, a device-not-available exception (#NM) is
raised prior to the execution of any x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction, with the
exception of PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, MOVNTI, CLFLUSH, CRC32, and POPCNT.
See the paragraph below for the special case of the WAIT/FWAIT instructions.
• If the TS flag is set and the MP flag (bit 1 of CR0) and EM flag are clear, an #NM exception is not raised
prior to the execution of an x87 FPU WAIT/FWAIT instruction.
• If the EM flag is set, the setting of the TS flag has no effect on the execution of x87
FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
Table 2-2 shows the actions taken when the processor encounters an x87 FPU instruction based on the
settings of the TS, EM, and MP flags. Tables 14-1 and 15-1 show the actions taken when the processor
encounters an MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction.

The processor does not automatically save the context of the x87 FPU, XMM, and MXCSR registers on a
task switch. Instead, it sets the TS flag, which causes the processor to raise an #NM exception whenever it
encounters an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction in the instruction stream for the
new task (with the exception of the instructions listed above).
The fault handler for the #NM exception can then be used to clear the TS flag (with the CLTS instruction)
and save the context of the x87 FPU, XMM, and MXCSR registers. If the task never encounters an x87
FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction, the x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4
context is never saved.
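
A minimal sketch of this lazy-save scheme for a hypothetical kernel follows; current_task, fpu_owner, and
fpu_area are illustrative names, and FXSAVE/FXRSTOR are used for the x87/XMM/MXCSR state (an XSAVE-based
kernel would substitute those instructions). GCC inline assembly is assumed.

#include <stdint.h>

struct task;                        /* hypothetical kernel task structure   */
extern struct task *current_task;   /* task now being scheduled (assumed)   */
static struct task *fpu_owner;      /* task whose state is live in the FPU  */

/* Returns the task's 512-byte, 16-byte-aligned FXSAVE area (assumed). */
extern uint8_t (*fpu_area(struct task *t))[512];

/* Invoked from the #NM exception stub at CPL 0. */
void handle_nm_exception(void)
{
    asm volatile("clts");           /* clear CR0.TS; FPU use now allowed    */
    if (fpu_owner == current_task)
        return;                     /* this task's state is already live    */
    if (fpu_owner)                  /* save the previous owner's state      */
        asm volatile("fxsave %0" : "=m"(*fpu_area(fpu_owner)));
    asm volatile("fxrstor %0" : : "m"(*fpu_area(current_task)));
    fpu_owner = current_task;
}
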

Table 2-2. Action Taken By x87 FPU Instructions for Different Combinations of EM, MP, and TS

  CR0 Flags          x87 FPU Instruction Type
  EM   MP   TS       Floating-Point       WAIT/FWAIT
  0    0    0        Execute              Execute
  0    0    1        #NM exception        Execute
  0    1    0        Execute              Execute
  0    1    1        #NM exception        #NM exception
  1    0    0        #NM exception        Execute
  1    0    1        #NM exception        Execute
  1    1    0        #NM exception        Execute
  1    1    1        #NM exception        #NM exception

CR0.EM
Emulation (bit 2 of CR0) — Indicates that the processor does not have an internal or external x87 FPU when set;
indicates an x87 FPU is present when clear. This flag also affects the execution of
MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
When the EM flag is set, execution of an x87 FPU instruction generates a device-not-available exception
(#NM). This flag must be set when the processor does not have an internal x87 FPU or is not connected to
an external math coprocessor. Setting this flag forces all floating-point instructions to be handled by soft-
ware emulation. Table 11-3 shows the recommended setting of this flag, depending on the IA-32 processor
and x87 FPU or math coprocessor present in the system. Table 2-2 shows the interaction of the EM, MP, and
TS flags.
Also, when the EM flag is set, execution of an MMX instruction causes an invalid-opcode exception (#UD)
to be generated (see Table 14-1). Thus, if an IA-32 or Intel 64 processor incorporates MMX technology, the
EM flag must be set to 0 to enable execution of MMX instructions.
Similarly for SSE/SSE2/SSE3/SSSE3/SSE4 extensions, when the EM flag is set, execution of most
SSE/SSE2/SSE3/SSSE3/SSE4 instructions causes an invalid opcode exception (#UD) to be generated (see


Table 15-1). If an IA-32 or Intel 64 processor incorporates the SSE/SSE2/SSE3/SSSE3/SSE4 extensions,
the EM flag must be set to 0 to enable execution of these extensions. SSE/SSE2/SSE3/SSSE3/SSE4
instructions not affected by the EM flag include: PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, MOVNTI,
CLFLUSH, CRC32, and POPCNT.
CR0.MP
Monitor Coprocessor (bit 1 of CR0) — Controls the interaction of the WAIT (or FWAIT) instruction with
the TS flag (bit 3 of CR0). If the MP flag is set, a WAIT instruction generates a device-not-available exception
(#NM) if the TS flag is also set. If the MP flag is clear, the WAIT instruction ignores the setting of the TS flag.
Table 11-3 shows the recommended setting of this flag, depending on the IA-32 processor and x87 FPU or
math coprocessor present in the system. Table 2-2 shows the interaction of the MP, EM, and TS flags.
CR0.PE
Protection Enable (bit 0 of CR0) — Enables protected mode when set; enables real-address mode when
clear. This flag does not enable paging directly. It only enables segment-level protection. To enable paging,
both the PE and PG flags must be set.
See also: Section 11.9, “Mode Switching.”
CR3.PCD
Page-level Cache Disable (bit 4 of CR3) — Controls the memory type used to access the first paging
structure of the current paging-structure hierarchy. See Section 5.9, “Paging and Memory Typing.” This bit
is not used if paging is disabled, with PAE paging, or with 4-level paging1 or 5-level paging if CR4.PCIDE=1.
CR3.PWT
Page-level Write-Through (bit 3 of CR3) — Controls the memory type used to access the first paging
structure of the current paging-structure hierarchy. See Section 5.9, “Paging and Memory Typing.” This bit
is not used if paging is disabled, with PAE paging, or with 4-level paging or 5-level paging if CR4.PCIDE=1.
CR3.LAM_U57
User LAM57 enable (bit 61 of CR3) — When set, enables LAM57 (masking of linear-address bits 62:57)
for user pointers and overrides CR3.LAM_U48. See Section 4.4, “Linear-Address Masking.”
CR3.LAM_U48
User LAM48 enable (bit 62 of CR3) — When set and CR3.LAM_U57 is clear, enables LAM48 (masking of
linear-address bits 62:48) for user pointers. See Section 4.4, “Linear-Address Masking.”
CR4.VME
Virtual-8086 Mode Extensions (bit 0 of CR4) — Enables interrupt- and exception-handling extensions
in virtual-8086 mode when set; disables the extensions when clear. Use of the virtual mode extensions can
improve the performance of virtual-8086 applications by eliminating the overhead of calling the virtual-
8086 monitor to handle interrupts and exceptions that occur while executing an 8086 program and,
instead, redirecting the interrupts and exceptions back to the 8086 program’s handlers. It also provides
hardware support for a virtual interrupt flag (VIF) to improve reliability of running 8086 programs in multi-
tasking and multiple-processor environments.
See also: Section 22.3, “Interrupt and Exception Handling in Virtual-8086 Mode.”
CR4.PVI
Protected-Mode Virtual Interrupts (bit 1 of CR4) — Enables hardware support for a virtual interrupt
flag (VIF) in protected mode when set; disables the VIF flag in protected mode when clear.
See also: Section 22.4, “Protected-Mode Virtual Interrupts.”
CR4.TSD
Time Stamp Disable (bit 2 of CR4) — Restricts the execution of the RDTSC instruction to procedures
running at privilege level 0 when set; allows RDTSC instruction to be executed at any privilege level when
clear. This bit also applies to the RDTSCP instruction if supported (if CPUID.80000001H:EDX[27] = 1).
CR4.DE
Debugging Extensions (bit 3 of CR4) — References to debug registers DR4 and DR5 cause an unde-
fined opcode (#UD) exception to be generated when set; when clear, processor aliases references to regis-
ters DR4 and DR5 for compatibility with software written to run on earlier IA-32 processors.

1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.


See also: Section 19.2.2, “Debug Registers DR4 and DR5.”


CR4.PSE
Page Size Extensions (bit 4 of CR4) — Enables 4-MByte pages with 32-bit paging when set; restricts
32-bit paging to pages of 4 KBytes when clear.
See also: Section 5.3, “32-Bit Paging.”
CR4.PAE
Physical Address Extension (bit 5 of CR4) — When set, enables paging to produce physical addresses
with more than 32 bits. When clear, restricts physical addresses to 32 bits. PAE must be set before entering
IA-32e mode.
See also: Chapter 5, “Paging.”
CR4.MCE
Machine-Check Enable (bit 6 of CR4) — Enables the machine-check exception when set; disables the
machine-check exception when clear.
See also: Chapter 17, “Machine-Check Architecture.”
CR4.PGE
Page Global Enable (bit 7 of CR4) — (Introduced in the P6 family processors.) Enables the global page
feature when set; disables the global page feature when clear. The global page feature allows frequently
used or shared pages to be marked as global to all users (done with the global flag, bit 8, in a page-direc-
tory-pointer-table entry, a page-directory entry, or a page-table entry). Global pages are not flushed from
the translation-lookaside buffer (TLB) on a task switch or a write to register CR3.
When enabling the global page feature, paging must be enabled (by setting the PG flag in control register
CR0) before the PGE flag is set. Reversing this sequence may affect program correctness, and processor
performance will be impacted.
See also: Section 5.10, “Caching Translation Information.”
CR4.PCE
Performance-Monitoring Counter Enable (bit 8 of CR4) — Enables execution of the RDPMC instruc-
tion for programs or procedures running at any protection level when set; RDPMC instruction can be
executed only at protection level 0 when clear.
CR4.OSFXSR
Operating System Support for FXSAVE and FXRSTOR instructions (bit 9 of CR4) — When set, this
flag: (1) indicates to software that the operating system supports the use of the FXSAVE and FXRSTOR
instructions, (2) enables the FXSAVE and FXRSTOR instructions to save and restore the contents of the
XMM and MXCSR registers along with the contents of the x87 FPU and MMX registers, and (3) enables the
processor to execute SSE/SSE2/SSE3/SSSE3/SSE4 instructions, with the exception of PAUSE,
PREFETCHh, SFENCE, LFENCE, MFENCE, MOVNTI, CLFLUSH, CRC32, and POPCNT.
If this flag is clear, the FXSAVE and FXRSTOR instructions will save and restore the contents of the x87 FPU
and MMX registers, but they may not save and restore the contents of the XMM and MXCSR registers. Also,
the processor will generate an invalid opcode exception (#UD) if it attempts to execute any
SSE/SSE2/SSE3 instruction, with the exception of PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE,
MOVNTI, CLFLUSH, CRC32, and POPCNT. The operating system or executive must explicitly set this flag.

NOTE
CPUID feature flag FXSR indicates availability of the FXSAVE/FXRSTOR instructions. The OSFXSR
bit provides operating system software with a means of enabling FXSAVE/FXRSTOR to save/restore
the contents of the x87 FPU, XMM, and MXCSR registers. Consequently, the OSFXSR bit indicates that
the operating system provides context-switch support for SSE/SSE2/SSE3/SSSE3/SSE4.

CR4.OSXMMEXCPT
Operating System Support for Unmasked SIMD Floating-Point Exceptions (bit 10 of CR4) —
When set, indicates that the operating system supports the handling of unmasked SIMD floating-point
exceptions through an exception handler that is invoked when a SIMD floating-point exception (#XM) is


generated. SIMD floating-point exceptions are only generated by SSE/SSE2/SSE3/SSE4.1 SIMD floating-
point instructions.
The operating system or executive must explicitly set this flag. If this flag is not set, the processor will
generate an invalid opcode exception (#UD) whenever it detects an unmasked SIMD floating-point excep-
tion.
CR4.UMIP
User-Mode Instruction Prevention (bit 11 of CR4) — When set, the following instructions cannot be
executed if CPL > 0: SGDT, SIDT, SLDT, SMSW, and STR. An attempt at such execution causes a general-
protection exception (#GP).
CR4.LA57
57-bit linear addresses (bit 12 of CR4) — When set in IA-32e mode, the processor uses 5-level paging
to translate 57-bit linear addresses. When clear in IA-32e mode, the processor uses 4-level paging to
translate 48-bit linear addresses. This bit cannot be modified in IA-32e mode.
See also: Chapter 5, “Paging.”
CR4.VMXE
VMX-Enable Bit (bit 13 of CR4) — Enables VMX operation when set. See Chapter 25, “Introduction to
Virtual Machine Extensions.”
CR4.SMXE
SMX-Enable Bit (bit 14 of CR4) — Enables SMX operation when set. See Chapter 7, “Safer Mode Exten-
sions Reference,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2D.
CR4.FSGSBASE
FSGSBASE-Enable Bit (bit 16 of CR4) — Enables the instructions RDFSBASE, RDGSBASE, WRFSBASE,
and WRGSBASE.
CR4.PCIDE
PCID-Enable Bit (bit 17 of CR4) — Enables process-context identifiers (PCIDs) when set. See Section
5.10.1, “Process-Context Identifiers (PCIDs).” Applies only in IA-32e mode (if IA32_EFER.LMA = 1).
CR4.OSXSAVE
XSAVE and Processor Extended States-Enable Bit (bit 18 of CR4) — When set, this flag: (1) indi-
cates (via CPUID.01H:ECX.OSXSAVE[bit 27]) that the operating system supports the use of the XGETBV,
XSAVE, and XRSTOR instructions by general software; (2) enables the XSAVE and XRSTOR instructions to
save and restore the x87 FPU state (including MMX registers), the SSE state (XMM registers and MXCSR),
along with other processor extended states enabled in XCR0; (3) enables the processor to execute XGETBV
and XSETBV instructions in order to read and write XCR0. See Section 2.6 and Chapter 15, “System
Programming for Instruction Set Extensions and Processor Extended States.”
CR4.KL
Key-Locker-Enable Bit (bit 19 of CR4) — When set, the LOADIWKEY instruction is enabled; in addition,
if support for the AES Key Locker instructions has been activated by system firmware,
CPUID.19H:EBX.AESKLE[bit 0] is enumerated as 1 and the AES Key Locker instructions are enabled.1
When clear, CPUID.19H:EBX.AESKLE[bit 0] is enumerated as 0 and execution of any Key Locker instruction
causes an invalid-opcode exception (#UD).
CR4.SMEP
SMEP-Enable Bit (bit 20 of CR4) — Enables supervisor-mode execution prevention (SMEP) when set.
See Section 5.6, “Access Rights.”
CR4.SMAP
SMAP-Enable Bit (bit 21 of CR4) — Enables supervisor-mode access prevention (SMAP) when set. See
Section 5.6, “Access Rights.”
CR4.PKE
Enable protection keys for user-mode pages (bit 22 of CR4) — 4-level paging and 5-level paging

1. Software can check CPUID.19H:EBX.AESKLE[bit 0] after setting CR4.KL to determine whether the AES Key Locker instructions have
been enabled. Note that some processors may allow enabling of those instructions without activation by system firmware. Some
processors may not support use of the AES Key Locker instructions in system-management mode (SMM). Those processors enumer-
ate CPUID.19H:EBX.AESKLE[bit 0] as 0 in SMM regardless of the setting of CR4.KL.


associate each user-mode linear address with a protection key. When set, this flag indicates (via
CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]) that the operating system supports use of the PKRU
register to specify, for each protection key, whether user-mode linear addresses with that protection key
can be read or written. This bit also enables access to the PKRU register using the RDPKRU and WRPKRU
instructions.
CR4.CET
Control-flow Enforcement Technology (bit 23 of CR4) — Enables control-flow enforcement tech-
nology when set. See Chapter 18, “Control-flow Enforcement Technology (CET)‚” of the IA-32 Intel® Archi-
tecture Software Developer’s Manual, Volume 1. This flag can be set only if CR0.WP is set, and it must be
clear before CR0.WP can be cleared (see below).
CR4.PKS
Enable protection keys for supervisor-mode pages (bit 24 of CR4) — 4-level paging and 5-level
paging associate each supervisor-mode linear address with a protection key. When set, this flag allows use
of the IA32_PKRS MSR to specify, for each protection key, whether supervisor-mode linear addresses with
that protection key can be read or written.
CR4.UINTR
User Interrupts Enable Bit (bit 25 of CR4) — Enables user interrupts when set, including user-interrupt
delivery, user-interrupt notification identification, and the user-interrupt instructions.
CR4.LAM_SUP
Supervisor LAM enable (bit 28 of CR4) — When set, enables LAM (linear-address masking) for super-
visor pointers. See Section 4.4, “Linear-Address Masking.”
CR8.TPL
Task Priority Level (bits 3:0 of CR8) — This field sets the threshold value corresponding to the highest-
priority interrupt to be blocked. A value of 0 means all interrupts are enabled; a value of 15 means all
interrupts are disabled. This field is available only in 64-bit mode.

2.5.1 CPUID Qualification of Control Register Flags


Not all flags in control register CR4 are implemented on all processors. With the exception of the PCE flag, they can
be qualified with the CPUID instruction to determine if they are implemented on the processor before they are
used.
The CR8 register is available on processors that support Intel 64 architecture.
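
For example, before setting CR4.SMEP or CR4.UMIP, an OS can confirm the corresponding CPUID feature bits; a
sketch using the GCC/Clang <cpuid.h> helper follows.

#include <stdbool.h>
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */

/* Check CPUID.(EAX=07H,ECX=0H):EBX.SMEP[bit 7] before setting CR4.SMEP. */
static bool smep_supported(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 7) & 1;
}

/* Check CPUID.(EAX=07H,ECX=0H):ECX.UMIP[bit 2] before setting CR4.UMIP. */
static bool umip_supported(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 2) & 1;
}
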

2.6 EXTENDED CONTROL REGISTERS (INCLUDING XCR0)


If CPUID.01H:ECX.XSAVE[bit 26] is 1, the processor supports one or more extended control registers (XCRs).
Currently, the only such register defined is XCR0. This register specifies the set of processor state components for
which the operating system provides context management, e.g., x87 FPU state, SSE state, AVX state. The OS
programs XCR0 to reflect the features for which it provides context management.


[Figure: XCR0 bit assignments: bit 0: x87 FPU/MMX state (must be 1); bit 1: SSE state; bit 2: AVX state; bit 3:
BNDREG state; bit 4: BNDCSR state; bit 5: opmask state; bit 6: ZMM_Hi256 state; bit 7: Hi16_ZMM state; bit 9:
PKRU state; bit 17: TILECONFIG state; bit 18: TILEDATA state; remaining bits reserved (must be 0), with bit 63
reserved for XCR0 bit-vector expansion.]

Figure 2-8. XCR0


Software can access XCR0 only if CR4.OSXSAVE[bit 18] = 1. (This bit is also readable as
CPUID.01H:ECX.OSXSAVE[bit 27].) Software can use CPUID leaf function 0DH to enumerate the bits in XCR0 that
the processor supports (see CPUID instruction in Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 2A). Each supported state component is represented by a bit in XCR0. System software enables state
components by loading an appropriate bit mask value into XCR0 using the XSETBV instruction.
As each bit in XCR0 (except bit 63) corresponds to a processor state component, XCR0 thus provides support for
up to 63 sets of processor state components. Bit 63 of XCR0 is reserved for future expansion and will not represent
a processor state component.
Currently, XCR0 defines support for the following state components:
• XCR0.X87 (bit 0): This bit must be 1. An attempt to write 0 to this bit causes a #GP exception.
• XCR0.SSE (bit 1): If 1, the XSAVE feature set can be used to manage MXCSR and the XMM registers (XMM0-
XMM15 in 64-bit mode; otherwise XMM0-XMM7).
• XCR0.AVX (bit 2): If 1, Intel AVX instructions can be executed and the XSAVE feature set can be used to
manage the upper halves of the YMM registers (YMM0-YMM15 in 64-bit mode; otherwise YMM0-YMM7).
• XCR0.BNDREG (bit 3): If 1, Intel MPX instructions can be executed and the XSAVE feature set can be used to
manage the bounds registers BND0–BND3.
• XCR0.BNDCSR (bit 4): If 1, Intel MPX instructions can be executed and the XSAVE feature set can be used to
manage the BNDCFGU and BNDSTATUS registers.
• XCR0.opmask (bit 5): If 1, Intel AVX-512 instructions can be executed and the XSAVE feature set can be used
to manage the opmask registers k0–k7.
• XCR0.ZMM_Hi256 (bit 6): If 1, Intel AVX-512 instructions can be executed and the XSAVE feature set can be
used to manage the upper halves of the lower ZMM registers (ZMM0-ZMM15 in 64-bit mode; otherwise ZMM0-
ZMM7).
• XCR0.Hi16_ZMM (bit 7): If 1, Intel AVX-512 instructions can be executed and the XSAVE feature set can be
used to manage the upper ZMM registers (ZMM16-ZMM31, only in 64-bit mode).
• XCR0.PKRU (bit 9): If 1, the XSAVE feature set can be used to manage the PKRU register (see Section 2.7).
• XCR0.TILECFG (bit 17): If 1, and if XCR0.TILEDATA is also 1, Intel AMX instructions can be executed and the
XSAVE feature set can be used to manage TILECFG.


• XCR0.TILEDATA (bit 18): If 1, and if XCR0.TILECFG is also 1, Intel AMX instructions can be executed and the
XSAVE feature set can be used to manage TILEDATA.
An attempt to use XSETBV to write to XCR0 results in general-protection exceptions (#GP) if it would do any of the
following:
• Set a bit reserved in XCR0 for a given processor (as determined by the contents of EAX and EDX after executing
CPUID with EAX=0DH, ECX= 0H).
• Clear XCR0.x87.
• Clear XCR0.SSE and set XCR0.AVX.
• Clear XCR0.AVX and set any of XCR0.opmask, XCR0.ZMM_Hi256, or XCR0.Hi16_ZMM.
• Set either XCR0.BNDREG or XCR0.BNDCSR while not setting the other.
• Set any of XCR0.opmask, XCR0.ZMM_Hi256, and XCR0.Hi16_ZMM while not setting all of them.
• Set either XCR0.TILECFG or XCR0.TILEDATA while not setting the other.
After reset, all bits (except bit 0) in XCR0 are cleared to zero; XCR0[0] is set to 1.
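
A sketch of the canonical ring-0 enabling sequence follows: enumerate the supported XCR0 bits with CPUID leaf
0DH, set CR4.OSXSAVE, then program XCR0 with XSETBV (GCC inline assembly assumed; the function names are
illustrative).

#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */

static inline void xsetbv(uint32_t xcr, uint64_t value)   /* CPL 0 only */
{
    asm volatile("xsetbv" : : "c"(xcr), "a"((uint32_t)value),
                              "d"((uint32_t)(value >> 32)));
}

void enable_xsave_states(void)      /* hypothetical ring-0 init routine */
{
    unsigned int eax, ebx, ecx, edx;
    uint64_t cr4, supported, want;

    /* CPUID.(EAX=0DH,ECX=0): EDX:EAX enumerate the XCR0 bits the
     * processor supports. */
    __get_cpuid_count(0x0D, 0, &eax, &ebx, &ecx, &edx);
    supported = ((uint64_t)edx << 32) | eax;

    /* Set CR4.OSXSAVE (bit 18) to unlock XGETBV/XSETBV. */
    asm volatile("mov %%cr4, %0" : "=r"(cr4));
    asm volatile("mov %0, %%cr4" : : "r"(cr4 | (1ULL << 18)));

    /* Enable x87 (bit 0, mandatory), SSE (bit 1), and AVX (bit 2) if
     * supported; setting an unsupported bit, clearing bit 0, or setting
     * AVX without SSE would raise #GP(0). */
    want = supported & 0x7;
    xsetbv(0, want);
}
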

2.7 PROTECTION-KEY RIGHTS REGISTERS (PKRU AND IA32_PKRS)


Processors may support either or both of two protection-key rights registers: PKRU for user-mode pages and the
IA32_PKRS MSR (MSR index 6E1H) for supervisor-mode pages. 4-level paging and 5-level paging associate a 4-bit
protection key with each page. The protection-key rights registers determine accessibility based on a page’s
protection key.
If CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3] = 1, the processor supports the protection-key feature for user-
mode pages. When CR4.PKE = 1, software can use the protection-key rights register for user pages (PKRU)
to specify the access rights for user-mode pages for each protection key.
If CPUID.(EAX=07H,ECX=0H):ECX.PKS [bit 31] = 1, the processor supports the protection-key feature for super-
visor-mode pages. When CR4.PKS = 1, software can use the protection-key rights register for supervisor
pages (the IA32_PKRS MSR) to specify the access rights for supervisor-mode pages for each protection key.

[Figure: for each protection key i (15 down to 0), bit 2i+1 is WDi (write disable) and bit 2i is ADi (access
disable), filling bit positions 31:0.]

Figure 2-9. Format of Protection-Key Rights Registers


The format of each protection-key rights register is given in Figure 2-9. Each contains 16 pairs of disable controls
to prevent data accesses to linear addresses (user-mode or supervisor-mode, depending on the register) based on
their protection keys. Each protection key i (0 ≤ i ≤ 15) is associated with two bits in each protection-key rights
register:
• Bit 2i, shown as “ADi” (access disable): if set, the processor prevents any data accesses to linear addresses
(user-mode or supervisor-mode, depending on the register) with protection key i.
• Bit 2i+1, shown as “WDi” (write disable): if set, the processor prevents write accesses to linear addresses
(user-mode or supervisor-mode, depending on the register) with protection key i.
(Bits 63:32 of the IA32_PKRS MSR are reserved and must be zero.)
See Section 5.6.2, “Protection Keys,” for details of how the processor uses the protection-key rights registers to
control accesses to linear addresses.
Software can read and write PKRU using the RDPKRU and WRPKRU instructions. The IA32_PKRS MSR can be read
and written with the RDMSR and WRMSR instructions. Writes to the IA32_PKRS MSR using WRMSR are not serial-
izing.
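
As an illustration, the following user-mode sketch (meaningful only when CR4.PKE = 1; the intrinsics require
compiling with -mpku on GCC/Clang) sets the disable bits for a given key. The function names are illustrative.

#include <stdint.h>
#include <immintrin.h>  /* _rdpkru_u32/_wrpkru; compile with -mpku */

/* Make pages tagged with 'key' read-only for user accesses:
 * set WDi (bit 2i+1), leaving other keys' rights unchanged. */
static void pkey_deny_write(unsigned int key)
{
    uint32_t pkru = _rdpkru_u32();
    _wrpkru(pkru | (1u << (2 * key + 1)));
}

/* Deny all data accesses to pages tagged with 'key': set ADi (bit 2i). */
static void pkey_deny_access(unsigned int key)
{
    _wrpkru(_rdpkru_u32() | (1u << (2 * key)));
}
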


2.8 SYSTEM INSTRUCTION SUMMARY


System instructions handle system-level functions such as loading system registers, managing the cache,
managing interrupts, or setting up the debug registers. Many of these instructions can be executed only by oper-
ating-system or executive procedures (that is, procedures running at privilege level 0). Others can be executed at
any privilege level and are thus available to application programs.
Table 2-3 lists the system instructions and indicates whether they are available and useful for application
programs. These instructions are described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volumes 2A, 2B, 2C, & 2D.

Table 2-3. Summary of System Instructions

  Instruction    Description                                     Useful to       Protected from
                                                                 Application?    Application?
  LLDT           Load LDT Register                               No              Yes
  SLDT           Store LDT Register                              No              If CR4.UMIP = 1
  LGDT           Load GDT Register                               No              Yes
  SGDT           Store GDT Register                              No              If CR4.UMIP = 1
  LTR            Load Task Register                              No              Yes
  STR            Store Task Register                             No              If CR4.UMIP = 1
  LIDT           Load IDT Register                               No              Yes
  SIDT           Store IDT Register                              No              If CR4.UMIP = 1
  MOV CRn        Load and store control registers                No              Yes
  SMSW           Store MSW                                       Yes             If CR4.UMIP = 1
  LMSW           Load MSW                                        No              Yes
  CLTS           Clear TS flag in CR0                            No              Yes
  ARPL           Adjust RPL                                      Yes (1, 5)      No
  LAR            Load Access Rights                              Yes             No
  LSL            Load Segment Limit                              Yes             No
  VERR           Verify for Reading                              Yes             No
  VERW           Verify for Writing                              Yes             No
  MOV DRn        Load and store debug registers                  No              Yes
  INVD           Invalidate cache, no writeback                  No              Yes
  WBINVD         Invalidate cache, with writeback                No              Yes
  INVLPG         Invalidate TLB entry                            No              Yes
  HLT            Halt Processor                                  No              Yes
  LOCK (Prefix)  Bus Lock                                        Yes             No
  RSM            Return from system management mode              No              Yes
  RDMSR (3)      Read Model-Specific Registers                   No              Yes
  WRMSR (3)      Write Model-Specific Registers                  No              Yes
  RDPMC (4)      Read Performance-Monitoring Counter             Yes             Yes (2)
  RDTSC (3)      Read Time-Stamp Counter                         Yes             Yes (2)
  RDTSCP (7)     Read Serialized Time-Stamp Counter              Yes             Yes (2)
  XGETBV         Return the state of XCR0                        Yes             No
  XSETBV         Enable one or more processor extended states    No (6)          Yes

Parenthesized numbers refer to the notes below.


NOTES:
1. Useful to application programs running at a CPL of 1 or 2.
2. The TSD and PCE flags in control register CR4 control access to these instructions by application programs running at a CPL of 3.
3. These instructions were introduced into the IA-32 Architecture with the Pentium processor.
4. This instruction was introduced into the IA-32 Architecture with the Pentium Pro processor and the Pentium processor with MMX technol-
ogy.
5. This instruction is not supported in 64-bit mode.
6. Application uses XGETBV to query which set of processor extended states are enabled.
7. RDTSCP was introduced with the Intel Core i7 processor.

2.8.1 Loading and Storing System Registers


The GDTR, LDTR, IDTR, and TR registers each have a load and store instruction for loading data into and storing
data from the register:
• LGDT (Load GDTR Register) — Loads the GDT base address and limit from memory into the GDTR register.
• SGDT (Store GDTR Register) — Stores the GDT base address and limit from the GDTR register into memory.
• LIDT (Load IDTR Register) — Loads the IDT base address and limit from memory into the IDTR register.
• SIDT (Store IDTR Register) — Stores the IDT base address and limit from the IDTR register into memory.
• LLDT (Load LDTR Register) — Loads the LDT segment selector and segment descriptor from memory into
the LDTR. (The segment selector operand can also be located in a general-purpose register.)
• SLDT (Store LDTR Register) — Stores the LDT segment selector from the LDTR register into memory or a
general-purpose register.
• LTR (Load Task Register) — Loads segment selector and segment descriptor for a TSS from memory into the
task register. (The segment selector operand can also be located in a general-purpose register.)
• STR (Store Task Register) — Stores the segment selector for the current task TSS from the task register into
memory or a general-purpose register.
The LMSW (load machine status word) and SMSW (store machine status word) instructions operate on bits 0
through 15 of control register CR0. These instructions are provided for compatibility with the 16-bit Intel 286
processor. Programs written to run on 32-bit IA-32 processors should not use these instructions. Instead, they
should access the control register CR0 using the MOV CR instruction.
The CLTS (clear TS flag in CR0) instruction is provided for use in handling a device-not-available exception (#NM)
that occurs when the processor attempts to execute a floating-point instruction when the TS flag is set. This
instruction allows the TS flag to be cleared after the x87 FPU context has been saved, preventing further #NM
exceptions. See Section 2.5, “Control Registers,” for more information on the TS flag.
The control registers (CR0, CR1, CR2, CR3, CR4, and CR8) are loaded using the MOV instruction. The instruction
loads a control register from a general-purpose register or stores the content of a control register in a general-
purpose register.

2.8.2 Verifying of Access Privileges


The processor provides several instructions for examining segment selectors and segment descriptors to determine
if access to their associated segments is allowed. These instructions duplicate some of the automatic access rights
and type checking done by the processor, thus allowing operating-system or executive software to prevent excep-
tions from being generated.
The ARPL (adjust RPL) instruction adjusts the RPL (requestor privilege level) of a segment selector to match that of
the program or procedure that supplied the segment selector. See Section 6.10.4, “Checking Caller Access Privi-
leges (ARPL Instruction),” for a detailed explanation of the function and use of this instruction. Note that ARPL is
not supported in 64-bit mode.


The LAR (load access rights) instruction verifies the accessibility of a specified segment and loads access rights
information from the segment’s segment descriptor into a general-purpose register. Software can then examine
the access rights to determine if the segment type is compatible with its intended use. See Section 6.10.1,
“Checking Access Rights (LAR Instruction),” for a detailed explanation of the function and use of this instruction.
The LSL (load segment limit) instruction verifies the accessibility of a specified segment and loads the segment
limit from the segment’s segment descriptor into a general-purpose register. Software can then compare the
segment limit with an offset into the segment to determine whether the offset lies within the segment. See Section
6.10.3, “Checking That the Pointer Offset Is Within Limits (LSL Instruction),” for a detailed explanation of the func-
tion and use of this instruction.
The VERR (verify for reading) and VERW (verify for writing) instructions verify if a selected segment is readable or
writable, respectively, at a given CPL. See Section 6.10.2, “Checking Read/Write Rights (VERR and VERW Instruc-
tions),” for a detailed explanation of the function and use of these instructions.
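
For illustration, a sketch wrapping LSL follows (GCC inline assembly with flag outputs assumed): LSL sets ZF when
the selector is valid and accessible, so software can probe a selector's limit without risking an exception.

#include <stdbool.h>
#include <stdint.h>

/* Load the segment limit for 'selector' into *limit without risking a
 * fault. LSL sets ZF on success and clears it if the selector is
 * invalid or not accessible at the current CPL/RPL. */
static bool load_segment_limit(uint16_t selector, uint64_t *limit)
{
    uint64_t lim;
    bool ok;
    asm("lsl %[sel], %[lim]"
        : [lim] "=r"(lim), "=@ccz"(ok)
        : [sel] "r"((uint64_t)selector));
    if (ok)
        *limit = lim;
    return ok;
}
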

2.8.3 Loading and Storing Debug Registers


Internal debugging facilities in the processor are controlled by a set of 8 debug registers (DR0-DR7). The MOV
instruction allows setup data to be loaded to and stored from these registers.
On processors that support Intel 64 architecture, debug registers DR0-DR7 are 64 bits. In 32-bit modes and
compatibility mode, writes to a debug register fill the upper 32 bits with zeros. Reads return the lower 32 bits. In
64-bit mode, the upper 32 bits of DR6-DR7 are reserved and must be written with zeros. Writing one to any of the
upper 32 bits causes an exception, #GP(0).
In 64-bit mode, MOV DRn instructions read or write all 64 bits of a debug register (operand-size prefixes are
ignored). All 64 bits of DR0-DR3 are writable by software. However, MOV DRn instructions do not check that
addresses written to DR0-DR3 are in the limits of the implementation. Address matching is supported only on valid
addresses generated by the processor implementation.

2.8.4 Invalidating Caches and TLBs


The processor provides several instructions for use in explicitly invalidating its caches and TLB entries. The INVD
(invalidate cache with no writeback) instruction invalidates all data and instruction entries in the internal caches
and sends a signal to the external caches indicating that they should also be invalidated.
The WBINVD (invalidate cache with writeback) instruction performs the same function as the INVD instruction,
except that it writes back modified lines in its internal caches to memory before it invalidates the caches. After
invalidating the caches local to the executing logical processor or processor core, WBINVD signals caches higher in
the cache hierarchy (caches shared with the invalidating logical processor or core) to write back any data they have
in modified state at the time of instruction execution and to invalidate their contents.
Note that non-shared caches might be neither written back nor invalidated. In Figure 2-10 below, if code executing
on either LP0 or LP1 were to execute WBINVD, the L1 and L2 caches shared by LP0/LP1 would be written back and
invalidated, as would the shared L3 cache. However, the L1 and L2 caches not shared with LP0 and LP1 would be
neither written back nor invalidated.


[Figure: eight logical processors (LP0-LP7) in one package. A WBINVD executed on LP0 or LP1 writes back and
invalidates the L1 and L2 caches shared by LP0/LP1 and the L3 cache in the uncore shared by all logical
processors; the L1 and L2 caches belonging to LP2-LP7 are not written back and not invalidated.]

Figure 2-10. WBINVD Invalidation of Shared and Non-Shared Cache Hierarchy

The INVLPG (invalidate TLB entry) instruction invalidates (flushes) the TLB entry for a specified page.
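
A typical ring-0 helper wrapping INVLPG might look like the following sketch (GCC inline assembly assumed).

/* Invalidate any TLB entry for the page containing the linear
 * address 'p'. Must execute at CPL 0. */
static inline void invlpg(void *p)
{
    asm volatile("invlpg (%0)" : : "r"(p) : "memory");
}
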

2.8.5 Controlling the Processor

The HLT (halt processor) instruction stops the processor until an enabled interrupt (such as NMI or SMI, which are
normally enabled), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal is received. The
processor generates a special bus cycle to indicate that the halt mode has been entered.
Hardware may respond to this signal in a number of ways. An indicator light on the front panel may be turned on.
An NMI interrupt for recording diagnostic information may be generated. Reset initialization may be invoked (note
that the BINIT# pin was introduced with the Pentium Pro processor). If any non-wake events are pending during
shutdown, they will be handled after the wake event from shutdown is processed (for example, A20M# interrupts).
The LOCK prefix invokes a locked (atomic) read-modify-write operation when modifying a memory operand. This
mechanism is used to allow reliable communications between processors in multiprocessor systems, as described
below:
• In the Pentium processor and earlier IA-32 processors, the LOCK prefix causes the processor to assert the
LOCK# signal during the instruction. This always causes an explicit bus lock to occur.
• In the Pentium 4, Intel Xeon, and P6 family processors, the locking operation is handled with either a cache lock
or bus lock. If a memory access is cacheable and affects only a single cache line, a cache lock is invoked and
the system bus and the actual memory location in system memory are not locked during the operation. Here,
other Pentium 4, Intel Xeon, or P6 family processors on the bus write-back any modified data and invalidate
their caches as necessary to maintain system memory coherency. If the memory access is not cacheable
and/or it crosses a cache line boundary, the processor’s LOCK# signal is asserted and the processor does not
respond to requests for bus control during the locked operation.
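
For example, a locked read-modify-write can be coded directly with the LOCK prefix, or through compiler atomic
builtins that emit it; a sketch of both forms follows (GCC inline assembly and builtins assumed).

#include <stdint.h>

/* Atomically add 'v' to *p and return the prior value (LOCK XADD). */
static inline uint32_t atomic_fetch_add_u32(volatile uint32_t *p, uint32_t v)
{
    asm volatile("lock xaddl %0, %1"
                 : "+r"(v), "+m"(*p)
                 :
                 : "memory", "cc");
    return v;   /* XADD left the old value of *p in the register */
}

/* The GCC/Clang builtin emits the same LOCK-prefixed instruction. */
static inline uint32_t atomic_fetch_add_builtin(volatile uint32_t *p, uint32_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
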
The RSM (return from SMM) instruction restores the processor (from a context dump) to the state it was in prior to
a system management mode (SMM) interrupt.

2.8.6 Reading Performance-Monitoring and Time-Stamp Counters


The RDPMC (read performance-monitoring counter) and RDTSC (read time-stamp counter) instructions allow
application programs to read the processor’s performance-monitoring and time-stamp counters, respectively.
Processors based on Intel NetBurst® microarchitecture have eighteen 40-bit performance-monitoring counters; P6
family processors have two 40-bit counters. Intel Atom® processors and most of the processors based on the Intel
Core microarchitecture support two types of performance monitoring counters: programmable performance coun-
ters similar to those available in the P6 family, and three fixed-function performance monitoring counters. Details


of programmable and fixed-function performance monitoring counters for each processor generation are described
in Chapter 20, “Last Branch Records.”
The programmable performance counters can support counting either the occurrence or duration of events. Events
that can be monitored on programmable counters generally are model specific (except for architectural perfor-
mance events enumerated by CPUID leaf 0AH); they may include the number of instructions decoded, interrupts
received, or the number of cache loads. Individual counters can be set up to monitor different events. Use the
system instruction WRMSR to set up values in one of the IA32_PERFEVTSELx MSR, in one of the 45 ESCRs and one
of the 18 CCCR MSRs (for Pentium 4 and Intel Xeon processors); or in the PerfEvtSel0 or the PerfEvtSel1 MSR (for
the P6 family processors). The RDPMC instruction loads the current count from the selected counter into the
EDX:EAX registers.
Fixed-function performance counters record only specific events that are defined at: https://perfmon-
events.intel.com/, and the width/number of fixed-function counters are enumerated by CPUID leaf 0AH.
The time-stamp counter is a model-specific 64-bit counter that is reset to zero each time the processor is reset. If
not reset, the counter will increment approximately 9.5 × 10^16 times per year when the processor is operating at
a clock rate of 3 GHz. At this clock frequency, it would take over 190 years for the counter to wrap around. The
RDTSC instruction loads the current count of the time-stamp counter into the EDX:EAX registers.
See Section 21.1, “Performance Monitoring Overview,” and Section 19.17, “Time-Stamp Counter,” for more infor-
mation about the performance monitoring and time-stamp counters.
The RDTSC instruction was introduced into the IA-32 architecture with the Pentium processor. The RDPMC instruc-
tion was introduced into the IA-32 architecture with the Pentium Pro processor and the Pentium processor with
MMX technology. Earlier Pentium processors have two performance-monitoring counters, but they can be read only
with the RDMSR instruction, and only at privilege level 0.

2.8.6.1 Reading Counters in 64-Bit Mode


In 64-bit mode, RDTSC operates the same as in protected mode. The count in the time-stamp counter is stored in
EDX:EAX (or RDX[31:0]:RAX[31:0] with RDX[63:32]:RAX[63:32] cleared).
RDPMC requires an index to specify the offset of the performance-monitoring counter. In 64-bit mode for Pentium
4 or Intel Xeon processor families, the index is specified in ECX[30:0]. The current count of the performance-moni-
toring counter is stored in EDX:EAX (or RDX[31:0]:RAX[31:0] with RDX[63:32]:RAX[63:32] cleared).
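
A common sketch for reading the counter in 64-bit mode and combining the two 32-bit halves (GCC inline assembly
assumed):

#include <stdint.h>

/* Read the time-stamp counter in 64-bit mode. RDTSC leaves the low
 * 32 bits in EAX and the high 32 bits in EDX, clearing the upper
 * halves of RAX and RDX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
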

2.8.7 Reading and Writing Model-Specific Registers


The RDMSR (read model-specific register) and WRMSR (write model-specific register) instructions allow a
processor’s 64-bit model-specific registers (MSRs) to be read and written, respectively. The MSR to be read or
written is specified by the value in the ECX register.

RDMSR reads the value from the specified MSR to the EDX:EAX registers; WRMSR writes the value in the EDX:EAX
registers to the specified MSR. RDMSR and WRMSR were introduced into the IA-32 architecture with the Pentium
processor.
See Section 11.4, “Model-Specific Registers (MSRs),” for more information.

2.8.7.1 Reading and Writing Model-Specific Registers in 64-Bit Mode


RDMSR and WRMSR require an index to specify the address of an MSR. In 64-bit mode, the index is 32 bits; it is
specified using ECX.
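
Typical ring-0 helpers are sketched below (GCC inline assembly assumed): the MSR index goes in ECX, and the
64-bit value is split across EDX:EAX.

#include <stdint.h>

/* Read the 64-bit MSR selected by ECX; the result arrives in EDX:EAX. */
static inline uint64_t rdmsr(uint32_t msr)              /* CPL 0 only */
{
    uint32_t lo, hi;
    asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* Write EDX:EAX to the MSR selected by ECX. */
static inline void wrmsr(uint32_t msr, uint64_t value)  /* CPL 0 only */
{
    asm volatile("wrmsr" : : "c"(msr),
                 "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}
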

2.8.8 Enabling Processor Extended States


The XSETBV instruction is required to enable OS support of individual processor extended states in XCR0 (see
Section 2.6).


12. Updates to Chapter 3, Volume 3A
Change bars and violet text show changes to Chapter 3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the chapter with information on the newly added linear-address pre-processing and references to the
new Chapter 4 where needed.



CHAPTER 3
PROTECTED-MODE MEMORY MANAGEMENT

This chapter describes the Intel 64 and IA-32 architecture’s protected-mode memory management facilities,
including the physical memory requirements, segmentation mechanism, and paging mechanism.
See also: Chapter 6, “Protection‚” (for a description of the processor’s protection mechanism) and Chapter 22,
“8086 Emulation‚” (for a description of memory addressing protection in real-address and virtual-8086 modes).

3.1 MEMORY MANAGEMENT OVERVIEW


The memory management facilities of the IA-32 architecture are divided into two parts: segmentation and paging.
Segmentation provides a mechanism of isolating individual code, data, and stack modules so that multiple
programs (or tasks) can run on the same processor without interfering with one another. Paging provides a mech-
anism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s
execution environment are mapped into physical memory as needed. Paging can also be used to provide isolation
between multiple tasks. When operating in protected mode, some form of segmentation must be used. There is
no mode bit to disable segmentation. The use of paging, however, is optional.
These two mechanisms (segmentation and paging) can be configured to support simple single-program (or single-
task) systems, multitasking systems, or multiple-processor systems that use shared memory.
As shown in Figure 3-1, segmentation provides a mechanism for dividing the processor’s addressable memory
space (called the linear address space) into smaller protected address spaces called segments. Segments can
be used to hold the code, data, and stack for a program or to hold system data structures (such as a TSS or LDT).
If more than one program (or task) is running on a processor, each program can be assigned its own set of
segments. The processor then enforces the boundaries between these segments and ensures that one program
does not interfere with the execution of another program by writing into the other program’s segments. The
segmentation mechanism also allows typing of segments so that the operations that may be performed on a partic-
ular type of segment can be restricted.
All the segments in a system are contained in the processor’s linear address space. To locate a byte in a particular
segment, a logical address (also called a far pointer) must be provided. A logical address consists of a segment
selector and an offset. The segment selector is a unique identifier for a segment. Among other things it provides an
offset into a descriptor table (such as the global descriptor table, GDT) to a data structure called a segment
descriptor. Each segment has a segment descriptor, which specifies the size of the segment, the access rights and
privilege level for the segment, the segment type, and the location of the first byte of the segment in the linear
address space (called the base address of the segment). The offset part of the logical address is added to the base
address for the segment to locate a byte within the segment. The base address plus the offset thus forms a linear
address in the processor’s linear address space.


[Diagram: a logical address (or far pointer), consisting of a segment selector and an offset, selects a segment
descriptor in the global descriptor table (GDT); the descriptor's base address plus the offset yields a linear
address; paging then translates the linear address (Dir, Table, Offset fields) through a page directory and
page table to a physical address.]

Figure 3-1. Segmentation and Paging

If paging is not used, the linear address space of the processor is mapped directly into the physical address space
of processor. The physical address space is defined as the range of addresses that the processor can generate on
its address bus.
Because multitasking computing systems commonly define a linear address space much larger than it is economi-
cally feasible to contain all at once in physical memory, some method of “virtualizing” the linear address space is
needed. This virtualization of the linear address space is handled through the processor’s paging mechanism.
Paging supports a “virtual memory” environment where a large linear address space is simulated with a small
amount of physical memory (RAM and ROM) and some disk storage. When using paging, each segment is divided
into pages (typically 4 KBytes each in size), which are stored either in physical memory or on the disk. The oper-
ating system or executive maintains a page directory and a set of page tables to keep track of the pages. When a
program (or task) attempts to access an address location in the linear address space, the processor uses the page
directory and page tables to translate the linear address into a physical address and then performs the requested
operation (read or write) on the memory location.
If the page being accessed is not currently in physical memory, the processor interrupts execution of the program
(by generating a page-fault exception). The operating system or executive then reads the page into physical
memory from the disk and continues executing the program.
When paging is implemented properly in the operating-system or executive, the swapping of pages between phys-
ical memory and the disk is transparent to the correct execution of a program. Even programs written for 16-bit IA-
32 processors can be paged (transparently) when they are run in virtual-8086 mode.

3.2 USING SEGMENTS


The segmentation mechanism supported by the IA-32 architecture can be used to implement a wide variety of
system designs. These designs range from flat models that make only minimal use of segmentation to protect


programs to multi-segmented models that employ segmentation to create a robust operating environment in
which multiple programs and tasks can be executed reliably.
The following sections give several examples of how segmentation can be employed in a system to improve
memory management performance and reliability.

3.2.1 Basic Flat Model


The simplest memory model for a system is the basic “flat model,” in which the operating system and application
programs have access to a continuous, unsegmented address space. To the greatest extent possible, this basic flat
model hides the segmentation mechanism of the architecture from both the system designer and the application
programmer.
To implement a basic flat memory model with the IA-32 architecture, at least two segment descriptors must be
created, one for referencing a code segment and one for referencing a data segment (see Figure 3-2). Both of
these segments, however, are mapped to the entire linear address space: that is, both segment descriptors have
the same base address value of 0 and the same segment limit of 4 GBytes. By setting the segment limit to 4
GBytes, the segmentation mechanism is kept from generating exceptions for out of limit memory references, even
if no physical memory resides at a particular address. ROM (EPROM) is generally located at the top of the physical
address space, because the processor begins execution at FFFF_FFF0H. RAM (DRAM) is placed at the bottom of the
address space because the initial base address for the DS data segment after reset initialization is 0.

3.2.2 Protected Flat Model


The protected flat model is similar to the basic flat model, except the segment limits are set to include only the
range of addresses for which physical memory actually exists (see Figure 3-3). A general-protection exception
(#GP) is then generated on any attempt to access nonexistent memory. This model provides a minimum level of
hardware protection against some kinds of program bugs.

[Diagram: all six segment registers (CS, SS, DS, ES, FS, GS) reference code- and data-segment descriptors
with base address 0 and a limit covering the entire linear address space (0 to FFFFFFFFH), including ranges
where no physical memory is present; code resides at the top of the address space, data and stack at the
bottom.]

Figure 3-2. Flat Model


[Diagram: as in the flat model, the segment registers share code- and data-segment descriptors (access
limit and base address), but each descriptor's limit covers only the addresses backed by physical memory:
code and memory-mapped I/O near the top of the address space (FFFFFFFFH), data and stack at the bottom (0),
with a not-present region in between.]

Figure 3-3. Protected Flat Model

More complexity can be added to this protected flat model to provide more protection. For example, for the paging
mechanism to provide isolation between user and supervisor code and data, four segments need to be defined:
code and data segments at privilege level 3 for the user, and code and data segments at privilege level 0 for the
supervisor. Usually these segments all overlay each other and start at address 0 in the linear address space. This
flat segmentation model along with a simple paging structure can protect the operating system from applications,
and by adding a separate paging structure for each task or process, it can also protect applications from each other.
Similar designs are used by several popular multitasking operating systems.

3.2.3 Multi-Segment Model


A multi-segment model (such as the one shown in Figure 3-4) uses the full capabilities of the segmentation mech-
anism to provide hardware enforced protection of code, data structures, and programs and tasks. Here, each
program (or task) is given its own table of segment descriptors and its own segments. The segments can be
completely private to their assigned programs or shared among programs. Access to all segments and to the
execution environments of individual programs running on the system is controlled by hardware.


[Diagram: each segment register (CS, SS, DS, ES, FS, GS) holds a selector for its own segment descriptor
(access limit and base address), mapping separate stack, code, and data segments in the linear address
space; additional descriptors define further data segments not currently loaded in any segment register.]

Figure 3-4. Multi-Segment Model

Access checks can be used to protect not only against referencing an address outside the limit of a segment, but
also against performing disallowed operations in certain segments. For example, since code segments are desig-
nated as read-only segments, hardware can be used to prevent writes into code segments. The access rights infor-
mation created for segments can also be used to set up protection rings or levels. Protection levels can be used to
protect operating-system procedures from unauthorized access by application programs.

3.2.4 Segmentation in IA-32e Mode


In IA-32e mode of Intel 64 architecture, the effects of segmentation depend on whether the processor is running
in compatibility mode or 64-bit mode. In compatibility mode, segmentation functions just as it does using legacy
16-bit or 32-bit protected mode semantics.
In 64-bit mode, segmentation is generally (but not completely) disabled, creating a flat 64-bit linear-address
space. The processor treats the segment base of CS, DS, ES, SS as zero, creating a linear address that is equal to
the effective address. The FS and GS segments are exceptions. These segment registers (which hold the segment
base) can be used as additional base registers in linear address calculations. They facilitate addressing local data
and certain operating system data structures.
Note that the processor does not perform segment limit checks at runtime in 64-bit mode.

3.2.5 Paging and Segmentation


Paging can be used with any of the segmentation models described in Figures 3-2, 3-3, and 3-4. The processor’s
paging mechanism divides the linear address space (into which segments are mapped) into pages (as shown in
Figure 3-1). These linear-address-space pages are then mapped to pages in the physical address space. The
paging mechanism offers several page-level protection facilities that can be used with or instead of the segment-
protection facilities. For example, it lets read-write protection be enforced on a page-by-page basis. The paging
mechanism also provides two-level user-supervisor protection that can also be specified on a page-by-page basis.

3.3 PHYSICAL ADDRESS SPACE


In protected mode, the IA-32 architecture provides a normal physical address space of 4 GBytes (2^32 bytes). This
is the address space that the processor can address on its address bus. This address space is flat (unsegmented),
with addresses ranging continuously from 0 to FFFFFFFFH. This physical address space can be mapped to read-
write memory, read-only memory, and memory mapped I/O. The memory mapping facilities described in this
chapter can be used to divide this physical memory up into segments and/or pages.
Starting with the Pentium Pro processor, the IA-32 architecture also supports an extension of the physical address
space to 2^36 bytes (64 GBytes), with a maximum physical address of FFFFFFFFFH. This extension is invoked in
either of two ways:
• Using the physical address extension (PAE) flag, located in bit 5 of control register CR4.
• Using the 36-bit page size extension (PSE-36) feature (introduced in the Pentium III processors).
Physical address support has since been extended beyond 36 bits. See Chapter 5, “Paging‚” for more information
about 36-bit physical addressing.

3.3.1 Intel® 64 Processors and Physical Address Space


On processors that support Intel 64 architecture (CPUID.80000001H:EDX[29] = 1), the size of the physical
address range is implementation-specific and indicated by CPUID.80000008H:EAX[bits 7-0].
For the format of information returned in EAX, see “CPUID—CPU Identification” in Chapter 3 of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 2A. See also: Chapter 5, “Paging.”
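As an illustrative sketch (not part of the architectural definition), the following user-mode C fragment reads
this enumeration with the GCC/Clang __get_cpuid helper from <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.80000008H:EAX[7:0] enumerates the physical-address width;
           EAX[15:8] enumerates the linear-address width. */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            printf("physical-address width: %u bits\n", eax & 0xFF);
            printf("linear-address width:   %u bits\n", (eax >> 8) & 0xFF);
        }
        return 0;
    }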

3.4 LOGICAL AND LINEAR ADDRESSES


At the system-architecture level in protected mode, the processor uses two stages of address translation to arrive
at a physical address: logical-address translation and linear address space paging.
Even with the minimum use of segments, every byte in the processor’s address space is accessed with a logical
address. A logical address consists of a 16-bit segment selector and a 32-bit offset (see Figure 3-5). The segment
selector identifies the segment the byte is located in and the offset specifies the location of the byte in the segment
relative to the base address of the segment.
The processor translates every logical address into a linear address. A linear address is a 32-bit address in the
processor’s linear address space. Like the physical address space, the linear address space is a flat (unsegmented),
2^32-byte address space, with addresses ranging from 0 to FFFFFFFFH. The linear address space contains all the
segments and system tables defined for a system.
To translate a logical address into a linear address, the processor does the following:
1. Uses the offset in the segment selector to locate the segment descriptor for the segment in the GDT or LDT and
reads it into the processor. (This step is needed only when a new segment selector is loaded into a segment
register.)
2. Examines the segment descriptor to check the access rights and range of the segment to ensure that the
segment is accessible and that the offset is within the limits of the segment.
3. Adds the base address of the segment from the segment descriptor to the offset to form a linear address.
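Steps 2 and 3 amount to a limit check followed by a base-plus-offset addition. The following C sketch models
them for an expand-up segment; the structure is illustrative and represents descriptor fields already cached
in a segment register, not an architectural layout:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative, already-unpacked view of a cached segment descriptor. */
    struct seg_desc {
        uint32_t base;   /* linear base address of the segment */
        uint32_t limit;  /* effective limit, in bytes */
    };

    /* Translate an offset to a linear address for an expand-up segment.
       Returns false where the processor would raise #GP (or #SS for SS). */
    static bool logical_to_linear(const struct seg_desc *d,
                                  uint32_t offset, uint32_t *linear)
    {
        if (offset > d->limit)       /* step 2: limit check */
            return false;
        *linear = d->base + offset;  /* step 3: add base to offset */
        return true;
    }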


[Diagram: a logical address consists of a 16-bit segment selector (bits 15:0) and a 32-bit (or 64-bit) offset,
also called the effective address; the selector indexes a segment descriptor in a descriptor table, and the
descriptor's base address is added to the offset to produce the 32-bit (or 64-bit) linear address.]

Figure 3-5. Logical Address to Linear Address Translation

If paging is not used, the processor maps the linear address directly to a physical address (that is, the linear
address goes out on the processor’s address bus). If the linear address space is paged, a second level of address
translation is used to translate the linear address into a physical address.
See also: Chapter 5, “Paging.”

3.4.1 Logical Address Translation in IA-32e Mode


In IA-32e mode, an Intel 64 processor uses the steps described above to translate a logical address to a linear
address. In 64-bit mode, the offset and base address of the segment are 64 bits instead of 32 bits. The linear
address format is also 64 bits wide and is subject to linear-address pre-processing (see Chapter 4).
Each code segment descriptor provides an L bit. This bit selects, on a per-code-segment basis, whether the
segment executes 64-bit code or legacy 32-bit code.

3.4.2 Segment Selectors


A segment selector is a 16-bit identifier for a segment (see Figure 3-6). It does not point directly to the segment,
but instead points to the segment descriptor that defines the segment. A segment selector contains the following
items:
Index (Bits 3 through 15) — Selects one of 8192 descriptors in the GDT or LDT. The processor multiplies
the index value by 8 (the number of bytes in a segment descriptor) and adds the result to the base
address of the GDT or LDT (from the GDTR or LDTR register, respectively).
TI (table indicator) flag
(Bit 2) — Specifies the descriptor table to use: clearing this flag selects the GDT; setting this flag
selects the current LDT.

Bits 15:3 — Index
Bit 2     — TI (Table Indicator: 0 = GDT, 1 = LDT)
Bits 1:0  — Requested Privilege Level (RPL)

Figure 3-6. Segment Selector


Requested Privilege Level (RPL)


(Bits 0 and 1) — Specifies the privilege level of the selector. The privilege level can range from 0 to
3, with 0 being the most privileged level. See Section 6.5, “Privilege Levels,” for a description of the
relationship of the RPL to the CPL of the executing program (or task) and the descriptor privilege
level (DPL) of the descriptor the segment selector points to.
The first entry of the GDT is not used by the processor. A segment selector that points to this entry of the GDT (that
is, a segment selector with an index of 0 and the TI flag set to 0) is used as a “null segment selector.” The processor
does not generate an exception when a segment register (other than the CS or SS registers) is loaded with a null
selector. It does, however, generate an exception when a segment register holding a null selector is used to access
memory. A null selector can be used to initialize unused segment registers. Loading the CS or SS register with a null
segment selector causes a general-protection exception (#GP) to be generated.
Segment selectors are visible to application programs as part of a pointer variable, but the values of selectors are
usually assigned or modified by link editors or linking loaders, not application programs.
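The selector fields shown in Figure 3-6 can be extracted with shifts and masks. A sketch (the macro names are
illustrative):

    #include <stdint.h>

    #define SEL_RPL(sel)    ((sel) & 0x3)         /* bits 1:0 */
    #define SEL_TI(sel)     (((sel) >> 2) & 0x1)  /* bit 2: 0 = GDT, 1 = LDT */
    #define SEL_INDEX(sel)  ((sel) >> 3)          /* bits 15:3 */

    /* Byte offset of the descriptor within the GDT or LDT: the processor
       multiplies the index by 8, the size of a segment descriptor. */
    static inline uint32_t sel_table_offset(uint16_t sel)
    {
        return (uint32_t)SEL_INDEX(sel) * 8;
    }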

3.4.3 Segment Registers


To reduce address translation time and coding complexity, the processor provides registers for holding up to 6
segment selectors (see Figure 3-7). Each of these segment registers supports a specific kind of memory reference
(code, stack, or data). For virtually any kind of program execution to take place, at least the code-segment (CS),
data-segment (DS), and stack-segment (SS) registers must be loaded with valid segment selectors. The processor
also provides three additional data-segment registers (ES, FS, and GS), which can be used to make additional data
segments available to the currently executing program (or task).
For a program to access a segment, the segment selector for the segment must have been loaded in one of the
segment registers. So, although a system can define thousands of segments, only 6 can be available for immediate
use. Other segments can be made available by loading their segment selectors into these registers during program
execution.

[Diagram: each of the segment registers CS, SS, DS, ES, FS, and GS has a visible part holding the segment
selector and a hidden part holding the base address, limit, and access information.]

Figure 3-7. Segment Registers

Every segment register has a “visible” part and a “hidden” part. (The hidden part is sometimes referred to as a
“descriptor cache” or a “shadow register.”) When a segment selector is loaded into the visible part of a segment
register, the processor also loads the hidden part of the segment register with the base address, segment limit, and
access control information from the segment descriptor pointed to by the segment selector. The information cached
in the segment register (visible and hidden) allows the processor to translate addresses without taking extra bus
cycles to read the base address and limit from the segment descriptor. In systems in which multiple processors
have access to the same descriptor tables, it is the responsibility of software to reload the segment registers when
the descriptor tables are modified. If this is not done, an old segment descriptor cached in a segment register might
be used after its memory-resident version has been modified.
Two kinds of load instructions are provided for loading the segment registers:
1. Direct load instructions such as the MOV, POP, LDS, LES, LSS, LGS, and LFS instructions. These instructions
explicitly reference the segment registers.
2. Implied load instructions such as the far pointer versions of the CALL, JMP, and RET instructions, the SYSENTER
and SYSEXIT instructions, and the IRET, INT n, INTO, INT3, and INT1 instructions. These instructions change
the contents of the CS register (and sometimes other segment registers) as an incidental part of their
operation.
The MOV instruction can also be used to store the visible part of a segment register in a general-purpose register.

3.4.4 Segment Loading Instructions in IA-32e Mode


Because ES, DS, and SS segment registers are not used in 64-bit mode, their fields (base, limit, and attribute) in
segment descriptor registers are ignored. Some forms of segment load instructions are also invalid (for example,
LDS, POP ES). Address calculations that reference the ES, DS, or SS segments are treated as if the segment base
is zero.
The processor performs linear-address pre-processing (Chapter 4) instead of performing limit checks. Mode
switching does not change the contents of the segment registers or the associated descriptor registers. These
registers are also not changed during 64-bit mode execution, unless explicit segment loads are performed.
In order to set up compatibility mode for an application, segment-load instructions (MOV to Sreg, POP Sreg) work
normally in 64-bit mode. An entry is read from the system descriptor table (GDT or LDT) and is loaded in the hidden
portion of the segment register. The descriptor-register base, limit, and attribute fields are all loaded. However, the
contents of the data and stack segment selector and the descriptor registers are ignored.
When FS and GS segment overrides are used in 64-bit mode, their respective base addresses are used in the linear
address calculation: (FS or GS).base + index + displacement. FS.base and GS.base are then expanded to the full
linear-address size supported by the implementation. The resulting effective address calculation can wrap across
positive and negative addresses; the resulting address is subject to linear-address pre-processing.
In 64-bit mode, memory accesses using FS-segment and GS-segment overrides are not checked for a runtime limit
nor subjected to attribute-checking. Normal segment loads (MOV to Sreg and POP Sreg) into FS and GS load a
standard 32-bit base value in the hidden portion of the segment register. The base address bits above the standard
32 bits are cleared to 0 to allow consistency for implementations that use less than 64 bits.
The hidden descriptor register fields for FS.base and GS.base are physically mapped to MSRs in order to load all
address bits supported by a 64-bit implementation. Software with CPL = 0 (privileged software) can load all
supported linear-address bits into FS.base or GS.base using WRMSR. Addresses written into the 64-bit FS.base
and GS.base registers must be in canonical form. A WRMSR instruction that attempts to write a non-canonical
address to those registers causes a #GP fault. See Section 4.5.2.
When in compatibility mode, FS and GS overrides operate as defined by 32-bit mode behavior regardless of the
value loaded into the upper 32 linear-address bits of the hidden descriptor register base field. Compatibility mode
ignores the upper 32 bits when calculating an effective address.
A new 64-bit mode instruction, SWAPGS, can be used to load GS base. SWAPGS exchanges the kernel data struc-
ture pointer from the IA32_KERNEL_GS_BASE MSR with the GS base register. The kernel can then use the GS
prefix on normal memory references to access the kernel data structures. An attempt to write a non-canonical
value (using WRMSR) to the IA32_KERNEL_GS_BASE MSR causes a #GP fault; see Section 4.5.2.
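As an illustrative sketch of this MSR interface for a kernel running at CPL = 0 (the wrmsr wrapper is an
assumed helper; the MSR indices are architectural):

    #include <stdint.h>

    #define IA32_FS_BASE         0xC0000100
    #define IA32_GS_BASE         0xC0000101
    #define IA32_KERNEL_GS_BASE  0xC0000102

    /* Assumed wrapper around the WRMSR instruction (CPL = 0 only). */
    static inline void wrmsr(uint32_t msr, uint64_t value)
    {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Load a per-CPU data pointer into GS.base. The value must be
       canonical; otherwise WRMSR raises #GP (see Section 4.5.2). */
    static void set_gs_base(uint64_t percpu_ptr)
    {
        wrmsr(IA32_GS_BASE, percpu_ptr);
    }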

3.4.5 Segment Descriptors


A segment descriptor is a data structure in a GDT or LDT that provides the processor with the size and location of
a segment, as well as access control and status information. Segment descriptors are typically created by
compilers, linkers, loaders, or the operating system or executive, but not application programs. Figure 3-8 illus-
trates the general descriptor format for all types of segment descriptors.


Doubleword at offset 4:
  Bits 31:24 — Base 31:24; bit 23 — G; bit 22 — D/B; bit 21 — L; bit 20 — AVL;
  bits 19:16 — Seg. Limit 19:16; bit 15 — P; bits 14:13 — DPL; bit 12 — S;
  bits 11:8 — Type; bits 7:0 — Base 23:16

Doubleword at offset 0:
  Bits 31:16 — Base Address 15:00; bits 15:0 — Segment Limit 15:00

L — 64-bit code segment (IA-32e mode only)
AVL — Available for use by system software
BASE — Segment base address
D/B — Default operation size (0 = 16-bit segment; 1 = 32-bit segment)
DPL — Descriptor privilege level
G — Granularity
LIMIT — Segment Limit
P — Segment present
S — Descriptor type (0 = system; 1 = code or data)
TYPE — Segment type

Figure 3-8. Segment Descriptor

The flags and fields in a segment descriptor are as follows:


Segment limit field
Specifies the size of the segment. The processor puts together the two segment limit fields to form
a 20-bit value. The processor interprets the segment limit in one of two ways, depending on the
setting of the G (granularity) flag:
• If the granularity flag is clear, the segment size can range from 1 byte to 1 MByte, in byte incre-
ments.
• If the granularity flag is set, the segment size can range from 4 KBytes to 4 GBytes, in 4-KByte
increments.
The processor uses the segment limit in two different ways, depending on whether the segment is
an expand-up or an expand-down segment. See Section 3.4.5.1, “Code- and Data-Segment
Descriptor Types,” for more information about segment types. For expand-up segments, the offset
in a logical address can range from 0 to the segment limit. Offsets greater than the segment limit
generate general-protection exceptions (#GP, for all segments other than SS) or stack-fault excep-
tions (#SS for the SS segment). For expand-down segments, the segment limit has the reverse
function; the offset can range from the segment limit plus 1 to FFFFFFFFH or FFFFH, depending on
the setting of the B flag. Offsets less than or equal to the segment limit generate general-protection
exceptions or stack-fault exceptions. Decreasing the value in the segment limit field for an expand-
down segment allocates new memory at the bottom of the segment's address space, rather than at
the top. IA-32 architecture stacks always grow downwards, making this mechanism convenient for
expandable stacks.
Base address fields
Defines the location of byte 0 of the segment within the 4-GByte linear address space. The
processor puts together the three base address fields to form a single 32-bit value. Segment base
addresses should be aligned to 16-byte boundaries. Although 16-byte alignment is not required,
this alignment allows programs to maximize performance by aligning code and data on 16-byte
boundaries.
Type field Indicates the segment or gate type and specifies the kinds of access that can be made to the
segment and the direction of growth. The interpretation of this field depends on whether the
descriptor type flag specifies an application (code or data) descriptor or a system descriptor. The
encoding of the type field is different for code, data, and system descriptors (see Figure 6-1). See
Section 3.4.5.1, “Code- and Data-Segment Descriptor Types,” for a description of how this field is
used to specify code and data-segment types.


S (descriptor type) flag


Specifies whether the segment descriptor is for a system segment (S flag is clear) or a code or data
segment (S flag is set).
DPL (descriptor privilege level) field
Specifies the privilege level of the segment. The privilege level can range from 0 to 3, with 0 being
the most privileged level. The DPL is used to control access to the segment. See Section 6.5, “Priv-
ilege Levels,” for a description of the relationship of the DPL to the CPL of the executing code
segment and the RPL of a segment selector.
P (segment-present) flag
Indicates whether the segment is present in memory (set) or not present (clear). If this flag is clear,
the processor generates a segment-not-present exception (#NP) when a segment selector that
points to the segment descriptor is loaded into a segment register. Memory management software
can use this flag to control which segments are actually loaded into physical memory at a given
time. It offers a control in addition to paging for managing virtual memory.
Figure 3-9 shows the format of a segment descriptor when the segment-present flag is clear. When
this flag is clear, the operating system or executive is free to use the locations marked “Available” to
store its own data, such as information regarding the whereabouts of the missing segment.
D/B (default operation size/default stack pointer size and/or upper bound) flag
Performs different functions depending on whether the segment descriptor is an executable code
segment, an expand-down data segment, or a stack segment. (This flag should always be set to 1
for 32-bit code and data segments and to 0 for 16-bit code and data segments.)
• Executable code segment. The flag is called the D flag and it indicates the default length for
effective addresses and operands referenced by instructions in the segment. If the flag is set,
32-bit addresses and 32-bit or 8-bit operands are assumed; if it is clear, 16-bit addresses and
16-bit or 8-bit operands are assumed.
The instruction prefix 66H can be used to select an operand size other than the default, and the
prefix 67H can be used to select an address size other than the default.
• Stack segment (data segment pointed to by the SS register). The flag is called the B (big)
flag and it specifies the size of the stack pointer used for implicit stack operations (such as
pushes, pops, and calls). If the flag is set, a 32-bit stack pointer is used, which is stored in the
32-bit ESP register; if the flag is clear, a 16-bit stack pointer is used, which is stored in the 16-
bit SP register. If the stack segment is set up to be an expand-down data segment (described in
the next paragraph), the B flag also specifies the upper bound of the stack segment.
• Expand-down data segment. The flag is called the B flag and it specifies the upper bound of
the segment. If the flag is set, the upper bound is FFFFFFFFH (4 GBytes); if the flag is clear, the
upper bound is FFFFH (64 KBytes).

Doubleword at offset 4:
  Bits 31:16 — Available; bit 15 — P (clear); bits 14:13 — DPL; bit 12 — S;
  bits 11:8 — Type; bits 7:0 — Available

Doubleword at offset 0:
  Bits 31:0 — Available

Figure 3-9. Segment Descriptor When Segment-Present Flag Is Clear

G (granularity) flag
Determines the scaling of the segment limit field. When the granularity flag is clear, the segment
limit is interpreted in byte units; when the flag is set, the segment limit is interpreted in 4-KByte units.
(This flag does not affect the granularity of the base address; it is always byte granular.) When the
granularity flag is set, the twelve least significant bits of an offset are not tested when checking the
offset against the segment limit. For example, when the granularity flag is set, a limit of 0 results in
valid offsets from 0 to 4095.
L (64-bit code segment) flag
In IA-32e mode, bit 21 of the second doubleword of the segment descriptor indicates whether a
code segment contains native 64-bit code. A value of 1 indicates instructions in this code segment
are executed in 64-bit mode. A value of 0 indicates the instructions in this code segment are
executed in compatibility mode. If the L-bit is set, then the D-bit must be cleared. Bit 21 is not used
outside IA-32e mode (or for data segments). Because an attempt to activate IA-32e mode will fault
if the current CS has the L-bit set (see Section 11.8.5), software operating outside IA-32e mode
should avoid loading CS from a descriptor that sets the L-bit.
Available and reserved bits
Bit 20 of the second doubleword of the segment descriptor is available for use by system software.
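The following C sketch unpacks the fields described above from a raw 8-byte descriptor and computes the
effective limit implied by the G flag; the structure and helper names are illustrative:

    #include <stdint.h>

    struct seg_fields {
        uint32_t base, limit;
        uint8_t  type, s, dpl, p, avl, l, db, g;
    };

    static void unpack_descriptor(uint64_t desc, struct seg_fields *f)
    {
        uint32_t lo = (uint32_t)desc, hi = (uint32_t)(desc >> 32);

        f->limit = (lo & 0xFFFF) | (hi & 0x000F0000);  /* 20-bit limit */
        f->base  = (lo >> 16) | ((hi & 0xFF) << 16) | (hi & 0xFF000000);
        f->type  = (hi >> 8)  & 0xF;
        f->s     = (hi >> 12) & 1;
        f->dpl   = (hi >> 13) & 3;
        f->p     = (hi >> 15) & 1;
        f->avl   = (hi >> 20) & 1;
        f->l     = (hi >> 21) & 1;
        f->db    = (hi >> 22) & 1;
        f->g     = (hi >> 23) & 1;
    }

    /* Effective limit in bytes: with G set, the 20-bit limit is scaled to
       4-KByte units, and the low 12 bits of an offset are not tested. */
    static uint32_t effective_limit(const struct seg_fields *f)
    {
        return f->g ? (f->limit << 12) | 0xFFF : f->limit;
    }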

3.4.5.1 Code- and Data-Segment Descriptor Types


When the S (descriptor type) flag in a segment descriptor is set, the descriptor is for either a code or a data
segment. The highest order bit of the type field (bit 11 of the second double word of the segment descriptor) then
determines whether the descriptor is for a data segment (clear) or a code segment (set).
For data segments, the three low-order bits of the type field (bits 8, 9, and 10) are interpreted as accessed (A),
write-enable (W), and expansion-direction (E). See Table 3-1 for a description of the encoding of the bits in the
type field for code and data segments. Data segments can be read-only or read/write segments, depending on the
setting of the write-enable bit.

Table 3-1. Code- and Data-Segment Types

         Type Field            Descriptor
Decimal  11   10   9    8      Type         Description
              E    W    A
0        0    0    0    0      Data         Read-Only
1        0    0    0    1      Data         Read-Only, accessed
2        0    0    1    0      Data         Read/Write
3        0    0    1    1      Data         Read/Write, accessed
4        0    1    0    0      Data         Read-Only, expand-down
5        0    1    0    1      Data         Read-Only, expand-down, accessed
6        0    1    1    0      Data         Read/Write, expand-down
7        0    1    1    1      Data         Read/Write, expand-down, accessed
              C    R    A
8        1    0    0    0      Code         Execute-Only
9        1    0    0    1      Code         Execute-Only, accessed
10       1    0    1    0      Code         Execute/Read
11       1    0    1    1      Code         Execute/Read, accessed
12       1    1    0    0      Code         Execute-Only, conforming
13       1    1    0    1      Code         Execute-Only, conforming, accessed
14       1    1    1    0      Code         Execute/Read, conforming
15       1    1    1    1      Code         Execute/Read, conforming, accessed
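The encodings in Table 3-1 follow a regular pattern and can be decoded mechanically. A sketch, applicable only
when the S flag is set (the names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Decoded view of a 4-bit code/data type field (S = 1). Bit 11 selects
       code vs. data; the low three bits are E/C, W/R, and A per Table 3-1. */
    struct seg_type {
        bool code;         /* bit 11: 1 = code, 0 = data */
        bool exp_or_conf;  /* bit 10: data expand-down (E); code conforming (C) */
        bool wr_or_rd;     /* bit 9:  data writable (W); code readable (R) */
        bool accessed;     /* bit 8:  accessed (A) */
    };

    static struct seg_type decode_type(uint8_t type4)
    {
        struct seg_type t;
        t.code        = (type4 >> 3) & 1;
        t.exp_or_conf = (type4 >> 2) & 1;
        t.wr_or_rd    = (type4 >> 1) & 1;
        t.accessed    =  type4       & 1;
        return t;
    }

For example, decode_type(12) yields a conforming, execute-only code segment, matching row 12 of the table.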

Stack segments are data segments which must be read/write segments. Loading the SS register with a segment
selector for a nonwritable data segment generates a general-protection exception (#GP). If the size of a stack
segment needs to be changed dynamically, the stack segment can be an expand-down data segment (expansion-
direction flag set). Here, dynamically changing the segment limit causes stack space to be added to the bottom of
the stack. If the size of a stack segment is intended to remain static, the stack segment may be either an expand-
up or expand-down type.
The accessed bit indicates whether the segment has been accessed since the last time the operating-system or
executive cleared the bit. The processor sets this bit whenever it loads a segment selector for the segment into a
segment register, assuming that the type of memory that contains the segment descriptor supports processor
writes. The bit remains set until explicitly cleared. This bit can be used both for virtual memory management and
for debugging.
For code segments, the three low-order bits of the type field are interpreted as accessed (A), read enable (R), and
conforming (C). Code segments can be execute-only or execute/read, depending on the setting of the read-enable
bit. An execute/read segment might be used when constants or other static data have been placed with instruction
code in a ROM. Here, data can be read from the code segment either by using an instruction with a CS override
prefix or by loading a segment selector for the code segment in a data-segment register (the DS, ES, FS, or GS
registers). In protected mode, code segments are not writable.
Code segments can be either conforming or nonconforming. A transfer of execution into a more-privileged
conforming segment allows execution to continue at the current privilege level. A transfer into a nonconforming
segment at a different privilege level results in a general-protection exception (#GP), unless a call gate or task gate
is used (see Section 6.8.1, “Direct Calls or Jumps to Code Segments,” for more information on conforming and
nonconforming code segments). System utilities that do not access protected facilities and handlers for some types
of exceptions (such as, divide error or overflow) may be loaded in conforming code segments. Utilities that need to
be protected from less privileged programs and procedures should be placed in nonconforming code segments.

NOTE
Execution cannot be transferred by a call or a jump to a less-privileged (numerically higher
privilege level) code segment, regardless of whether the target segment is a conforming or
nonconforming code segment. Attempting such an execution transfer will result in a general-
protection exception.

All data segments are nonconforming, meaning that they cannot be accessed by less privileged programs or proce-
dures (code executing at numerically higher privilege levels). Unlike code segments, however, data segments can
be accessed by more privileged programs or procedures (code executing at numerically lower privilege levels)
without using a special access gate.
If the segment descriptors in the GDT or an LDT are placed in ROM, the processor can enter an indefinite loop if
software or the processor attempts to update (write to) the ROM-based segment descriptors. To prevent this
problem, set the accessed bits for all segment descriptors placed in a ROM. Also, remove operating-system or
executive code that attempts to modify segment descriptors located in ROM.

3.5 SYSTEM DESCRIPTOR TYPES


When the S (descriptor type) flag in a segment descriptor is clear, the descriptor type is a system descriptor. The
processor recognizes the following types of system descriptors:
• Local descriptor-table (LDT) segment descriptor.
• Task-state segment (TSS) descriptor.
• Call-gate descriptor.
• Interrupt-gate descriptor.
• Trap-gate descriptor.
• Task-gate descriptor.
These descriptor types fall into two categories: system-segment descriptors and gate descriptors. System-
segment descriptors point to system segments (LDT and TSS segments). Gate descriptors are in themselves
“gates,” which hold pointers to procedure entry points in code segments (call, interrupt, and trap gates) or which
hold segment selectors for TSS’s (task gates).


Table 3-2 shows the encoding of the type field for system-segment descriptors and gate descriptors. Note that
system descriptors in IA-32e mode are 16 bytes instead of 8 bytes.

Table 3-2. System-Segment and Gate-Descriptor Types

         Type Field
Decimal  11   10   9    8      32-Bit Mode              IA-32e Mode
0        0    0    0    0      Reserved                 Reserved
1        0    0    0    1      16-bit TSS (Available)   Reserved
2        0    0    1    0      LDT                      LDT
3        0    0    1    1      16-bit TSS (Busy)        Reserved
4        0    1    0    0      16-bit Call Gate         Reserved
5        0    1    0    1      Task Gate                Reserved
6        0    1    1    0      16-bit Interrupt Gate    Reserved
7        0    1    1    1      16-bit Trap Gate         Reserved
8        1    0    0    0      Reserved                 Reserved
9        1    0    0    1      32-bit TSS (Available)   64-bit TSS (Available)
10       1    0    1    0      Reserved                 Reserved
11       1    0    1    1      32-bit TSS (Busy)        64-bit TSS (Busy)
12       1    1    0    0      32-bit Call Gate         64-bit Call Gate
13       1    1    0    1      Reserved                 Reserved
14       1    1    1    0      32-bit Interrupt Gate    64-bit Interrupt Gate
15       1    1    1    1      32-bit Trap Gate         64-bit Trap Gate

See also: Section 3.5.1, “Segment Descriptor Tables,” and Section 9.2.2, “TSS Descriptor,” (for more information
on the system-segment descriptors); see Section 6.8.3, “Call Gates,” Section 7.11, “IDT Descriptors,” and Section
9.2.5, “Task-Gate Descriptor,” (for more information on the gate descriptors).

3.5.1 Segment Descriptor Tables


A segment descriptor table is an array of segment descriptors (see Figure 3-10). A descriptor table is variable in
length and can contain up to 8192 (2^13) 8-byte descriptors. There are two kinds of descriptor tables:
• The global descriptor table (GDT).
• The local descriptor tables (LDT).


[Diagram: the TI flag of a segment selector chooses between the global descriptor table (TI = 0) and the
current local descriptor table (TI = 1); each table is an array of 8-byte descriptors at offsets 0, 8, 16,
..., 56, and beyond; the first descriptor in the GDT is not used; the GDTR register holds the GDT base
address and limit, and the LDTR register holds the LDT segment selector, base address, and limit.]

Figure 3-10. Global and Local Descriptor Tables

Each system must have one GDT defined, which may be used for all programs and tasks in the system. Optionally,
one or more LDTs can be defined. For example, an LDT can be defined for each separate task being run, or some or
all tasks can share the same LDT.
The GDT is not a segment itself; instead, it is a data structure in linear address space. The base linear address and
limit of the GDT must be loaded into the GDTR register (see Section 2.4, “Memory-Management Registers”). The
base address of the GDT should be aligned on an eight-byte boundary to yield the best processor performance. The
limit value for the GDT is expressed in bytes. As with segments, the limit value is added to the base address to get
the address of the last valid byte. A limit value of 0 results in exactly one valid byte. Because segment descriptors
are always 8 bytes long, the GDT limit should always be one less than an integral multiple of eight (that is, 8N – 1).
The first descriptor in the GDT is not used by the processor. A segment selector to this “null descriptor” does not
generate an exception when loaded into a data-segment register (DS, ES, FS, or GS), but it always generates a
general-protection exception (#GP) when an attempt is made to access memory using the descriptor. By initializing
the segment registers with this segment selector, accidental reference to unused segment registers can be guar-
anteed to generate an exception.
The LDT is located in a system segment of the LDT type. The GDT must contain a segment descriptor for the LDT
segment. If the system supports multiple LDTs, each must have a separate segment selector and segment
descriptor in the GDT. The segment descriptor for an LDT can be located anywhere in the GDT. See Section 3.5,
“System Descriptor Types,” for information on the LDT segment-descriptor type.
An LDT is accessed with its segment selector. To eliminate address translations when accessing the LDT, the
segment selector, base linear address, limit, and access rights of the LDT are stored in the LDTR register (see
Section 2.4, “Memory-Management Registers”).
When the GDTR register is stored (using the SGDT instruction), a 48-bit “pseudo-descriptor” is stored in memory
(see top diagram in Figure 3-11). To avoid alignment check faults in user mode (privilege level 3), the pseudo-
descriptor should be located at an odd word address (that is, address MOD 4 is equal to 2). This causes the
processor to store an aligned word, followed by an aligned doubleword. User-mode programs normally do not store
pseudo-descriptors, but the possibility of generating an alignment check fault can be avoided by aligning pseudo-
descriptors in this way. The same alignment should be used when storing the IDTR register using the SIDT instruc-
tion. When storing the LDTR or task register (using the SLDT or STR instruction, respectively), the pseudo-
descriptor should be located at a doubleword address (that is, address MOD 4 is equal to 0).
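A sketch of the 48-bit pseudo-descriptor and the odd-word placement described above, assuming GCC-style
packing and alignment attributes:

    #include <stdint.h>

    /* Legacy 48-bit pseudo-descriptor as stored by SGDT or SIDT. */
    struct pseudo_desc {
        uint16_t limit;
        uint32_t base;
    } __attribute__((packed));

    /* Start the pseudo-descriptor at an odd word address (address MOD 4
       equal to 2): the 16-bit limit is then word-aligned and the 32-bit
       base doubleword-aligned, avoiding alignment-check faults at
       privilege level 3. */
    static struct {
        uint16_t pad;              /* occupies the 4-byte boundary */
        struct pseudo_desc gdtr;   /* begins at offset 2 */
    } __attribute__((aligned(4))) sgdt_area;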

Legacy (48-bit) pseudo-descriptor:  bits 47:16 — 32-bit Base Address; bits 15:0 — Limit
IA-32e (80-bit) pseudo-descriptor:  bits 79:16 — 64-bit Base Address; bits 15:0 — Limit

Figure 3-11. Pseudo-Descriptor Formats

3.5.2 Segment Descriptor Tables in IA-32e Mode


In IA-32e mode, a segment descriptor table can contain up to 8192 (2^13) 8-byte descriptors. An entry in the
segment descriptor table can be 8 bytes. System descriptors are expanded to 16 bytes (occupying the space of two
entries).
GDTR and LDTR registers are expanded to hold a 64-bit base address. The corresponding pseudo-descriptor is 80
bits (see the bottom diagram in Figure 3-11).
The following system descriptors expand to 16 bytes:
— Call gate descriptors (see Section 6.8.3.1, “IA-32e Mode Call Gates”).
— IDT gate descriptors (see Section 7.14.1, “64-Bit Mode IDT”).
— LDT and TSS descriptors (see Section 9.2.3, “TSS Descriptor in 64-bit mode”).

13. New Chapter 4, Volume 3A
Change bars and violet text show changes to a new Chapter 4 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
This is a new chapter on linear-address pre-processing. It includes LASS and LAM and collects much of the
previous material on canonicality checking. Regarding the last point, it captures numerous fine points regarding
5-level paging.



CHAPTER 4
LINEAR-ADDRESS PRE-PROCESSING

As described in Chapter 3, “Protected-Mode Memory Management‚” software accesses to memory typically use
logical addresses. The processor uses segmentation, as detailed in Section 3.4, to generate linear addresses from
logical addresses. Linear addresses are then translated to physical addresses using paging, as described in Chapter
5, “Paging.”
In IA-32e mode (if IA32_EFER.LMA = 1), linear addresses may undergo some pre-processing before being trans-
lated through paging.1 Some of this pre-processing is done only if enabled by software, but some occurs uncondi-
tionally. Specifically, linear addresses are subject to pre-processing in IA-32e mode as follows:
1. Linear-address-space separation (LASS). This is a feature that, when enabled by software, may limit the
linear addresses that are accessible by software, generating faults for accesses out of range.
2. Linear-address masking (LAM). This is a feature that, when enabled by software, masks certain linear-
address bits.
3. Canonicality checking. As will be detailed in Chapter 5, paging does not translate all 64 bits of a linear
address. Each linear address must be canonical, meaning that the untranslated bits have a fixed value.
Memory accesses using a non-canonical address generate faults.
Both LASS and canonicality checking can generate faults. For any specific memory access, the two features
generate the same fault. For that reason, the relative order of that checking is not defined and cannot be deter-
mined by software.

4.1 ENABLING AND ENUMERATION


Software enables LASS by setting CR4.LASS[bit 27]. Enabling of LAM is based on three different bits:
CR3.LAM_U48[bit 62], CR3.LAM_U57[bit 61], and CR4.LAM_SUP[bit 28]. The use of these bits is explained in
Section 4.4. Canonicality checking is not enabled by software and is always performed in 64-bit mode.
The processor enumerates support for LASS with CPUID.(EAX=07H,ECX=1):EAX.LASS[bit 6]. If this bit is enumer-
ated as 1, software can set CR4.LASS.
The processor enumerates support for LAM with CPUID.(EAX=07H,ECX=1):EAX.LAM[bit 26]. If this bit is enumer-
ated as 1, software can set CR3.LAM_U48, CR3.LAM_U57, and CR4.LAM_SUP.
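A user-mode sketch of this enumeration, using the GCC/Clang __cpuid_count helper from <cpuid.h> (the check of
the maximum sub-leaf of leaf 07H is omitted for brevity):

    #include <cpuid.h>
    #include <stdbool.h>

    /* CPUID.(EAX=07H,ECX=1):EAX enumerates LASS (bit 6) and LAM (bit 26). */
    static void enumerate_lass_lam(bool *lass, bool *lam)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        __cpuid_count(7, 1, eax, ebx, ecx, edx);
        *lass = (eax >> 6)  & 1;
        *lam  = (eax >> 26) & 1;
    }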

4.2 MODE-BASED ACCESSES AND LINEAR-ADDRESS-SPACE PARTITIONING


Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction
fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made
while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.
Some operations implicitly access system data structures with linear addresses; the resulting accesses to those
data structures are supervisor-mode accesses regardless of CPL. Such accesses include the following: accesses to
the global descriptor table (GDT) or local descriptor table (LDT) to load a segment descriptor; accesses to the inter-
rupt descriptor table (IDT) when delivering an interrupt or exception; accesses to the task-state segment (TSS) as
part of a task switch or change of CPL; and accesses to a user posted-interrupt descriptor (UPID) during user-inter-
rupt notification processing. Such accesses are called implicit supervisor-mode accesses.
Other accesses made while CPL < 3 are called explicit supervisor-mode accesses.2

1. The presentation in this chapter focuses on 64-bit addresses. 32-bit and 16-bit addresses can also be used in IA-32e mode. For the
purposes of this chapter, the upper bits of such addresses (32 bits and 48 bits, respectively) are treated as if they were all zero.
2. The WRUSS instruction is an exception; although it can be executed only if CPL = 0, the processor treats its shadow-stack accesses
as user-mode accesses.


Some 64-bit operating systems partition the 64-bit linear-address space into a supervisor portion and a user
portion. Specifically, the upper half of the linear-address space (comprising addresses in which bit 63 is 1) is used
for supervisor instructions and data, while the lower half (addresses in which bit 63 is 0) is for user instructions and
data.
The LASS and LAM features are designed for operating systems that establish such linear-address-space parti-
tioning. However, the features are defined and may be used even if such partitioning is not in effect.

4.3 LINEAR-ADDRESS-SPACE SEPARATION (LASS)


The access rights determined by paging (see Section 5.6) are based on whether a linear address is a supervisor-
mode address or a user-mode address. Paging provides protection by preventing user-mode accesses to super-
visor-mode addresses; in addition, there are paging features that can prevent supervisor-mode accesses to user-
mode addresses.
These paging-based protections prevent malicious software from directly reading or writing memory inappropri-
ately. However, they require the processor to traverse a hierarchy of paging structures in memory. Unprivileged
software may be able to use the timing information resulting from this traversal to determine details about the
paging structures, the layout of supervisor memory, or its use by supervisor software.
Linear-address-space separation (LASS) is an independent mechanism that can enforce mode-based protection
without traversing the paging structures. Because LASS provides this protection as part of linear-address pre-
processing, unprivileged software is denied paging-based timing information.
An operating system can use LASS to provide protections corresponding to the mode-based paging protections if it
has established the linear-address-space partitioning outlined in Section 4.2.

4.3.1 Enumeration and Enabling


The processor enumerates support for LASS with CPUID.(EAX=07H,ECX=1):EAX.LASS[bit 6].
Software enables LASS by setting CR4.LASS[bit 27]. CR4.LASS can be set to 1 if
CPUID.(EAX=07H,ECX=1):EAX.LASS[bit 6] is enumerated as 1.
The operation of LASS is also affected by the paging-mode bit CR4.SMAP[bit 21], which enables supervisor-access
prevention. LASS enforces the equivalent of supervisor-mode execution prevention regardless of the setting of
CR4.SMEP[bit 17]. See Section 4.3.2 for details.

4.3.2 Operation of Linear-Address-Space Separation (LASS)


This section describes the operation of linear-address-space separation (LASS). This operation applies only in
IA-32e mode (if IA32_EFER.LMA = 1) and only if CR4.LASS = 1.
As indicated earlier, LASS enforces protections similar to those enforced by paging. Violations of these protections
are called LASS violations.
LASS violations typically result in faults. In most cases, an access causing a LASS violation results in a general
protection exception (#GP); for stack accesses (those due to stack-oriented instructions, as well as accesses that
implicitly or explicitly use the SS segment register), a stack fault (#SS) is generated. In either case, a null error
code is produced.
Some accesses are not subject to faults due to LASS violations. These include prefetches (e.g., those resulting from
execution of one of the PREFETCHh instructions), executions of the CLDEMOTE instruction, and accesses resulting
from the speculative fetch or execution of an instruction. Such an access may cause a LASS violation; if it does, the
access is not performed but no fault occurs.
The remainder of this section describes how LASS applies to different types of accesses to linear addresses. The
items below discuss specific LASS violations based on bit 63 of a linear address. For a linear address with only 32
bits (or 16 bits), the processor treats bit 63 as if it were 0; this includes accesses in compatibility mode.
• A user-mode data access causes a LASS violation if it would access a linear address of which bit 63 is 1.


• A supervisor-mode data access causes a LASS violation if it would access a linear address of which bit 63 is 0,
supervisor-mode access protection is enabled (by setting CR4.SMAP), and either RFLAGS.AC = 0 or the access
is an implicit supervisor-mode access.
• A user-mode instruction fetch causes a LASS violation if it would fetch an instruction using a linear address of
which bit 63 is 1.
• A supervisor-mode instruction fetch causes a LASS violation if it would access a linear address of which bit 63
is 0. (Unlike paging, this behavior of LASS applies regardless of the setting of CR4.SMEP.)
LASS for instruction fetches applies when the linear address in RIP is used to load an instruction from memory.
Unlike canonicality checking (see Section 4.5.2), LASS does not apply to branch instructions that load RIP. A
branch instruction can load RIP with an address that would violate LASS. Only when the address is used to fetch an
instruction will a LASS violation occur, generating a #GP. (The return instruction pointer of the #GP handler is the
address that incurred the LASS violation.)
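The rules above can be summarized in an illustrative C model of the check (the inputs characterizing the access
are assumed flags, not architectural state names):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative model of the LASS check, assuming IA-32e mode and
       CR4.LASS = 1. Returns true if the access causes a LASS violation. */
    static bool lass_violation(uint64_t la, bool user_mode, bool fetch,
                               bool smap, bool rflags_ac, bool implicit_sup)
    {
        bool upper_half = (la >> 63) & 1;   /* bit 63 of the linear address */

        if (user_mode)      /* user-mode fetch or data access */
            return upper_half;
        if (fetch)          /* supervisor-mode fetch, regardless of CR4.SMEP */
            return !upper_half;
        /* Supervisor-mode data access: a violation only with CR4.SMAP set
           and either RFLAGS.AC clear or an implicit supervisor-mode access. */
        return !upper_half && smap && (!rflags_ac || implicit_sup);
    }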

4.4 LINEAR-ADDRESS MASKING


This section describes linear-address masking (LAM). LAM modifies linear addresses before they are subject to
canonicality checking as described in Section 4.5. Doing so allows untranslated linear-address bits to contain arbi-
trary values.
In IA-32e mode, linear addresses have 64 bits and are translated either with 4-level paging, which translates the low
48 bits of each linear address, or with 5-level paging, which translates 57 bits. The upper linear-address bits are
reserved by canonicality checking (see Section 4.5).
Software usages that associate metadata with a pointer might benefit from being able to place metadata in the
upper (untranslated) bits of the pointer itself. However, the canonicality enforcement mentioned earlier implies
that software would have to mask the metadata bits in a pointer (making it canonical) before using it as a linear
address to access memory. LAM allows software to use pointers with metadata without having to mask the meta-
data bits. With LAM enabled, the processor masks the metadata bits in a pointer before using it as a linear address
to access memory.
LAM applies only in 64-bit mode and only to addresses used for data accesses. It does not apply to addresses used
for instruction fetches or to those being loaded into the RIP register (e.g., as targets of jump and call instructions).

4.4.1 Enumeration, Enabling, and Configuration


The processor enumerates support for LAM with CPUID.(EAX=07H,ECX=1):EAX.LAM[bit 26].
Enabling and configuration of LAM is controlled by the following control-register bits: CR3.LAM_U48[bit 62],
CR3.LAM_U57[bit 61], and CR4.LAM_SUP[bit 28]. The use of these control bits is explained below.
LAM supports configurations that differ regarding which linear-address bits are masked and can be used for meta-
data. With LAM48, linear-address bits 62:48 are masked (resulting in a LAM width of 15); with LAM57, linear-
address bits 62:57 are masked (a LAM width of 6).
Like LASS, LAM was designed for operating systems that establish the linear-address-space partitioning outlined in
Section 4.2: linear addresses that clear bit 63 are used for user memory, while those that set bit 63 are for super-
visor memory. For LAM, the identification of an address as user or supervisor is based solely on the value of bit 63
and does not depend on the CPL.
LAM and the LAM width can be configured independently for user and supervisor addresses (as identified in the
previous paragraph, using bit 63). CR3.LAM_U48 and CR3.LAM_U57 enable and configure LAM for user addresses:
• If CR3.LAM_U48 = CR3.LAM_U57 = 0, LAM is not enabled for user addresses.
• If CR3.LAM_U48 = 1 and CR3.LAM_U57 = 0, LAM48 is enabled for user addresses (a LAM width of 15).
• If CR3.LAM_U57 = 1, LAM57 is enabled for user addresses (a LAM width of 6; CR3.LAM_U48 is ignored in this
case).
CR4.LAM_SUP enables and configures LAM for supervisor addresses:
• If CR4.LAM_SUP = 0, LAM is not enabled for supervisor addresses.


• If CR4.LAM_SUP = 1, LAM is enabled for supervisor addresses with a width determined by the paging mode
(see Section 5.1.1):
— If 4-level paging is enabled, LAM48 is enabled for supervisor addresses (a LAM width of 15).
— If 5-level paging is enabled, LAM57 is enabled for supervisor addresses (a LAM width of 6).

4.4.2 Treatment of Data Accesses with LAM Active


When LAM is active, linear addresses used to access data are masked before they are subject to the canonicality
checking identified in Section 4.5. Specifically, LAM modifies a linear address by extending the value of one address
bit (depending on the LAM width) over others:
• When LAM48 is enabled (see Section 4.4.1), the processor modifies each linear address to replace each of
bits 62:48 with the value of bit 47.
• When LAM57 is enabled, each of bits 62:57 is replaced by the value of bit 56 (bits 56:48 are not modified).
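In both cases the masking amounts to sign-extending one address bit over the metadata bits while leaving bit 63
untouched. An illustrative C model (the function name is not architectural):

    #include <stdint.h>

    /* Replace bits 62:(b+1) of la with the value of bit b, leaving bit 63
       and bits b:0 unchanged; b = 47 models LAM48, b = 56 models LAM57. */
    static uint64_t lam_mask(uint64_t la, unsigned b)
    {
        uint64_t low_mask = (1ULL << (b + 1)) - 1;           /* bits b:0 */
        uint64_t fill     = ((la >> b) & 1) ? ~low_mask : 0; /* extension */
        return (la & (1ULL << 63))        /* original bit 63 */
             | (fill & ~(1ULL << 63))     /* bits 62:(b+1) from bit b */
             | (la & low_mask);           /* bits b:0 unchanged */
    }

With LAM48 enabled, for example, a pointer carrying metadata in bits 62:48 is translated as if those bits held
copies of bit 47, which is what lam_mask(p, 47) computes.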

4.4.3 Paging Interactions


As explained in Section 4.4.2, LAM masks certain bits in a linear address before that address is translated by
paging.
In most cases, the address bits in the masked positions are not used by address translation. However, if 5-level
paging is active and LAM48 is enabled for user pointers, bit 47 of a user pointer is extended over bits 62:48 to form
a linear address, and bits 56:48 are used by 5-level paging.
Page faults report the faulting linear address in CR2. Because LAM masking applies before paging, the faulting
linear address recorded in CR2 reflects the result of that masking and does not contain any masked metadata.
The INVLPG, INVPCID, and INVVPID instructions can be used to invalidate any translation lookaside buffer (TLB)
entries for specified linear addresses. LAM does not apply to those addresses, although those addresses are subject
to canonicality checking (see Section 4.5.4).

4.5 CANONICALITY CHECKING


Memory accesses in IA-32e mode can use 64-bit linear addresses. As detailed in Section 5.5.4, 4-level paging
translates the low 48 bits of each linear address, while 5-level paging translates the low 57 bits. The remaining
upper bits (bits 63:48 with 4-level paging; bits 63:57 with 5-level paging) are not translated.
IA-32e mode accounts for the fact that address bits are not translated (and thus should be reserved) with the
concept of canonicality. In general, a linear address is canonical if the untranslated bits are a sign-extension of
the most significant translated bit. More specifically, there are two types of canonicality:
• A linear address is paging canonical if it is canonical for the current paging mode: a linear address is canonical
for 4-level paging (48-bit canonical) if bits 63:47 of the address are identical; it is canonical for 5-level paging
(57-bit canonical) if bits 63:56 are identical.
• A linear address is CPU canonical if it is canonical relative to the widest linear address supported by the
processor: 48-bit canonical if the processor supports only 4-level paging and 57-bit canonical if the processor
supports 5-level paging.
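Expressed as code, both definitions reduce to a sign-extension test. The following C sketch (a hypothetical helper, not part of any architectural interface) checks whether an address is 48-bit or 57-bit canonical:

    #include <stdbool.h>
    #include <stdint.h>

    /* An address is canonical at 'width' bits (48 or 57) if bits 63
       through (width-1) are all identical, i.e., if the address equals
       its own sign-extension from bit (width-1). */
    static bool is_canonical(uint64_t la, int width)
    {
        int64_t sext = (int64_t)(la << (64 - width)) >> (64 - width);
        return (uint64_t)sext == la;
    }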
Unlike LASS and LAM, there is no control to enable canonicality checking. It always applies (as described in this
section) when 64-bit linear addresses are used.
Section 4.5.1 and Section 4.5.2 explain canonicality checking for accesses to memory using linear addresses and
for loads of the instruction pointer, respectively. Section 4.5.3 details how canonicality checking applies to certain
system registers that contain linear addresses. Section 4.5.4 explains the role of canonicality checking by instruc-
tions that invalidate TLB entries.


NOTE
Section 4.5.2 and Section 4.5.3 discuss the canonicality checking performed by the WRMSR
instruction. The WRMSRLIST and WRMSRNS instructions perform the same canonicality checking in
corresponding situations. (Similarly, the characterization of RDMSR in Section 4.5.3 applies also to
RDMSRLIST.)

4.5.1 Memory Accesses


An access to memory using a linear address is allowed only if the address is paging canonical; if it is not, a canon-
icality violation occurs. In most cases, an access causing a canonicality violation results in a general protection
exception (#GP); for stack accesses (those due to stack-oriented instructions, as well as accesses that implicitly or
explicitly use the SS segment register), a stack fault (#SS) is generated. In either case, a null error code is
produced.
When LAM is enabled, canonicality checking is performed after masking of the linear address. This implies that the
requirements of canonicality on an original (unmasked) linear address used to access data are effectively relaxed
when LAM is enabled:
• With LAM48, bit 63 and bit 47 of the original linear address must be identical.
• With LAM57 and 4-level paging, bit 63 and bits 56:47 of the original linear address must be identical.
• With LAM57 and 5-level paging, bit 63 and bit 56 of the original linear address must be identical.
While LAM applies only to data accesses, canonicality checking applies to both data accesses and instruction fetches.
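Combining the two earlier sketches makes the relaxed requirement concrete: masking first and checking canonicality second leaves, for LAM48, only the requirement that bit 63 and bit 47 of the original pointer match. (This reuses the hypothetical lam48_mask() and is_canonical() helpers sketched above.)

    /* With LAM48 active, the effective requirement on the original
       (unmasked) pointer is only that bits 63 and 47 be identical. */
    static bool lam48_data_access_ok(uint64_t ptr)
    {
        return is_canonical(lam48_mask(ptr), 48);
        /* equivalent to: ((ptr >> 63) & 1) == ((ptr >> 47) & 1) */
    }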

4.5.2 Loads of the Instruction Pointer (RIP)


In 64-bit mode, the RIP register contains the linear address of the instruction pointer. Operations that load RIP
(including both instructions such as JMP as well as control transfers through the IDT) check first whether the value
to be loaded is paging canonical. If it is not, the operation does not modify RIP and instead causes a #GP. (This #GP
is delivered as a fault, meaning that the return instruction pointer of the fault handler is the address of the faulting
instruction and not the non-canonical address whose load was attempted.)
This treatment applies to the SYSRET and SYSEXIT instructions, which load RIP from RCX and RDX, respectively.
The SYSCALL and SYSENTER instructions load RIP from the IA32_LSTAR and IA32_SYSENTER_EIP MSRs, respec-
tively. On processors that support only 4-level paging, these instructions do not check explicitly that the values
being loaded are paging canonical. This is because the WRMSR instruction ensures that these MSRs necessarily
contain values that are CPU canonical, which is the same as being paging canonical on processors that support only
4-level paging. On processors that support 5-level paging, the checking by WRMSR is relaxed to ensure only 57-bit
canonicality. (See Section 4.5.3 for the treatment of WRMSR.) On such processors, an execution of SYSCALL or
SYSENTER with 4-level paging checks that the value being loaded into RIP is 48-bit canonical (paging canonical).
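Restated in terms of the hypothetical is_canonical() helper sketched in this section: WRMSR to IA32_LSTAR enforces CPU canonicality at write time, while the later load of RIP by SYSCALL enforces paging canonicality. A minimal sketch, under those assumptions:

    /* Canonicality width enforced by WRMSR (CPU canonical) versus by
       the RIP load at SYSCALL time (paging canonical). */
    static bool wrmsr_lstar_ok(uint64_t value, bool cpu_supports_la57)
    {
        return is_canonical(value, cpu_supports_la57 ? 57 : 48);
    }

    static bool syscall_rip_load_ok(uint64_t lstar, bool five_level_paging)
    {
        return is_canonical(lstar, five_level_paging ? 57 : 48);
    }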

4.5.3 System Registers Containing Linear Addresses


In addition to RIP, the CPU maintains numerous other registers that hold linear addresses:
• GDTR and IDTR (in their base-address portions).
• LDTR, TR, FS, and GS (in the base-address portions of their hidden descriptor caches).
• Control register CR2, which holds the linear address causing a page fault.
• The debug-address registers (DR0 through DR3), which hold the linear addresses of breakpoints.
• The following MSRs: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_PL0_SSP, IA32_PL1_SSP,
IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A, IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A,
IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A, IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A,
IA32_RTIT_ADDR3_B, IA32_S_CET, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP, IA32_UINTR_HANDLER,
IA32_UINTR_PD, IA32_UINTR_STACKADJUST, IA32_U_CET, and IA32_UINTR_TT.
• The x87 FPU instruction pointer (FIP).


• The user-mode configuration register BNDCFGU, used by Intel® MPX.

With a few exceptions, the processor ensures that the addresses in these registers are always canonical in the
following ways:
• Some instructions fault on attempts to load a linear-address register with a non-canonical address:
— An execution of the LGDT or LIDT instruction causes a general-protection exception (#GP) if the base
address specified in the instruction’s memory operand is not canonical.
— An execution of the LLDT or LTR instruction causes a #GP if the base address to be loaded from the GDT is
not canonical.
— An execution of WRFSBASE or WRGSBASE causes a #GP if it would load the base address of either FS or GS
with a non-canonical address.
— An execution of WRMSR causes a #GP if it would load any of the following MSRs with a non-canonical
address: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_PL0_SSP,
IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A, IA32_RTIT_ADDR0_B,
IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A, IA32_RTIT_ADDR2_B,
IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_S_CET, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP,
IA32_UINTR_HANDLER, IA32_UINTR_PD, IA32_UINTR_STACKADJUST, IA32_U_CET, and
IA32_UINTR_TT.1
— An execution of XRSTORS causes a #GP if it would load any of the following MSRs with a non-canonical
address: IA32_PL0_SSP, IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A,
IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A,
IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_U_CET,
IA32_UINTR_HANDLER, IA32_UINTR_PD, IA32_UINTR_STACKADJUST, or IA32_UINTR_TT.
With a small number of exceptions, this enforcement checks for CPU canonicality and is thus independent of the
current paging mode. Thus, a processor that supports 5-level paging will allow the instructions mentioned
above to load these registers with addresses that are 57-bit canonical but not 48-bit canonical, even if 4-level
paging is active. (As a result, instructions that store these values — SGDT, SIDT, SLDT, STR, RDFSBASE,
RDGSBASE, RDMSR, XSAVE, XSAVEC, XSAVEOPT, and XSAVES — may save addresses that are 57-bit canonical
but not 48-bit canonical, even if 4-level paging is active.)
The WRFSBASE and WRGSBASE instructions, which load the base address of FS and GS, respectively, operate
differently. An execution of either of these instructions causes a #GP if it would load a base address with an
address that is not paging canonical. Thus, if 4-level paging is active, these instructions do not allow loading of
addresses that are 57-bit canonical but not 48-bit canonical.
• The FXRSTOR, XRSTOR, and XRSTORS instructions ignore attempts to load some of these registers with non-
canonical addresses:
— Loads of FIP ignore any bits in the memory image beyond the enumerated maximum linear-address width.
The processor sign-extends the most significant bit (e.g., bit 56 on processors that support 5-level paging)
to ensure that FIP is always CPU canonical.
— Loads of BNDCFGU (by XRSTOR or XRSTORS) ignore any bits in the memory image beyond the enumerated
maximum linear-address width. The processor sign-extends the most significant bit to ensure that
BNDCFGU is always CPU canonical.
• Every non-control x87 instruction loads FIP. The value loaded is always paging canonical.
• CR2 can be loaded with the MOV to CR instruction. The instruction allows that register to be loaded with a non-
canonical address. The MOV from CR instruction will return for CR2 the value last loaded into that register by a
page fault or with the MOV to CR instruction, even if (for the latter case) the address is not canonical. Page
faults load CR2 only with linear addresses that are paging canonical.
• DR0 through DR3 can be loaded with the MOV to DR instruction. The instruction allows those registers to be
loaded with non-canonical addresses. The MOV from DR instruction will return for a debug register the value

1. Such canonicality checking may apply also when the WRMSR instruction is used to load some non-architec-
tural MSRs (not listed here) that hold a linear address.


last loaded into that register with the MOV to DR instruction, even if the address is not canonical. Breakpoint
address matching is supported only for linear addresses that are paging canonical.

4.5.4 TLB-Invalidation Instructions


The Intel 64 architecture includes three instructions that may invalidate TLB entries for the linear address of an
instruction operand: INVLPG, INVPCID, and INVVPID. The following items describe how they are affected by
canonicality.
• The INVLPG instruction takes a memory operand. It invalidates any TLB entries that the logical processor is
caching for the linear address of that operand for the current linear address space. The instruction does not
fault if that address is not paging canonical. However, no invalidation is performed because the processor does
not cache TLB entries for addresses that are not paging canonical.
• The INVPCID instruction takes a register operand (INVPCID type) and a memory operand (INVPCID
descriptor). If the INVPCID type is 0, the instruction invalidates any TLB entries that the logical processor is
caching for the linear address and PCID specified in the INVPCID descriptor. If the linear address is not CPU
canonical, the instruction causes a #GP. If the processor supports 5-level paging, the instruction will not cause
such a #GP for an address that is 57-bit canonical, regardless of paging mode, even if 4-level paging is active
and the address is not 48-bit canonical.
• The INVVPID instruction takes a register operand (INVVPID type) and a memory operand (INVVPID
descriptor). If the INVVPID type is 0, the instruction invalidates any TLB entries that the logical processor is
caching for the linear address and VPID specified in the INVVPID descriptor. If the linear address is not CPU
canonical, the instruction fails.1 If the processor supports 5-level paging, the instruction will not fail for an
address that is 57-bit canonical, regardless of paging mode, even if 4-level paging is active and the address is
not 48-bit canonical.
LAM does not apply to the linear addresses that these instructions use to invalidate TLB entries.

1. INVVPID is a VMX instruction. In response to certain conditions, execution of a VMX instruction may fail, meaning that it does not
complete its normal operation. When a VMX instruction fails, control passes to the next instruction (rather than to a fault handler)
and a flag is set to report the failure.

14. Updates to Chapter 5, Volume 3A
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• This is the paging chapter. It contains a few references to the new Chapter 4 (especially about LAM).



CHAPTER 5
PAGING

Chapter 3 explains how segmentation converts logical addresses to linear addresses. Paging (or linear-address
translation) is the process of translating linear addresses so that they can be used to access memory or I/O
devices. Paging translates each linear address to a physical address and determines, for each translation, what
accesses to the linear address are allowed (the address’s access rights) and the type of caching used for such
accesses (the address’s memory type).
Intel-64 processors support four different paging modes. These modes are identified and defined in Section 5.1.
Section 5.2 gives an overview of the translation mechanism that is used in all modes. Section 5.3, Section 5.4, and
Section 5.5 discuss the four paging modes in detail.
Section 5.6 details how paging determines and uses access rights. Section 5.7 discusses exceptions that may be
generated by paging (page-fault exceptions). Section 5.8 considers data which the processor writes in response to
linear-address accesses (accessed and dirty flags).
Section 5.9 describes how paging determines the memory types used for accesses to linear addresses. Section
5.10 provides details of how a processor may cache information about linear-address translation. Section 5.11
outlines interactions between paging and certain VMX features. Section 5.12 gives an overview of how paging can
be used to implement virtual memory.

5.1 PAGING MODES AND CONTROL BITS


Paging behavior is controlled by the following control bits:
• The WP and PG flags in control register CR0 (bit 16 and bit 31, respectively).
• The PSE, PAE, PGE, LA57, PCIDE, SMEP, SMAP, PKE, CET, and PKS flags in control register CR4 (bit 4, bit 5,
bit 7, bit 12, bit 17, bit 20, bit 21, bit 22, bit 23, and bit 24, respectively).
• The LME and NXE flags in the IA32_EFER MSR (bit 8 and bit 11, respectively).
• The AC flag in the EFLAGS register (bit 18).
• The “enable HLAT” VM-execution control (tertiary processor-based VM-execution control bit 1; see Section
26.6.2, “Processor-Based VM-Execution Controls,” in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3C).
Software enables paging by using the MOV to CR0 instruction to set CR0.PG. Before doing so, software should
ensure that control register CR3 contains the physical address of the first paging structure that the processor will
use for linear-address translation (see Section 5.2) and that that structure is initialized as desired. See Table 5-3,
Table 5-7, and Table 5-12 for the use of CR3 in the different paging modes.
Section 5.1.1 describes how the values of CR0.PG, CR4.PAE, CR4.LA57, and IA32_EFER.LME determine whether
paging is enabled and, if so, which of four paging modes is in use. Section 5.1.2 explains how to manage these bits
to establish or make changes in paging modes. Section 5.1.3 discusses how CR0.WP, CR4.PSE, CR4.PGE,
CR4.PCIDE, CR4.SMEP, CR4.SMAP, CR4.PKE, CR4.CET, CR4.PKS, and IA32_EFER.NXE modify the operation of the
different paging modes.

5.1.1 Four Paging Modes


If CR0.PG = 0, paging is not used. The logical processor treats all linear addresses as if they were physical
addresses. CR4.PAE, CR4.LA57, and IA32_EFER.LME are ignored by the processor, as are CR0.WP, CR4.PSE,
CR4.PGE, CR4.SMEP, CR4.SMAP, and IA32_EFER.NXE. (CR4.CET is also ignored insofar as it affects linear-address
access rights.)
Paging is enabled if CR0.PG = 1. Paging can be enabled only if protection is enabled (CR0.PE = 1). If paging is
enabled, one of four paging modes is used. The values of CR4.PAE, CR4.LA57, and IA32_EFER.LME determine
which paging mode is used:


• If CR4.PAE = 0, 32-bit paging is used. 32-bit paging is detailed in Section 5.3. 32-bit paging uses CR0.WP,
CR4.PSE, CR4.PGE, CR4.SMEP, CR4.SMAP, and CR4.CET as described in Section 5.1.3 and Section 5.6.
• If CR4.PAE = 1 and IA32_EFER.LME = 0, PAE paging is used. PAE paging is detailed in Section 5.4. PAE paging
uses CR0.WP, CR4.PGE, CR4.SMEP, CR4.SMAP, CR4.CET, and IA32_EFER.NXE as described in Section 5.1.3 and
Section 5.6.
• If CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 0, 4-level paging1 is used.2 4-level paging is detailed
in Section 5.5 (along with 5-level paging). 4-level paging uses CR0.WP, CR4.PGE, CR4.PCIDE, CR4.SMEP,
CR4.SMAP, CR4.PKE, CR4.CET, CR4.PKS, and IA32_EFER.NXE as described in Section 5.1.3 and Section 5.6.
• If CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 1, 5-level paging is used. 5-level paging is detailed in
Section 5.5 (along with 4-level paging). 5-level paging uses CR0.WP, CR4.PGE, CR4.PCIDE, CR4.SMEP,
CR4.SMAP, CR4.PKE, CR4.CET, CR4.PKS, and IA32_EFER.NXE as described in Section 5.1.3 and Section 5.6.

NOTE
32-bit paging and PAE paging can be used only in legacy protected mode (IA32_EFER.LME = 0). In
contrast, 4-level paging and 5-level paging can be used only in IA-32e mode (IA32_EFER.LME = 1).
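For clarity, the mode-selection rules above can be written as a small decision function. This is an informal C sketch, not an architectural interface:

    enum paging_mode { PM_NONE, PM_32BIT, PM_PAE, PM_4LEVEL, PM_5LEVEL };

    /* Select the paging mode from CR0.PG, CR4.PAE, IA32_EFER.LME, and
       CR4.LA57, following the rules of Section 5.1.1. */
    static enum paging_mode paging_mode(int pg, int pae, int lme, int la57)
    {
        if (!pg)  return PM_NONE;     /* paging disabled */
        if (!pae) return PM_32BIT;    /* LME must be 0 in this case */
        if (!lme) return PM_PAE;
        return la57 ? PM_5LEVEL : PM_4LEVEL;
    }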
The four paging modes differ with regard to the following details:
• Linear-address width. The size of the linear addresses that can be translated.
• Physical-address width. The size of the physical addresses produced by paging.
• Page size. The granularity at which linear addresses are translated. Linear addresses on the same page are
translated to corresponding physical addresses on the same page.
• Support for execute-disable access rights. In some paging modes, software can be prevented from fetching
instructions from pages that are otherwise readable.
• Support for PCIDs. With 4-level paging and 5-level paging, software can enable a facility by which a logical
processor caches information for multiple linear-address spaces. The processor may retain cached information
when software switches between different linear-address spaces.
• Support for protection keys. With 4-level paging and 5-level paging, each linear address is associated with a
protection key. Software can use the protection-key rights registers to disable, for each protection key,
certain accesses to linear addresses associated with that protection key.
Table 5-1 illustrates the principal differences between the four paging modes.

1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
2. The LMA flag in the IA32_EFER MSR (bit 10) is a status bit that indicates whether the logical processor is in IA-32e mode (and thus
uses either 4-level paging or 5-level paging). The processor always sets IA32_EFER.LMA to CR0.PG & IA32_EFER.LME. Software can-
not directly modify IA32_EFER.LMA; an execution of WRMSR to the IA32_EFER MSR ignores bit 10 of its source operand.

Table 5-1. Properties of Different Paging Modes

Paging   PG in  PAE in  LME in     LA57 in  Lin.-Addr.  Phys.-Addr.  Page Sizes          Supports           Supports PCIDs and
Mode     CR0    CR4     IA32_EFER  CR4      Width       Width1                           Execute-Disable?   Protection Keys?

None     0      N/A     N/A        N/A      32          32           N/A                 No                 No
32-bit   1      0       02         N/A      32          Up to 403    4 KB, 4 MB4         No                 No
PAE      1      1       0          N/A      32          Up to 52     4 KB, 2 MB          Yes5               No
4-level  1      1       1          0        48          Up to 52     4 KB, 2 MB, 1 GB6   Yes5               Yes7
5-level  1      1       1          1        57          Up to 52     4 KB, 2 MB, 1 GB6   Yes5               Yes7

NOTES:
1. The physical-address width is always bounded by MAXPHYADDR; see Section 5.1.4.
2. The processor ensures that IA32_EFER.LME must be 0 if CR0.PG = 1 and CR4.PAE = 0.
3. 32-bit paging supports physical-address widths of more than 32 bits only for 4-MByte pages and only if the PSE-36 mechanism is
supported; see Section 5.1.4 and Section 5.3.
4. 32-bit paging uses 4-MByte pages only if CR4.PSE = 1; see Section 5.3.
5. Execute-disable access rights are applied only if IA32_EFER.NXE = 1; see Section 5.6.
6. Processors that support 4-level paging or 5-level paging do not necessarily support 1-GByte pages; see Section 5.1.4.
7. PCIDs are used only if CR4.PCIDE = 1; see Section 5.10.1. Protection keys are used only if certain conditions hold; see Section 5.6.2.

Because 32-bit paging and PAE paging are used only in legacy protected mode and because legacy protected mode
cannot produce linear addresses larger than 32 bits, 32-bit paging and PAE paging translate 32-bit linear
addresses.
4-level paging and 5-level paging are used only in IA-32e mode. IA-32e mode has two sub-modes:
• Compatibility mode. This sub-mode uses only 32-bit linear addresses. In this sub-mode, 4-level paging and 5-
level paging treat bits 63:32 of such an address as all 0. These addresses are subject to linear-address pre-
processing, specifically linear-address-space separation (Section 4.3).
• 64-bit mode. This sub-mode produces 64-bit linear addresses. These addresses are then subject to linear-
address pre-processing (Chapter 4). As part of this, the processor enforces canonicality (Section 4.5),
ensuring that the upper bits of such an address are identical: bits 63:47 for 4-level paging and bits 63:56 for
5-level paging. 4-level paging (respectively, 5-level paging) does not use bits 63:48 (respectively, bits 63:57)
of such addresses.

5.1.2 Paging-Mode Enabling


If CR0.PG = 1, a logical processor is in one of four paging modes, depending on the values of CR4.PAE,
IA32_EFER.LME, and CR4.LA57. Figure 5-1 illustrates how software can enable these modes and make transitions
between them. The following items identify certain limitations and other details:
• IA32_EFER.LME cannot be modified while paging is enabled (CR0.PG = 1). Attempts to do so using WRMSR
cause a general-protection exception (#GP(0)).
• Paging cannot be enabled (by setting CR0.PG to 1) while CR4.PAE = 0 and IA32_EFER.LME = 1. Attempts to do
so using MOV to CR0 cause a general-protection exception (#GP(0)).
• One node in Figure 5-1 is labeled “IA-32e mode.” This node represents either 4-level paging (if CR4.LA57 = 0)
or 5-level paging (if CR4.LA57 = 1). As noted in the following items, software cannot modify CR4.LA57
(effecting transition between 4-level paging and 5-level paging) without first disabling paging.
• CR4.PAE and CR4.LA57 cannot be modified while either 4-level paging or 5-level paging is in use (when
CR0.PG = 1 and IA32_EFER.LME = 1). Attempts to do so using MOV to CR4 cause a general-protection
exception (#GP(0)).
• Regardless of the current paging mode, software can disable paging by clearing CR0.PG with MOV to CR0.1

1. If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an attempt to clear CR0.PG causes a general-protection exception
(#GP). Software should transition to compatibility mode and clear CR4.PCIDE before attempting to disable paging.


[Figure: state diagram of the paging modes and the transitions among them. Nodes are combinations of CR0.PG, CR4.PAE, and
IA32_EFER.LME (no paging, 32-bit paging, PAE paging, and 4-level paging / IA-32e mode); edges are MOV to CR0, MOV to CR4, and
WRMSR operations that set or clear PG, PAE, and LME, with #GP marking the disallowed transitions identified in the text.]

Figure 5-1. Enabling and Changing Paging Modes

• Software can transition between 32-bit paging and PAE paging by changing the value of CR4.PAE with MOV to
CR4.
• Software cannot transition directly between 4-level paging (or 5-level paging) and any other paging mode.
It must first disable paging (by clearing CR0.PG with MOV to CR0), then set CR4.PAE, IA32_EFER.LME, and
CR4.LA57 to the desired values (with MOV to CR4 and WRMSR), and then re-enable paging (by setting CR0.PG
with MOV to CR0). As noted earlier, an attempt to modify CR4.PAE, IA32_EFER.LME, or CR4.LA57 while 4-level
paging or 5-level paging is enabled causes a general-protection exception (#GP(0)).
• VMX transitions allow transitions between paging modes that are not possible using MOV to CR or WRMSR. This
is because VMX transitions can load CR0, CR4, and IA32_EFER in one operation. See Section 5.11.1.

5.1.3 Paging-Mode Modifiers


Details of how each paging mode operates are determined by the following control bits:
• The WP flag in CR0 (bit 16).
• The PSE, PGE, PCIDE, SMEP, SMAP, PKE, CET, and PKS flags in CR4 (bit 4, bit 7, bit 17, bit 20, bit 21, bit 22,
bit 23, and bit 24, respectively).
• The NXE flag in the IA32_EFER MSR (bit 11).
• The “enable HLAT” VM-execution control (tertiary processor-based VM-execution control bit 1).


CR0.WP allows pages to be protected from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write
accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not. (User-mode
write accesses are never allowed to linear addresses with read-only access rights, regardless of the value of
CR0.WP.) Section 5.6 explains how access rights are determined, including the definition of supervisor-mode and
user-mode accesses.
CR4.PSE enables 4-MByte pages for 32-bit paging. If CR4.PSE = 0, 32-bit paging can use only 4-KByte pages; if
CR4.PSE = 1, 32-bit paging can use both 4-KByte pages and 4-MByte pages. See Section 5.3 for more information.
(PAE paging, 4-level paging, and 5-level paging can use multiple page sizes regardless of the value of CR4.PSE.)
CR4.PGE enables global pages. If CR4.PGE = 0, no translations are shared across address spaces; if CR4.PGE = 1,
specified translations may be shared across address spaces. See Section 5.10.2.4 for more information.
CR4.PCIDE enables process-context identifiers (PCIDs) for 4-level paging and 5-level paging. PCIDs allow a logical
processor to cache information for multiple linear-address spaces. See Section 5.10.1 for more information.
CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software
operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.
Section 5.6 explains how access rights are determined, including the definition of supervisor-mode accesses and
user-mode accessibility.
CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software oper-
ating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can
override this protection by setting EFLAGS.AC. Section 5.6 explains how access rights are determined, including
the definition of supervisor-mode accesses and user-mode accessibility.
CR4.PKE and CR4.PKS enable specification of access rights based on protection keys. 4-level paging and 5-level
paging associate each linear address with a protection key. When CR4.PKE = 1, the PKRU register specifies, for
each protection key, whether user-mode linear addresses with that protection key can be read or written. When
CR4.PKS = 1, the IA32_PKRS MSR does the same for supervisor-mode linear addresses. See Section 5.6 for more
information.
CR4.CET enables control-flow enforcement technology, including the shadow-stack feature. If CR4.CET = 1,
certain memory accesses are identified as shadow-stack accesses and certain linear addresses translate to
shadow-stack pages. Section 5.6 explains how access rights are determined for these accesses and pages. (The
processor allows CR4.CET to be set only if CR0.WP is also set.)
IA32_EFER.NXE enables execute-disable access rights for PAE paging, 4-level paging, and 5-level paging. If
IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from
the addresses are allowed). Section 5.6 explains how access rights are determined. (IA32_EFER.NXE has no effect
with 32-bit paging. Software that wants to use this feature to limit instruction fetches from readable pages must
use PAE paging, 4-level paging, or 5-level paging.)
The “enable HLAT” VM-execution control enables HLAT paging for 4-level paging and 5-level paging. HLAT paging
does not use control register CR3 to identify the address of the first paging structure used for linear-address trans-
lation; instead, that structure is located using a field in the virtual-machine control structure (VMCS). In addition,
HLAT paging interprets certain bits in paging-structure entries differently than ordinary paging. See Section 5.5 for
details.

5.1.4 Enumeration of Paging Features by CPUID


Software can discover support for different paging features using the CPUID instruction:
• PSE: page-size extensions for 32-bit paging.
If CPUID.01H:EDX.PSE [bit 3] = 1, CR4.PSE may be set to 1, enabling support for 4-MByte pages with 32-bit
paging (see Section 5.3).
• PAE: physical-address extension.
If CPUID.01H:EDX.PAE [bit 6] = 1, CR4.PAE may be set to 1, enabling PAE paging (this setting is also required
for 4-level paging and 5-level paging).
• PGE: global-page support.
If CPUID.01H:EDX.PGE [bit 13] = 1, CR4.PGE may be set to 1, enabling the global-page feature (see Section
5.10.2.4).


• PAT: page-attribute table.


If CPUID.01H:EDX.PAT [bit 16] = 1, the 8-entry page-attribute table (PAT) is supported. When the PAT is
supported, three bits in certain paging-structure entries select a memory type (used to determine type of
caching used) from the PAT (see Section 5.9.2).
• PSE-36: page-size extensions with 40-bit physical-address extension.
If CPUID.01H:EDX.PSE-36 [bit 17] = 1, the PSE-36 mechanism is supported, indicating that translations using
4-MByte pages with 32-bit paging may produce physical addresses with up to 40 bits (see Section 5.3).
• PCID: process-context identifiers.
If CPUID.01H:ECX.PCID [bit 17] = 1, CR4.PCIDE may be set to 1, enabling process-context identifiers (see
Section 5.10.1).
• SMEP: supervisor-mode execution prevention.
If CPUID.(EAX=07H,ECX=0H):EBX.SMEP [bit 7] = 1, CR4.SMEP may be set to 1, enabling supervisor-mode
execution prevention (see Section 5.6).
• SMAP: supervisor-mode access prevention.
If CPUID.(EAX=07H,ECX=0H):EBX.SMAP [bit 20] = 1, CR4.SMAP may be set to 1, enabling supervisor-mode
access prevention (see Section 5.6).
• PKU: protection keys for user-mode pages.
If CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3] = 1, CR4.PKE may be set to 1, enabling protection keys for
user-mode pages (see Section 5.6).
• OSPKE: enabling of protection keys for user-mode pages.
CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4] returns the value of CR4.PKE. Thus, protection keys for user-
mode pages are enabled if this flag is 1 (see Section 5.6).
• CET: control-flow enforcement technology.
If CPUID.(EAX=07H,ECX=0H):ECX.CET_SS [bit 7] = 1, CR4.CET may be set to 1, enabling shadow-stack
pages (see Section 5.6).
• LA57: 57-bit linear addresses and 5-level paging.
If CPUID.(EAX=07H,ECX=0):ECX.LA57 [bit 16] = 1, CR4.LA57 may be set to 1, enabling 5-level paging.
• PKS: protection keys for supervisor-mode pages.
If CPUID.(EAX=07H,ECX=0H):ECX.PKS [bit 31] = 1, CR4.PKS may be set to 1, enabling protection keys for
supervisor-mode pages (see Section 5.6).
• NX: execute disable.
If CPUID.80000001H:EDX.NX [bit 20] = 1, IA32_EFER.NXE may be set to 1, allowing software to disable
execute access to selected pages (see Section 5.6). (Processors that do not support CPUID function
80000001H do not allow IA32_EFER.NXE to be set to 1.)
• Page1GB: 1-GByte pages.
If CPUID.80000001H:EDX.Page1GB [bit 26] = 1, 1-GByte pages may be supported with 4-level paging and 5-
level paging (see Section 5.5).
• LM: IA-32e mode support.
If CPUID.80000001H:EDX.LM [bit 29] = 1, IA32_EFER.LME may be set to 1, enabling IA-32e mode (with either
4-level paging or 5-level paging). (Processors that do not support CPUID function 80000001H do not allow
IA32_EFER.LME to be set to 1.)
• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors
that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1
and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
• CPUID.80000008H:EAX[15:8] reports the linear-address width supported by the processor. Generally, this
value is reported as follows:
— If CPUID.80000001H:EDX.LM [bit 29] = 0, the value is reported as 32.
— If CPUID.80000001H:EDX.LM [bit 29] = 1 and CPUID.(EAX=07H,ECX=0):ECX.LA57 [bit 16] = 0, the
value is reported as 48.
— If CPUID.(EAX=07H,ECX=0):ECX.LA57 [bit 16] = 1, the value is reported as 57.
(Processors that do not support CPUID function 80000008H support a linear-address width of 32.)
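As an example of querying these leaves, the following C sketch reads MAXPHYADDR and the linear-address width using the GCC/Clang <cpuid.h> helper; it assumes the processor supports CPUID function 80000008H:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        /* __get_cpuid returns 0 if the requested leaf is unsupported. */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            printf("MAXPHYADDR:           %u bits\n", eax & 0xFF);
            printf("Linear-address width: %u bits\n", (eax >> 8) & 0xFF);
        }
        return 0;
    }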


5.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW


All four paging modes translate linear addresses using hierarchical paging structures. This section provides an
overview of their operation. Section 5.3, Section 5.4, Section 5.5, and Section 5.6 provide details for the four
paging modes.
Every paging structure is 4096 Bytes in size and comprises a number of individual entries. With 32-bit paging,
each entry is 32 bits (4 bytes); there are thus 1024 entries in each structure. With the other paging modes, each
entry is 64 bits (8 bytes); there are thus 512 entries in each structure. (PAE paging includes one exception, a
paging structure that is 32 bytes in size, containing 4 64-bit entries.)
The processor uses the upper portion of a linear address to identify a series of paging-structure entries. The last of
these entries identifies the physical address of the region to which the linear address translates (called the page
frame). The lower portion of the linear address (called the page offset) identifies the specific address within that
region to which the linear address translates.
Each paging-structure entry contains a physical address, which is either the address of another paging structure or
the address of a page frame. In the first case, the entry is said to reference the other paging structure; in the
latter, the entry is said to map a page.
The first paging structure used for any translation is located at the physical address in CR3.1 A linear address is
translated using the following iterative procedure. A portion of the linear address (initially the uppermost bits)
selects an entry in a paging structure (initially the one located using CR3). If that entry references another paging
structure, the process continues with that paging structure and with the portion of the linear address immediately
below that just used. If instead the entry maps a page, the process completes: the physical address in the entry is
that of the page frame and the remaining lower portion of the linear address is the page offset.
The following items give an example for each of the four paging modes (each example locates a 4-KByte page
frame):
• With 32-bit paging, each paging structure comprises 1024 = 2^10 entries. For this reason, the translation
process uses 10 bits at a time from a 32-bit linear address. Bits 31:22 identify the first paging-structure entry
and bits 21:12 identify a second. The latter identifies the page frame. Bits 11:0 of the linear address are the
page offset within the 4-KByte page frame. (See Figure 5-2 for an illustration.)
• With PAE paging, the first paging structure comprises only 4 = 2^2 entries. Translation thus begins by using
bits 31:30 from a 32-bit linear address to identify the first paging-structure entry. Other paging structures
comprise 512 = 2^9 entries, so the process continues by using 9 bits at a time. Bits 29:21 identify a second
paging-structure entry and bits 20:12 identify a third. This last identifies the page frame. (See Figure 5-5 for
an illustration.)
• With 4-level paging, each paging structure comprises 512 = 2^9 entries and translation uses 9 bits at a time
from a 48-bit linear address. Bits 47:39 identify the first paging-structure entry, bits 38:30 identify a second,
bits 29:21 a third, and bits 20:12 identify a fourth. Again, the last identifies the page frame. (See Figure 5-8
for an illustration.)
• 5-level paging is similar to 4-level paging except that 5-level paging translates 57-bit linear addresses.
Bits 56:48 identify the first paging-structure entry, while the remaining bits are used as with 4-level paging.
The translation process in each of the examples above completes by identifying a page frame; the page frame is
part of the translation of the original linear address. In some cases, however, the paging structures may be
configured so that the translation process terminates before identifying a page frame. This occurs if the process
encounters a paging-structure entry that is marked “not present” (because its P flag — bit 0 — is clear) or in which
a reserved bit is set. In this case, there is no translation for the linear address; an access to that address causes a
page-fault exception (see Section 5.7).
In the examples above, a paging-structure entry maps a page with a 4-KByte page frame when only 12 bits remain
in the linear address; entries identified earlier always reference other paging structures. That may not apply in
other cases. The following items identify when an entry maps a page and when it references another paging struc-
ture:
• If more than 12 bits remain in the linear address, bit 7 (PS — page size) of the current paging-structure entry
is consulted. If the bit is 0, the entry references another paging structure; if the bit is 1, the entry maps a page.

1. If HLAT paging is in use, a different mechanism is used to identify the first paging structure. See Section 5.5 for more information.


• If only 12 bits remain in the linear address, the current paging-structure entry always maps a page (bit 7 is
used for other purposes).
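The iterative procedure and the page-mapping rules above can be combined into a single loop. The following C sketch covers the 512-entry structures (those of 4-level and 5-level paging, and the lower levels of PAE paging); read_phys64() is a hypothetical physical-memory accessor, reserved-bit and access-rights checks are omitted, and physical addresses are simplified to bits 51:12:

    #include <stdint.h>

    #define PADDR_MASK 0x000FFFFFFFFFF000ULL          /* bits 51:12 */

    extern uint64_t read_phys64(uint64_t pa);          /* hypothetical */

    /* 'shifts' lists the low bit of each 9-bit index field, top level
       first; e.g., {39, 30, 21, 12} for 4-level paging or
       {47, 39, 30, 21, 12} for 5-level paging. Returns 0 and sets
       *entry_out/*page_shift if an entry maps a page; returns -1 if a
       page fault would occur. */
    static int walk(uint64_t root_pa, uint64_t la, const int *shifts,
                    int n, uint64_t *entry_out, int *page_shift)
    {
        for (int i = 0; i < n; i++) {
            uint64_t idx   = (la >> shifts[i]) & 0x1FF;
            uint64_t entry = read_phys64((root_pa & PADDR_MASK) + idx * 8);
            if (!(entry & 1))
                return -1;                     /* not present */
            if (shifts[i] == 12 || (entry & (1ULL << 7))) {
                *entry_out  = entry;           /* entry maps a page */
                *page_shift = shifts[i];
                return 0;
            }
            root_pa = entry;                   /* references next structure */
        }
        return -1;
    }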
If a paging-structure entry maps a page when more than 12 bits remain in the linear address, the entry identifies
a page frame larger than 4 KBytes. For example, 32-bit paging uses the upper 10 bits of a linear address to locate
the first paging-structure entry; 22 bits remain. If that entry maps a page, the page frame is 2^22 Bytes = 4 MBytes.
32-bit paging can use 4-MByte pages if CR4.PSE = 1. The other paging modes can use 2-MByte pages (regardless
of the value of CR4.PSE). 4-level paging and 5-level paging can use 1-GByte pages if the processor supports them
(see Section 5.1.4).
Paging structures are given different names based on their uses in the translation process. Table 5-2 gives the
names of the different paging structures. It also provides, for each structure, the source of the physical address
used to locate it (CR3 or a different paging-structure entry); the bits in the linear address used to select an entry
from the structure; and details of whether and how such an entry can map a page.

Table 5-2. Paging Structures in the Different Paging Modes

Paging Structure    Entry   Paging Mode             Physical Address   Bits Selecting   Page Mapping
                    Name                            of Structure       Entry

PML5 table          PML5E   32-bit, PAE, 4-level    N/A
                            5-level                 CR31               56:48            N/A (PS must be 0)

PML4 table          PML4E   32-bit, PAE             N/A
                            4-level                 CR31               47:39            N/A (PS must be 0)
                            5-level                 PML5E              47:39            N/A (PS must be 0)

Page-directory-     PDPTE   32-bit                  N/A
pointer table               PAE                     CR3                31:30            N/A (PS must be 0)
                            4-level, 5-level        PML4E              38:30            1-GByte page if PS=12

Page directory      PDE     32-bit                  CR3                31:22            4-MByte page if PS=13
                            PAE, 4-level, 5-level   PDPTE              29:21            2-MByte page if PS=1

Page table          PTE     32-bit                  PDE                21:12            4-KByte page
                            PAE, 4-level, 5-level   PDE                20:12            4-KByte page

NOTES:
1. If HLAT paging is in use, a different mechanism is used to identify the first paging structure. See Section 5.5 for more information.
2. Not all processors support 1-GByte pages; see Section 5.1.4.
3. 32-bit paging ignores the PS flag in a PDE (and uses the entry to reference a page table) unless CR4.PSE = 1. Not all processors sup-
port 4-MByte pages with 32-bit paging; see Section 5.1.4.


5.3 32-BIT PAGING


A logical processor uses 32-bit paging if CR0.PG = 1 and CR4.PAE = 0. 32-bit paging translates 32-bit linear
addresses to 40-bit physical addresses.1 Although 40 bits corresponds to 1 TByte, linear addresses are limited to
32 bits; at most 4 GBytes of linear-address space may be accessed at any given time.
32-bit paging uses a hierarchy of paging structures to produce a translation for a linear address. CR3 is used to
locate the first paging structure, the page directory. Table 5-3 illustrates how CR3 is used with 32-bit paging.
32-bit paging may map linear addresses to either 4-KByte pages or 4-MByte pages. Figure 5-2 illustrates the
translation process when it uses a 4-KByte page; Figure 5-3 covers the case of a 4-MByte page. The following
items describe the 32-bit paging process in more detail as well as how the page size is determined:
• A 4-KByte naturally aligned page directory is located at the physical address specified in bits 31:12 of CR3 (see
Table 5-3). A page directory comprises 1024 32-bit entries (PDEs). A PDE is selected using the physical address
defined as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from CR3.
— Bits 11:2 are bits 31:22 of the linear address.
— Bits 1:0 are 0.
Because a PDE is identified using bits 31:22 of the linear address, it controls access to a 4-MByte region of the
linear-address space. Use of the PDE depends on CR4.PSE and the PDE’s PS flag (bit 7):
• If CR4.PSE = 1 and the PDE’s PS flag is 1, the PDE maps a 4-MByte page (see Table 5-4). The final physical
address is computed as follows:
— Bits 39:32 are bits 20:13 of the PDE.
— Bits 31:22 are bits 31:22 of the PDE.2
— Bits 21:0 are from the original linear address.
• If CR4.PSE = 0 or the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical
address specified in bits 31:12 of the PDE (see Table 5-5). A page table comprises 1024 32-bit entries (PTEs).
A PTE is selected using the physical address defined as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from the PDE.
— Bits 11:2 are bits 21:12 of the linear address.
— Bits 1:0 are 0.
• Because a PTE is identified using bits 31:12 of the linear address, every PTE maps a 4-KByte page (see
Table 5-6). The final physical address is computed as follows:
— Bits 39:32 are all 0.
— Bits 31:12 are from the PTE.
— Bits 11:0 are from the original linear address.
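Putting the address compositions above together, a 32-bit paging walk can be sketched as follows in C. read_phys32() is a hypothetical physical-memory accessor; reserved-bit and access-rights checks are omitted:

    #include <stdint.h>

    extern uint32_t read_phys32(uint64_t pa);          /* hypothetical */

    static int translate_32bit(uint32_t cr3, int cr4_pse, uint32_t la,
                               uint64_t *pa)
    {
        uint32_t pde = read_phys32((cr3 & 0xFFFFF000u) |
                                   (((la >> 22) & 0x3FFu) << 2));
        if (!(pde & 1))
            return -1;                                 /* page fault */
        if (cr4_pse && (pde & (1u << 7))) {            /* 4-MByte page */
            *pa = ((uint64_t)((pde >> 13) & 0xFFu) << 32)  /* PSE-36: 39:32 */
                | (pde & 0xFFC00000u)                      /* bits 31:22 */
                | (la & 0x3FFFFFu);                        /* bits 21:0 */
            return 0;
        }
        uint32_t pte = read_phys32((pde & 0xFFFFF000u) |
                                   (((la >> 12) & 0x3FFu) << 2));
        if (!(pte & 1))
            return -1;                                 /* page fault */
        *pa = (pte & 0xFFFFF000u) | (la & 0xFFFu);
        return 0;
    }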
If a paging-structure entry’s P flag (bit 0) is 0 or if the entry sets any reserved bit, the entry is used neither to refer-
ence another paging-structure entry nor to map a page. There is no translation for a linear address whose transla-
tion would use such a paging-structure entry; a reference to such a linear address causes a page-fault exception
(see Section 5.7).

1. Bits in the range 39:32 are 0 in any physical address used by 32-bit paging except those used to map 4-MByte pages. If the proces-
sor does not support the PSE-36 mechanism, this is true also for physical addresses used to map 4-MByte pages. If the processor
does support the PSE-36 mechanism and MAXPHYADDR < 40, bits in the range 39:MAXPHYADDR are 0 in any physical address used
to map a 4-MByte page. (The corresponding bits are reserved in PDEs.) See Section 5.1.4 for how to determine MAXPHYADDR and
whether the PSE-36 mechanism is supported.
2. The upper bits in the final physical address do not all come from corresponding positions in the PDE; the physical-address bits in the
PDE are not all contiguous.


With 32-bit paging, there are reserved bits only if CR4.PSE = 1:


• If the P flag and the PS flag (bit 7) of a PDE are both 1, the bits reserved depend on MAXPHYADDR, and whether
the PSE-36 mechanism is supported:1
— If the PSE-36 mechanism is not supported, bits 21:13 are reserved.
— If the PSE-36 mechanism is supported, bits 21:(M–19) are reserved, where M is the minimum of 40 and
MAXPHYADDR.
• If the PAT is not supported:2
— If the P flag of a PTE is 1, bit 7 is reserved.
— If the P flag and the PS flag of a PDE are both 1, bit 12 is reserved.
(If CR4.PSE = 0, no bits are reserved with 32-bit paging.)
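As an illustration of the first item, a minimal C sketch computing which bits are reserved in a PDE that maps a 4-MByte page, under the rules above (the helper name is hypothetical):

    #include <stdint.h>

    /* Bits 21:(M-19) are reserved, where M = min(40, MAXPHYADDR) if the
       PSE-36 mechanism is supported and M = 32 otherwise. */
    static uint32_t pde_4mb_reserved_mask(int maxphyaddr, int pse36)
    {
        int m = pse36 ? (maxphyaddr < 40 ? maxphyaddr : 40) : 32;
        uint32_t mask = 0;
        for (int b = m - 19; b <= 21; b++)
            mask |= 1u << b;
        return mask;
    }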
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 5.6.

[Figure: a 32-bit linear address is split into Directory (bits 31:22), Table (bits 21:12), and Offset (bits 11:0). CR3 locates the page
directory; the selected PDE (with PS=0) locates the page table; the selected PTE locates the 4-KByte page frame; the offset selects
the physical address within it.]

Figure 5-2. Linear-Address Translation to a 4-KByte Page using 32-Bit Paging

1. See Section 5.1.4 for how to determine MAXPHYADDR and whether the PSE-36 mechanism is supported.
2. See Section 5.1.4 for how to determine whether the PAT is supported.


[Figure: a 32-bit linear address is split into Directory (bits 31:22) and Offset (bits 21:0). CR3 locates the page directory; the selected
PDE (with PS=1) locates the 4-MByte page frame; the offset selects the physical address within it.]

Figure 5-3. Linear-Address Translation to a 4-MByte Page using 32-Bit Paging

Figure 5-4 gives a summary of the formats of CR3 and the paging-structure entries with 32-bit paging. For the
paging structure entries, it identifies separately the format of entries that map pages, those that reference other
paging structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are high-
lighted because they determine how such an entry is used.

[Figure: bit-field layouts of CR3 and of the 32-bit paging-structure entries: a PDE mapping a 4-MByte page, a PDE referencing a page
table, a PTE mapping a 4-KByte page, and the not-present form of each. Bit 0 (P) and bit 7 (PS) are highlighted because they
determine how an entry is used; the remaining fields (physical-address bits, PWT, PCD, A, D, G, PAT, U/S, R/W, and ignored/reserved
bits) are those detailed in Table 5-3 through Table 5-6.]

Figure 5-4. Formats of CR3 and Paging-Structure Entries with 32-Bit Paging
NOTES:
1. CR3 has 64 bits on processors supporting the Intel-64 architecture. These bits are ignored with 32-bit paging.
2. This example illustrates a processor in which MAXPHYADDR is 36. If this value is larger or smaller, the number of bits reserved in
positions 20:13 of a PDE mapping a 4-MByte page will change.


Table 5-3. Use of CR3 with 32-Bit Paging

Bit Position(s)    Contents

2:0 Ignored

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory during linear-
address translation (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory during linear-
address translation (see Section 5.9)

11:5 Ignored

31:12 Physical address of the 4-KByte aligned page directory used for linear-address translation

63:32 Ignored (these bits exist only on processors supporting the Intel-64 architecture)

Table 5-4. Format of a 32-Bit Page-Directory Entry that Maps a 4-MByte Page

Bit Position(s)    Contents

0 (P) Present; must be 1 to map a 4-MByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 4-MByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-MByte page referenced by
this entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-MByte page referenced by
this entry (see Section 5.9)

5 (A) Accessed; indicates whether software has accessed the 4-MByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 4-MByte page referenced by this entry (see Section 5.8)

7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-5)

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

11:9 Ignored

12 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-MByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1

(M–20):13 Bits (M–1):32 of physical address of the 4-MByte page referenced by this entry2

21:(M–19) Reserved (must be 0)

31:22 Bits 31:22 of physical address of the 4-MByte page referenced by this entry

NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.
2. If the PSE-36 mechanism is not supported, M is 32, and this row does not apply. If the PSE-36 mechanism is supported, M is the min-
imum of 40 and MAXPHYADDR (this row does not apply if MAXPHYADDR = 32). See Section 5.1.4 for how to determine MAXPHY-
ADDR and whether the PSE-36 mechanism is supported.


Table 5-5. Format of a 32-Bit Page-Directory Entry that References a Page Table

Bit Position(s)    Contents

0 (P) Present; must be 1 to reference a page table

1 (R/W) Read/write; if 0, writes may not be allowed to the 4-MByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte region controlled by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored

7 (PS) If CR4.PSE = 1, must be 0 (otherwise, this entry maps a 4-MByte page; see Table 5-4); otherwise, ignored

11:8 Ignored

31:12 Physical address of 4-KByte aligned page table referenced by this entry

Table 5-6. Format of a 32-Bit Page-Table Entry that Maps a 4-KByte Page

Bit Position(s)    Contents

0 (P) Present; must be 1 to map a 4-KByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)

5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)

7 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

11:9 Ignored

31:12 Physical address of the 4-KByte page referenced by this entry

NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.


5.4 PAE PAGING


A logical processor uses PAE paging if CR0.PG = 1, CR4.PAE = 1, and IA32_EFER.LME = 0. PAE paging translates
32-bit linear addresses to 52-bit physical addresses.1 Although 52 bits corresponds to 4 PBytes, linear addresses
are limited to 32 bits; at most 4 GBytes of linear-address space may be accessed at any given time.
With PAE paging, a logical processor maintains a set of four (4) PDPTE registers, which are loaded from an address
in CR3. Linear addresses are translated using 4 hierarchies of in-memory paging structures, each located using one
of the PDPTE registers. (This is different from the other paging modes, in which there is one hierarchy referenced
by CR3.)
Section 5.4.1 discusses the PDPTE registers. Section 5.4.2 describes linear-address translation with PAE paging.

5.4.1 PDPTE Registers


When PAE paging is used, CR3 references the base of a 32-Byte page-directory-pointer table. Table 5-7 illus-
trates how CR3 is used with PAE paging.

Table 5-7. Use of CR3 with PAE Paging

Bit Position(s)    Contents

4:0 Ignored

31:5 Physical address of the 32-Byte aligned page-directory-pointer table used for linear-address translation

63:32 Ignored (these bits exist only on processors supporting the Intel-64 architecture)

The page-directory-pointer table comprises four (4) 64-bit entries called PDPTEs. Each PDPTE controls access to a
1-GByte region of the linear-address space. Corresponding to the PDPTEs, the logical processor maintains a set of
four (4) internal, non-architectural PDPTE registers, called PDPTE0, PDPTE1, PDPTE2, and PDPTE3. The logical
processor loads these registers from the PDPTEs in memory as part of certain operations:
• If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 5.1.1) and the
instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the
PDPTEs are loaded from the address in CR3.
• If MOV to CR3 is executed while the logical processor is using PAE paging, the PDPTEs are loaded from the
address being loaded into CR3.
• If PAE paging is in use and a task switch changes the value of CR3, the PDPTEs are loaded from the address in
the new CR3 value.
• Certain VMX transitions load the PDPTE registers. See Section 5.11.1.
Table 5-8 gives the format of a PDPTE. If any of the PDPTEs sets both the P flag (bit 0) and any reserved bit, the
MOV to CR instruction causes a general-protection exception (#GP(0)) and the PDPTEs are not loaded.2 As shown
in Table 5-8, bits 2:1, 8:5, and 63:MAXPHYADDR are reserved in the PDPTEs.

1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by PAE paging. (The corresponding
bits are reserved in the paging-structure entries.) See Section 5.1.4 for how to determine MAXPHYADDR.
2. On some processors, reserved bits are checked even in PDPTEs in which the P flag (bit 0) is 0.


Table 5-8. Format of a PAE Page-Directory-Pointer-Table Entry (PDPTE)

Bit Position(s)    Contents

0 (P) Present; must be 1 to reference a page directory

2:1 Reserved (must be 0)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9)

8:5 Reserved (must be 0)

11:9 Ignored

(M–1):12 Physical address of 4-KByte aligned page directory referenced by this entry1

63:M Reserved (must be 0)

NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.

5.4.2 Linear-Address Translation with PAE Paging


PAE paging may map linear addresses to either 4-KByte pages or 2-MByte pages. Figure 5-5 illustrates the trans-
lation process when it produces a 4-KByte page; Figure 5-6 covers the case of a 2-MByte page. The following items
describe the PAE paging process in more detail as well as how the page size is determined:
• Bits 31:30 of the linear address select a PDPTE register (see Section 5.4.1); this is PDPTEi, where i is the value
of bits 31:30.1 Because a PDPTE register is identified using bits 31:30 of the linear address, it controls access
to a 1-GByte region of the linear-address space. If the P flag (bit 0) of PDPTEi is 0, the processor ignores bits
63:1, and there is no mapping for the 1-GByte region controlled by PDPTEi. A reference using a linear address
in this region causes a page-fault exception (see Section 5.7).
• If the P flag of PDPTEi is 1, a 4-KByte naturally aligned page directory is located at the physical address specified
in bits 51:12 of PDPTEi (see Table 5-8 in Section 5.4.1). A page directory comprises 512 64-bit entries (PDEs).
A PDE is selected using the physical address defined as follows:
— Bits 51:12 are from PDPTEi.
— Bits 11:3 are bits 29:21 of the linear address.
— Bits 2:0 are 0.
Because a PDE is identified using bits 31:21 of the linear address, it controls access to a 2-MByte region of the
linear-address space. Use of the PDE depends on its PS flag (bit 7):
• If the PDE’s PS flag is 1, the PDE maps a 2-MByte page (see Table 5-9). The final physical address is computed
as follows:
— Bits 51:21 are from the PDE.
— Bits 20:0 are from the original linear address.
• If the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical address specified in
bits 51:12 of the PDE (see Table 5-10). A page table comprises 512 64-bit entries (PTEs). A PTE is selected
using the physical address defined as follows:
— Bits 51:12 are from the PDE.

1. With PAE paging, the processor does not use CR3 when translating a linear address (as it does in the other paging modes). It does
not access the PDPTEs in the page-directory-pointer table during linear-address translation.


— Bits 11:3 are bits 20:12 of the linear address.


— Bits 2:0 are 0.
Because a PTE is identified using bits 31:12 of the linear address, every PTE maps a 4-KByte page (see
Table 5-11). The final physical address is computed as follows:
— Bits 51:12 are from the PTE.
— Bits 11:0 are from the original linear address.
If the P flag (bit 0) of a PDE or a PTE is 0 or if a PDE or a PTE sets any reserved bit, the entry is used neither to
reference another paging-structure entry nor to map a page. There is no translation for a linear address whose
translation would use such a paging-structure entry; a reference to such a linear address causes a page-fault
exception (see Section 5.7).
The following bits are reserved with PAE paging:
• If the P flag (bit 0) of a PDE or a PTE is 1, bits 62:MAXPHYADDR are reserved.
• If the P flag and the PS flag (bit 7) of a PDE are both 1, bits 20:13 are reserved.
• If IA32_EFER.NXE = 0 and the P flag of a PDE or a PTE is 1, the XD flag (bit 63) is reserved.
• If the PAT is not supported:1
— If the P flag of a PTE is 1, bit 7 is reserved.
— If the P flag and the PS flag of a PDE are both 1, bit 12 is reserved.
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 5.6.
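
The translation steps just described can be summarized in C. The sketch below is illustrative only: read_phys64() is a hypothetical stand-in for a physical-memory read, and the reserved-bit and access-rights checks described above are omitted for brevity.

    #include <stdint.h>

    extern uint64_t read_phys64(uint64_t pa);   /* hypothetical helper */

    #define P_FLAG        (1ULL << 0)
    #define PS_FLAG       (1ULL << 7)
    #define ADDR_51_12(e) ((e) & 0x000FFFFFFFFFF000ULL)

    /* Translate a 32-bit linear address with PAE paging; returns the physical
       address, or 0 as a stand-in for a page-fault exception (#PF). */
    static uint64_t pae_translate(const uint64_t pdpte_reg[4], uint32_t la)
    {
        uint64_t pdpte = pdpte_reg[(la >> 30) & 3];         /* bits 31:30 */
        if (!(pdpte & P_FLAG))
            return 0;
        uint64_t pde = read_phys64(ADDR_51_12(pdpte)        /* bits 29:21 */
                                   | (((uint64_t)(la >> 21) & 0x1FF) << 3));
        if (!(pde & P_FLAG))
            return 0;
        if (pde & PS_FLAG)                                  /* 2-MByte page */
            return (pde & 0x000FFFFFFFE00000ULL) | (la & 0x1FFFFFULL);
        uint64_t pte = read_phys64(ADDR_51_12(pde)          /* bits 20:12 */
                                   | (((uint64_t)(la >> 12) & 0x1FF) << 3));
        if (!(pte & P_FLAG))
            return 0;
        return ADDR_51_12(pte) | (la & 0xFFFULL);           /* 4-KByte page */
    }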

[Figure: the linear address is split into Directory Pointer (bits 31:30), Directory (bits 29:21), Table (bits 20:12), and Offset (bits 11:0); the Directory Pointer selects one of the four PDPTE registers, the Directory field indexes the page directory (PDE with PS=0), the Table field indexes the page table, and the Offset selects the byte within the 4-KByte page.]

Figure 5-5. Linear-Address Translation to a 4-KByte Page using PAE Paging

1. See Section 5.1.4 for how to determine whether the PAT is supported.


[Figure: the linear address is split into Directory Pointer (bits 31:30), Directory (bits 29:21), and Offset (bits 20:0); the Directory Pointer selects one of the four PDPTE registers, the Directory field indexes the page directory (PDE with PS=1), and the Offset selects the byte within the 2-MByte page.]

Figure 5-6. Linear-Address Translation to a 2-MByte Page using PAE Paging

Table 5-9. Format of a PAE Page-Directory Entry that Maps a 2-MByte Page

Bit Contents
Position(s)

0 (P) Present; must be 1 to map a 2-MByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by this
entry (see Section 5.9)

5 (A) Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 5.8)

7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-10)

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

11:9 Ignored

12 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 2-MByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1

20:13 Reserved (must be 0)

(M–1):21 Physical address of the 2-MByte page referenced by this entry

62:M Reserved (must be 0)

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)

NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.


Table 5-10. Format of a PAE Page-Directory Entry that References a Page Table

Bit Contents
Position(s)

0 (P) Present; must be 1 to reference a page table

1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see
Section 5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored

7 (PS) Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 5-9)

11:8 Ignored

(M–1):12 Physical address of 4-KByte aligned page table referenced by this entry

62:M Reserved (must be 0)

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)

Table 5-11. Format of a PAE Page-Table Entry that Maps a 4-KByte Page

Bit Contents
Position(s)

0 (P) Present; must be 1 to map a 4-KByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by
this entry (see Section 5.9)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)

5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)

7 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise


11:9 Ignored

(M–1):12 Physical address of the 4-KByte page referenced by this entry

62:M Reserved (must be 0)

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)

NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.

Figure 5-7 gives a summary of the formats of CR3 and the paging-structure entries with PAE paging. For the paging
structure entries, it identifies separately the format of entries that map pages, those that reference other paging
structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted
because they determine how a paging-structure entry is used.

[Figure: bit-level layout of CR3 and of the PAE paging-structure entries (present PDPTE, not-present PDPTE, PDE mapping a 2-MByte page, PDE referencing a page table, not-present PDE, PTE mapping a 4-KByte page, and not-present PTE), showing the address, flag, ignored, and reserved fields in each format.]

Figure 5-7. Formats of CR3 and Paging-Structure Entries with PAE Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. CR3 has 64 bits only on processors supporting the Intel 64 architecture. These bits are ignored with PAE paging.
3. Reserved fields must be 0.
4. If IA32_EFER.NXE = 0 and the P flag of a PDE or a PTE is 1, the XD flag (bit 63) is reserved.


5.5 4-LEVEL PAGING AND 5-LEVEL PAGING


Because the operation of 4-level paging and 5-level paging is very similar, they are described together in this
section. The following items highlight the distinctions between the two paging modes:
• A logical processor uses 4-level paging if CR0.PG = 1, CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 0.
4-level paging translates 48-bit linear addresses to 52-bit physical addresses.1 Although 52 bits corresponds to
4 PBytes, linear addresses are limited to 48 bits; at most 256 TBytes of linear-address space may be accessed
at any given time.
• A logical processor uses 5-level paging if CR0.PG = 1, CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 1.
5-level paging translates 57-bit linear addresses to 52-bit physical addresses. Thus, 5-level paging supports a
linear-address space sufficient to access the entire physical-address space.
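
The mode selection above reduces to a few control-register bit tests. The following sketch is illustrative; the bit positions used (CR0.PG = bit 31, CR4.PAE = bit 5, CR4.LA57 = bit 12, IA32_EFER.LME = bit 8) are the architectural ones.

    #include <stdint.h>

    /* Classify the paging mode from the control bits listed above. */
    static const char *long_mode_paging(uint64_t cr0, uint64_t cr4, uint64_t efer)
    {
        if (!(cr0 & (1ULL << 31)) || !(cr4 & (1ULL << 5)) || !(efer & (1ULL << 8)))
            return "not 4-level or 5-level paging";
        return (cr4 & (1ULL << 12)) ? "5-level paging (57-bit linear addresses)"
                                    : "4-level paging (48-bit linear addresses)";
    }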

5.5.1 Ordinary Paging and HLAT Paging


There are two forms of 4-level paging and 5-level paging that differ principally with regard to how linear-address
translation identifies the first paging structure.
The normal form is called ordinary paging, and it uses CR3 to locate the first paging structure, similar to what is
done for 32-bit paging. Section 5.5.2 provides details of this use of CR3.
An alternative form of paging may be used with the VMX feature called hypervisor-managed linear-address trans-
lation (HLAT). Called HLAT paging, this form is used only in VMX non-root operation and only if the “enable HLAT”
VM-execution control is 1.2 HLAT paging locates the first paging structure using a VM-execution control field in the
VMCS called the HLAT pointer (HLATP). Section 5.5.3 provides details.
Whether HLAT paging is used to translate a specific linear address depends on the address and on the value of a
VM-execution control field in the VMCS called the HLAT prefix size:
• If the HLAT prefix size is zero, every linear address is translated using HLAT paging.
• If the HLAT prefix size is not zero, a linear address is translated using HLAT paging if bit 63 of the address is 1.3
The address is translated using ordinary paging if bit 63 of the address is 0.
In some cases, HLAT paging may specify that a translation of a linear address must be restarted. When this occurs,
the linear address is then translated using ordinary paging (starting with a paging structure identified using CR3).
The situations leading to this restart are detailed in Section 5.5.4, and additional details of the restart process are
given in Section 5.5.5.
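
The prefix-size rule above amounts to a single bit test. The following sketch is illustrative and assumes an enumerated maximum HLAT prefix size of 1 (see the footnote to this section).

    #include <stdbool.h>
    #include <stdint.h>

    /* Does this linear address use HLAT paging? */
    static bool uses_hlat_paging(uint64_t linear_addr, unsigned hlat_prefix_size)
    {
        if (hlat_prefix_size == 0)
            return true;                  /* every address uses HLAT paging */
        return (linear_addr >> 63) & 1;   /* bit 63 = 1 selects HLAT paging */
    }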

5.5.2 Use of CR3 with Ordinary 4-Level Paging and 5-Level Paging
Ordinary 4-level paging and 5-level paging each translate linear addresses using a hierarchy of in-memory paging
structures located using the contents of CR3, which is used to locate the first paging structure. For 4-level paging,
this is the PML4 table, and for 5-level paging it is the PML5 table. Use of CR3 with 4-level paging and 5-level paging
depends on whether process-context identifiers (PCIDs) have been enabled by setting CR4.PCIDE:
• Table 5-12 illustrates how CR3 is used with 4-level paging and 5-level paging if CR4.PCIDE = 0.

Table 5-12. Use of CR3 with 4-Level Paging and 5-level Paging and CR4.PCIDE = 0

Bit Contents
Position(s)

2:0 Ignored

1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by 4-level paging. (The correspond-
ing bits are reserved in the paging-structure entries.) See Section 5.1.4 for how to determine MAXPHYADDR.
2. HLAT paging is used only with 4-level paging and 5-level paging. It is never used with 32-bit paging or PAE paging, regardless of the
value of the “enable HLAT” VM-execution control.
3. This behavior applies if the CPU enumerates a maximum HLAT prefix size of 1 in IA32_VMX_EPT_VPID_CAP[53:48] (see Appendix
A.10). Behavior when a different value is enumerated is not currently defined.


3 (PWT) Page-level write-through; indirectly determines the memory type used to access the PML4 table or PML5 table
during linear-address translation (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the PML4 table or PML5 table during
linear-address translation (see Section 5.9.2)

11:5 Ignored

M–1:12 Physical address of the 4-KByte aligned PML4 table or PML5 table used for linear-address translation1

60:M Reserved (must be 0)

61 Enables LAM57 for user pointers; see Section 4.4.2

62 Enables LAM48 for user pointers; ignored if bit 61 is set.2

63 Reserved (must be 0)3

NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.
2. LAM is not a paging feature.
3. See Section 5.10.4.1 for use of bit 63 of the source operand of the MOV to CR3 instruction.

• Table 5-13 illustrates how CR3 is used with 4-level paging and 5-level paging if CR4.PCIDE = 1.

Table 5-13. Use of CR3 with 4-Level Paging and 5-Level Paging and CR4.PCIDE = 1

Bit Contents
Position(s)

11:0 PCID (see Section 5.10.1)1

M–1:12 Physical address of the 4-KByte aligned PML4 table used for linear-address translation2

60:M Reserved (must be 0)

61 Enables LAM57 for user pointers; see Section 4.4.3

62 Enables LAM48 for user pointers; ignored if bit 61 is set.3

63 Reserved (must be 0)4

NOTES:
1. Section 5.9.2 explains how the processor determines the memory type used to access the PML4 table during linear-address transla-
tion with CR4.PCIDE = 1.
2. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.
3. LAM is not a paging feature.
4. See Section 5.10.4.1 for use of bit 63 of the source operand of the MOV to CR3 instruction.

After software modifies the value of CR4.PCIDE, the logical processor immediately begins using CR3 as specified
for the new value. For example, if software changes CR4.PCIDE from 1 to 0, the current PCID immediately changes
from CR3[11:0] to 000H (see also Section 5.10.4.1). In addition, the logical processor subsequently determines
the memory type used to access the PML4 table using CR3.PWT and CR3.PCD, which had been bits 4:3 of the PCID.
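
The two CR3 layouts can be captured in a short sketch. It is illustrative only: the names are hypothetical, and the address mask uses the architectural maximum MAXPHYADDR of 52.

    #include <stdbool.h>
    #include <stdint.h>

    #define CR3_ADDR_51_12 0x000FFFFFFFFFF000ULL

    /* Physical address of the PML4 (or PML5) table; same in both layouts. */
    static uint64_t cr3_table_base(uint64_t cr3)
    {
        return cr3 & CR3_ADDR_51_12;
    }

    /* Current PCID: CR3[11:0] if CR4.PCIDE = 1, otherwise 000H. */
    static uint16_t cr3_pcid(uint64_t cr3, bool cr4_pcide)
    {
        return cr4_pcide ? (uint16_t)(cr3 & 0xFFF) : 0x000;
    }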


5.5.3 Use of HLATP with HLAT 4-Level Paging and 5-Level Paging
With HLAT paging, 4-level paging and 5-level paging each translate linear addresses using a hierarchy of in-
memory paging structures located using the value of HLATP (a VM-execution control field in the VMCS), which is
used to locate the first paging structure. For 4-level paging, this is the PML4 table, and for 5-level paging it is the
PML5 table.
HLATP has the same format as that given for CR3 in Table 5-12, with the exception that bits 2:0 and bits 11:5 are
reserved and must be zero (these bits are checked by VM entry). HLATP does not contain a PCID value. HLAT
paging with CR4.PCIDE = 1 uses the PCID value in CR3[11:0].

5.5.4 Linear-Address Translation with 4-Level Paging and 5-Level Paging


4-level paging and 5-level paging may map linear addresses to 4-KByte pages, 2-MByte pages, or 1-GByte pages.1
Figure 5-8 illustrates the translation process for 4-level paging when it produces a 4-KByte page; Figure 5-9 covers
the case of a 2-MByte page, and Figure 5-10 the case of a 1-GByte page. (The process for 5-level paging is similar.)

[Figure: the linear address is split into PML4 (bits 47:39), Directory Ptr (bits 38:30), Directory (bits 29:21), Table (bits 20:12), and Offset (bits 11:0); CR3 or HLATP locates the PML4 table, and each field in turn indexes the PML4 table, page-directory-pointer table, page directory (PDE with PS=0), and page table to reach the 4-KByte page.]

Figure 5-8. Linear-Address Translation to a 4-KByte Page Using 4-Level Paging

1. Not all processors support 1-GByte pages; see Section 5.1.4.


[Figure: the linear address is split into PML4 (bits 47:39), Directory Ptr (bits 38:30), Directory (bits 29:21), and Offset (bits 20:0); CR3 or HLATP locates the PML4 table, the first three fields index the PML4 table, page-directory-pointer table, and page directory, and a PDE with PS=1 maps the 2-MByte page selected by the Offset.]

Figure 5-9. Linear-Address Translation to a 2-MByte Page using 4-Level Paging

[Figure: the linear address is split into PML4 (bits 47:39), Directory Ptr (bits 38:30), and Offset (bits 29:0); CR3 or HLATP locates the PML4 table, the first two fields index the PML4 table and page-directory-pointer table, and a PDPTE with PS=1 maps the 1-GByte page selected by the Offset.]

Figure 5-10. Linear-Address Translation to a 1-GByte Page using 4-Level Paging


4-level paging and 5-level paging associate with each linear address a protection key. Section 5.6 explains how
the processor uses the protection key in its determination of the access rights of each linear address.
The remainder of this section describes the translation process used by 4-level paging and 5-level paging in more
detail, as well as how the page size and protection key are determined. Because the process used by the two
paging modes is similar, they are described together, with any differences identified, in the following items:
• With 5-level paging, a 4-KByte naturally aligned PML5 table is located at the physical address specified in
bits 51:12 of CR3 (see Table 5-12) or, for HLAT paging, of HLATP. (4-level paging does not use a PML5 table and
omits this step.) A PML5 table comprises 512 64-bit entries (PML5Es). A PML5E is selected using the physical
address defined as follows:
— Bits 51:12 are from CR3 or HLATP.
— Bits 11:3 are bits 56:48 of the linear address.
— Bits 2:0 are all 0.
Because a PML5E is identified using bits 56:48 of the linear address, it controls access to a 256-TByte region of
the linear-address space.
With HLAT paging, if bit 11 of the PML5E is 1, translation is restarted with ordinary paging with a maximum
page size of 256-TBytes (see Section 5.5.5). Otherwise, the translation process continues as described in the
next item.
• A 4-KByte naturally aligned PML4 table is located at the physical address specified in bits 51:12 of CR3 (for 4-
level paging; see Table 5-12) or in bits 51:12 of the PML5E (for 5-level paging; see Table 5-14). A PML4 table
comprises 512 64-bit entries (PML4Es). A PML4E is selected using the physical address defined as follows:
— Bits 51:12 are from CR3 or HLATP (for 4-level paging) or from the PML5E (for 5-level paging).
— Bits 11:3 are bits 47:39 of the linear address.
— Bits 2:0 are all 0.
Because a PML4E is identified using bits 47:39 of the linear address, it controls access to a 512-GByte region of
the linear-address space.
With HLAT paging, if bit 11 of the PML4E is 1, translation is restarted with ordinary paging with a maximum
page size of 512-GBytes (see Section 5.5.5). Otherwise, the translation process continues as described in the
next item.
• A 4-KByte naturally aligned page-directory-pointer table is located at the physical address specified in
bits 51:12 of the PML4E (see Table 5-15). A page-directory-pointer table comprises 512 64-bit entries
(PDPTEs). A PDPTE is selected using the physical address defined as follows:
— Bits 51:12 are from the PML4E.
— Bits 11:3 are bits 38:30 of the linear address.
— Bits 2:0 are all 0.
Because a PDPTE is identified using bits 47:30 of the linear address, it controls access to a 1-GByte region of the
linear-address space.
With HLAT paging, if bit 11 of the PDPTE is 1, translation is restarted with ordinary paging with a maximum page
size of 1-GByte (see Section 5.5.5). Otherwise, the translation process continues as described below.
Use of the PDPTE depends on its PS flag (bit 7):1
• If the PDPTE’s PS flag is 1, the PDPTE maps a 1-GByte page (see Table 5-16). The final physical address is
computed as follows:
— Bits 51:30 are from the PDPTE.
— Bits 29:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PDPTE (see Section 5.6.2).

1. The PS flag of a PDPTE is reserved and must be 0 (if the P flag is 1) if 1-GByte pages are not supported. See Section 5.1.4 for how
to determine whether 1-GByte pages are supported.


• If the PDPTE’s PS flag is 0, a 4-KByte naturally aligned page directory is located at the physical address
specified in bits 51:12 of the PDPTE (see Table 5-17). A page directory comprises 512 64-bit entries (PDEs). A
PDE is selected using the physical address defined as follows:
— Bits 51:12 are from the PDPTE.
— Bits 11:3 are bits 29:21 of the linear address.
— Bits 2:0 are all 0.
Because a PDE is identified using bits 47:21 of the linear address, it controls access to a 2-MByte region of the
linear-address space.
With HLAT paging, if bit 11 of the PDE is 1, translation is restarted with ordinary paging with a maximum page size
of 2-MBytes (see Section 5.5.5). Otherwise, the translation process continues as described below.
Use of the PDE depends on its PS flag:
• If the PDE's PS flag is 1, the PDE maps a 2-MByte page (see Table 5-18). The final physical address is computed
as follows:
— Bits 51:21 are from the PDE.
— Bits 20:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PDE (see Section 5.6.2).
• If the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical address specified in
bits 51:12 of the PDE (see Table 5-19). A page table comprises 512 64-bit entries (PTEs). A PTE is selected
using the physical address defined as follows:
— Bits 51:12 are from the PDE.
— Bits 11:3 are bits 20:12 of the linear address.
— Bits 2:0 are all 0.
Because a PTE is identified using bits 47:12 of the linear address, every PTE maps a 4-KByte page (see
Table 5-20).
With HLAT paging, if bit 11 of the PTE is 1, translation is restarted with ordinary paging with a maximum page
size of 4-KBytes (see Section 5.5.5). Otherwise, the final physical address is computed as follows:
— Bits 51:12 are from the PTE.
— Bits 11:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PTE (see Section 5.6.2).
If a paging-structure entry’s P flag (bit 0) is 0 or if the entry sets any reserved bit, the entry is used neither to refer-
ence another paging-structure entry nor to map a page. There is no translation for a linear address whose transla-
tion would use such a paging-structure entry; a reference to such a linear address causes a page-fault exception
(see Section 5.7).
The following bits in a paging-structure entry are reserved with 4-level paging and 5-level paging (assuming that
the entry’s P flag is 1):
• Bits 51:MAXPHYADDR are reserved in every paging-structure entry.
• The PS flag is reserved in a PML5E or a PML4E.
• If 1-GByte pages are not supported, the PS flag is reserved in a PDPTE.1
• If the PS flag in a PDPTE is 1, bits 29:13 of the entry are reserved.
• If the PS flag in a PDE is 1, bits 20:13 of the entry are reserved.
• If IA32_EFER.NXE = 0, the XD flag (bit 63) is reserved in every paging-structure entry.
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 5.6.

1. See Section 5.1.4 for how to determine whether 1-GByte pages are supported.
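
Because each level of the walk described above follows the same pattern (bits 51:12 of the previous entry locate a table, and a 9-bit field of the linear address, scaled by 8, selects the entry within it), an ordinary 4-level walk can be sketched as a loop. The sketch is illustrative only: read_phys64() is a hypothetical helper, and the canonicality, reserved-bit, access-rights, protection-key, and HLAT-restart handling described in this section are omitted.

    #include <stdint.h>

    extern uint64_t read_phys64(uint64_t pa);   /* hypothetical helper */

    #define PRESENT (1ULL << 0)
    #define PS      (1ULL << 7)
    #define BASE(e) ((e) & 0x000FFFFFFFFFF000ULL)   /* bits 51:12 */

    /* Ordinary 4-level walk; returns the physical address, or 0 as a stand-in
       for a page-fault exception. Levels: 0 = PML4E, 1 = PDPTE, 2 = PDE,
       3 = PTE. */
    static uint64_t walk_4level(uint64_t cr3, uint64_t la)
    {
        static const unsigned shift[4] = { 39, 30, 21, 12 };
        uint64_t entry = cr3;
        for (int level = 0; level < 4; level++) {
            uint64_t idx = (la >> shift[level]) & 0x1FF;        /* 9-bit index */
            entry = read_phys64(BASE(entry) | (idx << 3));
            if (!(entry & PRESENT))
                return 0;
            if ((level == 1 || level == 2) && (entry & PS)) {   /* 1-GByte or
                                                                   2-MByte page */
                uint64_t page_mask = (1ULL << shift[level]) - 1;
                return (BASE(entry) & ~page_mask) | (la & page_mask);
            }
        }
        return BASE(entry) | (la & 0xFFF);                      /* 4-KByte page */
    }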


Figure 5-11 gives a summary of the formats of CR3 and the 4-level and 5-level paging-structure entries. For the
paging structure entries, it identifies separately the format of entries that map pages, those that reference other
paging structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted
because they determine how a paging-structure entry is used.

Table 5-14. Format of a PML5 Entry (PML5E) that References a PML4 Table

Bit Contents
Position(s)

0 (P) Present; must be 1 to reference a PML4 table

1 (R/W) Read/write; if 0, writes may not be allowed to the 256-TByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 256-TByte region controlled by this entry (see
Section 5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the PML4 table referenced by this
entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the PML4 table referenced by this
entry (see Section 5.9.2)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored

7 (PS) Reserved (must be 0)

10:8 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

M–1:12 Physical address of 4-KByte aligned PML4 table referenced by this entry

51:M Reserved (must be 0)

62:52 Ignored

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 256-TByte region
controlled by this entry; see Section 5.6); otherwise, reserved (must be 0)

Table 5-15. Format of a PML4 Entry (PML4E) that References a Page-Directory-Pointer Table

Bit Contents
Position(s)

0 (P) Present; must be 1 to reference a page-directory-pointer table

1 (R/W) Read/write; if 0, writes may not be allowed to the 512-GByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 512-GByte region controlled by this entry (see
Section 5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page-directory-pointer table
referenced by this entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page-directory-pointer table
referenced by this entry (see Section 5.9.2)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored


7 (PS) Reserved (must be 0)

10:8 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

M–1:12 Physical address of 4-KByte aligned page-directory-pointer table referenced by this entry

51:M Reserved (must be 0)

62:52 Ignored

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 512-GByte region
controlled by this entry; see Section 5.6); otherwise, reserved (must be 0)

Table 5-16. Format of a Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page

Bit Contents
Position(s)

0 (P) Present; must be 1 to map a 1-GByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 1-GByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 1-GByte page referenced by this
entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 1-GByte page referenced by this
entry (see Section 5.9.2)

5 (A) Accessed; indicates whether software has accessed the 1-GByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 1-GByte page referenced by this entry (see Section 5.8)

7 (PS) Page size; must be 1 (otherwise, this entry references a page directory; see Table 5-17)

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

10:9 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

12 (PAT) Indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section
5.9.2)1

29:13 Reserved (must be 0)

(M–1):30 Physical address of the 1-GByte page referenced by this entry


51:M Reserved (must be 0)

58:52 Ignored

62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2); otherwise,
it is ignored and not used to control access rights.

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)

NOTES:
1. The PAT is supported on all processors that support 4-level paging.

Table 5-17. Format of a Page-Directory-Pointer-Table Entry (PDPTE) that References a Page Directory

Bit Contents
Position(s)

0 (P) Present; must be 1 to reference a page directory

1 (R/W) Read/write; if 0, writes may not be allowed to the 1-GByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte region controlled by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9.2)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored

7 (PS) Page size; must be 0 (otherwise, this entry maps a 1-GByte page; see Table 5-16)

10:8 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

(M–1):12 Physical address of 4-KByte aligned page directory referenced by this entry

51:M Reserved (must be 0)

62:52 Ignored

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)


Table 5-18. Format of a Page-Directory Entry that Maps a 2-MByte Page

Bit Contents
Position(s)

0 (P) Present; must be 1 to map a 2-MByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9.2)

5 (A) Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 5.8)

7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-19)

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

10:9 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

12 (PAT) Indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section
5.9.2)

20:13 Reserved (must be 0)

(M–1):21 Physical address of the 2-MByte page referenced by this entry

51:M Reserved (must be 0)

58:52 Ignored

62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2);
otherwise, it is ignored and not used to control access rights.

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)


Table 5-19. Format of a Page-Directory Entry that References a Page Table

Bit Contents
Position(s)

0 (P) Present; must be 1 to reference a page table

1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9.2)

5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)

6 Ignored

7 (PS) Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 5-18)

10:8 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

(M–1):12 Physical address of 4-KByte aligned page table referenced by this entry

51:M Reserved (must be 0)

62:52 Ignored

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)


Table 5-20. Format of a Page-Table Entry that Maps a 4-KByte Page

Bit Contents
Position(s)

0 (P) Present; must be 1 to map a 4-KByte page

1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)

2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)

3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by
this entry (see Section 5.9.2)

4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2)

5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)

6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)

7 (PAT) Indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 5.9.2)

8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise

10:9 Ignored

11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)

(M–1):12 Physical address of the 4-KByte page referenced by this entry

51:M Reserved (must be 0)

58:52 Ignored

62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2);
otherwise, it is ignored and not used to control access rights.

63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)


[Figure: bit-level layout of CR3 and of the 4-level and 5-level paging-structure entries (present and not-present PML5E, PML4E, PDPTE mapping a 1-GByte page, PDPTE referencing a page directory, PDE mapping a 2-MByte page, PDE referencing a page table, and PTE mapping a 4-KByte page), showing the address, flag, protection-key, ignored, and reserved fields in each format.]

Figure 5-11. Formats of CR3 and Paging-Structure Entries with 4-Level Paging and 5-Level Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. Reserved fields must be 0. On processors that support linear-address masking (see Section 4.4), bits 62:61 configure that feature and may be set to 1. Because linear-address masking is not a paging feature, those bits are not illustrated here.
3. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
4. Bit 11 is R (restart) only for HLAT paging; it is ignored for ordinary paging.
5. The protection key is used only if software has enabled the appropriate feature; see Section 5.6.2. It is ignored otherwise.


5.5.5 Restart of HLAT Paging


As noted in Section 5.5.1, HLAT paging may specify that a translation of a linear address must be restarted. Specif-
ically, this occurs when HLAT paging encounters a paging-structure entry that sets bit 11 (see Section 5.5.4).
When this occurs, translation of the linear address is restarted using ordinary paging (starting with a paging struc-
ture identified using CR3). The restarted translation proceeds just as if the HLAT feature were not enabled. The
entire linear address is translated again, including those portions that had been used by HLAT paging prior to the
restart.
The process of restarting HLAT paging (using ordinary paging) always specifies a maximum page size to be used
when a resulting translation is cached in the TLBs. This maximum page size depends on the level of the paging-
structure entry that restarts the translation by setting bit 11; details are given in Section 5.5.4. The page size of
the translation produced by the restarted process is never greater than this maximum page size. See Section
5.10.2.2 for more discussion.

5.6 ACCESS RIGHTS


There is a translation for a linear address if the process described in Section 5.3, Section 5.4.2, or Section 5.5
(depending upon the paging mode) completes and produces a physical address. Whether an access is permitted by
a translation is determined by the access rights specified by the paging-structure entries controlling the transla-
tion;1 paging-mode modifiers in CR0, CR4, and the IA32_EFER MSR; EFLAGS.AC; and the mode of the access.
Section 5.6.1 describes how the processor determines the access rights for each linear address. Section 5.6.2
provides additional information about how protection keys contribute to access-rights determination. (They do so
only with 4-level paging and 5-level paging, and only if CR4.PKE = 1 or CR4.PKS = 1.)

NOTE
If HLAT paging is restarted, permissions are determined only by the access rights specified by the
paging-structure entries that the subsequent ordinary paging used to translate the linear address.
The access rights specified by the entries used earlier by HLAT paging do not apply.

5.6.1 Determination of Access Rights


Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction
fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made
while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.
Some operations implicitly access system data structures with linear addresses; the resulting accesses to those
data structures are supervisor-mode accesses regardless of CPL. Examples of such accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table (LDT) to load a segment descriptor; accesses
to the interrupt descriptor table (IDT) when delivering an interrupt or exception; and accesses to the task-state
segment (TSS) as part of a task switch or change of CPL. All these accesses are called implicit supervisor-mode
accesses regardless of CPL. Other accesses made while CPL < 3 are called explicit supervisor-mode accesses.
Access rights are also controlled by the mode of a linear address as specified by the paging-structure entries
controlling the translation of the linear address. If the U/S flag (bit 2) is 0 in at least one of the paging-structure
entries, the address is a supervisor-mode address. Otherwise, the address is a user-mode address.
When the shadow-stack feature of control-flow enforcement technology (CET) is enabled, certain accesses to
linear addresses are considered shadow-stack accesses (see Section 18.2, “Shadow Stacks,” in the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 1). Like ordinary data accesses, each shadow-stack
access is defined as being either a user access or a supervisor access. In general, a shadow-stack access is a user
access if CPL = 3 and a supervisor access if CPL < 3. The WRUSS instruction is an exception; although it can be
executed only if CPL = 0, the processor treats its shadow-stack accesses as user accesses.

1. With PAE paging, the PDPTEs do not determine access rights.


Shadow-stack accesses are allowed only to shadow-stack addresses. A linear address is a shadow-stack
address if the following are true of the translation of the linear address: (1) the R/W flag (bit 1) is 0 and the dirty
flag (bit 6) is 1 in the paging-structure entry that maps the page containing the linear address; and (2) the R/W
flag is 1 in every other paging-structure entry controlling the translation of the linear address.
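
That definition can be restated as a check over the paging-structure entries used by a translation. The following sketch is illustrative (the names are hypothetical); entries[] holds the entries in walk order, with the entry that maps the page last.

    #include <stdbool.h>
    #include <stdint.h>

    #define RW_FLAG    (1ULL << 1)
    #define DIRTY_FLAG (1ULL << 6)

    /* A linear address is a shadow-stack address if R/W = 0 and D = 1 in the
       entry mapping the page, and R/W = 1 in every other entry. */
    static bool is_shadow_stack_address(const uint64_t *entries, int n)
    {
        for (int i = 0; i < n - 1; i++)
            if (!(entries[i] & RW_FLAG))
                return false;
        uint64_t leaf = entries[n - 1];
        return !(leaf & RW_FLAG) && (leaf & DIRTY_FLAG);
    }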
The following items detail how paging determines access rights (only the items noted explicitly apply to shadow-
stack accesses):

NOTE
Many of the items below refer to an address with a protection key for which read (or write) access
is permitted. Section 5.6.2 provides details on when a protection key will permit (or not permit) a
data access (read or write) to a linear address using that protection key.

• For supervisor-mode accesses:


— Data may be read (implicitly or explicitly) from any supervisor-mode address with a protection key for
which read access is permitted.
— Data reads from user-mode addresses.
Access rights depend on the value of CR4.SMAP:
• If CR4.SMAP = 0, data may be read from any user-mode address with a protection key for which read
access is permitted.
• If CR4.SMAP = 1, access rights depend on the value of EFLAGS.AC and whether the access is implicit or
explicit:
— If EFLAGS.AC = 1 and the access is explicit, data may be read from any user-mode address with a
protection key for which read access is permitted.
— If EFLAGS.AC = 0 or the access is implicit, data may not be read from any user-mode address.
— Data writes to supervisor-mode addresses.
Access rights depend on the value of CR0.WP:
• If CR0.WP = 0, data may be written to any supervisor-mode address with a protection key for which
write access is permitted.
• If CR0.WP = 1, data may be written to any supervisor-mode address with a translation for which the
R/W flag (bit 1) is 1 in every paging-structure entry controlling the translation and with a protection key
for which write access is permitted; data may not be written to any supervisor-mode address with a
translation for which the R/W flag is 0 in any paging-structure entry controlling the translation.
— Data writes to user-mode addresses.
Access rights depend on the value of CR0.WP:
• If CR0.WP = 0, access rights depend on the value of CR4.SMAP:
— If CR4.SMAP = 0, data may be written to any user-mode address with a protection key for which
write access is permitted.
— If CR4.SMAP = 1, access rights depend on the value of EFLAGS.AC and whether the access is
implicit or explicit:
• If EFLAGS.AC = 1 and the access is explicit, data may be written to any user-mode address
with a protection key for which write access is permitted.
• If EFLAGS.AC = 0 or the access is implicit, data may not be written to any user-mode address.
• If CR0.WP = 1, access rights depend on the value of CR4.SMAP:
— If CR4.SMAP = 0, data may be written to any user-mode address with a translation for which the
R/W flag is 1 in every paging-structure entry controlling the translation and with a protection key
for which write access is permitted; data may not be written to any user-mode address with a
translation for which the R/W flag is 0 in any paging-structure entry controlling the translation.
— If CR4.SMAP = 1, access rights depend on the value of EFLAGS.AC and whether the access is
implicit or explicit:


• If EFLAGS.AC = 1 and the access is explicit, data may be written to any user-mode address
with a translation for which the R/W flag is 1 in every paging-structure entry controlling the
translation and with a protection key for which write access is permitted; data may not be
written to any user-mode address with a translation for which the R/W flag is 0 in any paging-
structure entry controlling the translation.
• If EFLAGS.AC = 0 or the access is implicit, data may not be written to any user-mode address.
— Instruction fetches from supervisor-mode addresses.
• For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any supervisor-mode
address.
• For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any supervisor-
mode address with a translation for which the XD flag (bit 63) is 0 in every paging-structure entry
controlling the translation; instructions may not be fetched from any supervisor-mode address with a
translation for which the XD flag is 1 in any paging-structure entry controlling the translation.
— Instruction fetches from user-mode addresses.
Access rights depend on the values of CR4.SMEP:
• If CR4.SMEP = 0, access rights depend on the paging mode and the value of IA32_EFER.NXE:
— For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any user-mode
address.
— For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any user-
mode address with a translation for which the XD flag is 0 in every paging-structure entry
controlling the translation; instructions may not be fetched from any user-mode address with a
translation for which the XD flag is 1 in any paging-structure entry controlling the translation.
• If CR4.SMEP = 1, instructions may not be fetched from any user-mode address.
— Supervisor-mode shadow-stack accesses are allowed only to supervisor-mode shadow-stack addresses
(see above).
• For user-mode accesses:
— Data reads.
Access rights depend on the mode of the linear address:
• Data may be read from any user-mode address with a protection key for which read access is
permitted.
• Data may not be read from any supervisor-mode address.
— Data writes.
Access rights depend on the mode of the linear address:
• Data may be written to any user-mode address with a translation for which the R/W flag is 1 in every
paging-structure entry controlling the translation and with a protection key for which write access is
permitted.
• Data may not be written to any supervisor-mode address.
— Instruction fetches.
Access rights depend on the mode of the linear address, the paging mode, and the value of
IA32_EFER.NXE:
• For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any user-mode address.
• For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any user-mode
address with a translation for which the XD flag is 0 in every paging-structure entry controlling the
translation.
• Instructions may not be fetched from any supervisor-mode address.
— User-mode shadow-stack accesses made outside enclave mode are allowed only to user-mode shadow-
stack addresses (see above). User-mode shadow-stack accesses made in enclave mode are treated like
ordinary data accesses (see above).
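
As a worked illustration, the supervisor-mode data-write rules above, which combine CR0.WP, CR4.SMAP, and EFLAGS.AC, condense into a short predicate. The sketch elides protection keys (see Section 5.6.2); rw_all means the R/W flag is 1 in every paging-structure entry controlling the translation, and the names are hypothetical.

    #include <stdbool.h>

    /* May a supervisor-mode data write proceed? */
    static bool supervisor_write_allowed(bool user_addr, bool rw_all,
                                         bool cr0_wp, bool cr4_smap,
                                         bool eflags_ac, bool implicit)
    {
        if (user_addr && cr4_smap && (!eflags_ac || implicit))
            return false;       /* SMAP blocks the access outright          */
        if (cr0_wp)
            return rw_all;      /* write protection applies even at CPL < 3 */
        return true;            /* CR0.WP = 0: R/W flags are not checked    */
    }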


A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see
Section 5.10). These structures may include information about access rights. The processor may enforce access
rights based on the TLBs and paging-structure caches instead of on the paging structures in memory.
This fact implies that, if software modifies a paging-structure entry to change access rights, the processor might
not use that change for a subsequent access to an affected linear address (see Section 5.10.4.3). See Section
5.10.4.2 for how software can ensure that the processor uses the modified access rights.

5.6.2 Protection Keys


4-level paging and 5-level paging associate a 4-bit protection key with each linear address (the protection key
located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address; see Section
5.5). Two protection key features control accesses to linear addresses based on their protection keys:
• If CR4.PKE = 1, the PKRU register determines, for each protection key, whether user-mode addresses with that
protection key may be read or written.
• If CR4.PKS = 1, the IA32_PKRS MSR (MSR index 6E1H) determines, for each protection key, whether
supervisor-mode addresses with that protection key may be read or written.
32-bit paging and PAE paging do not associate linear addresses with protection keys. For the purposes of Section
5.6.1, reads and writes are implicitly permitted for all protection keys with either of those paging modes.
The PKRU register (protection-key rights for user pages) is a 32-bit register with the following format: for each i
(0 ≤ i ≤ 15), PKRU[2i] is the access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable bit for
protection key i (WDi). The IA32_PKRS MSR has the same format (bits 63:32 of the MSR are reserved and must be
zero).
Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and write PKRU. In addition, the
PKRU register is XSAVE-managed state and can thus be read and written by instructions in the XSAVE feature set.
See Chapter 13, “Managing State Using the XSAVE Feature Set,” of Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for more information about the XSAVE feature set.
Software can use the RDMSR and WRMSR instructions to read and write the IA32_PKRS MSR. Writes to the
IA32_PKRS MSR using WRMSR are not serializing. The IA32_PKRS MSR is not XSAVE-managed.
How a linear address’s protection key controls access to the address depends on the mode of the linear address:
• A linear address’s protection key controls only data accesses to the address. It does not in any way affect
instruction fetches from the address.
• If CR4.PKE = 0, the protection key of a user-mode address does not control data accesses to the address (for
the purposes of Section 5.6.1, reads and writes of user-mode addresses are implicitly permitted for all
protection keys).
If CR4.PKE = 1, use of the protection key i of a user-mode address depends on the value of the PKRU register:
— If ADi = 1, no data accesses are permitted.
— If WDi = 1, permission may be denied to certain data write accesses:
• User-mode write accesses are not permitted.
• Supervisor-mode write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0, WDi does not affect
supervisor-mode write accesses to user-mode addresses with protection key i.)
• If CR4.PKS = 0, the protection key of a supervisor-mode address does not control data accesses to the address
(for the purposes of Section 5.6.1, reads and writes of supervisor-mode addresses are implicitly permitted for
all protection keys).
If CR4.PKS = 1, use of the protection key i of a supervisor-mode address depends on the value of the
IA32_PKRS MSR:
— If ADi = 1, no data accesses are permitted.
— If WDi = 1, write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0, IA32_PKRS.WDi does not
affect write accesses to supervisor-mode addresses with protection key i.)
Protection keys apply to shadow-stack accesses just as they do to ordinary data accesses.
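
The AD/WD lookup described above can be sketched as follows. The sketch is illustrative: the same bit layout applies to both PKRU and the IA32_PKRS MSR, and user_access distinguishes user-mode from supervisor-mode accesses for the WDi rule.

    #include <stdbool.h>
    #include <stdint.h>

    /* ADi = bit 2i: if set, no data accesses are permitted for key i. */
    static bool pkey_read_allowed(uint32_t pkr, unsigned key)
    {
        return !((pkr >> (2 * key)) & 1);
    }

    /* WDi = bit 2i+1: if set, user-mode writes are denied, and
       supervisor-mode writes are denied when CR0.WP = 1. */
    static bool pkey_write_allowed(uint32_t pkr, unsigned key,
                                   bool cr0_wp, bool user_access)
    {
        if (!pkey_read_allowed(pkr, key))
            return false;
        bool wd = (pkr >> (2 * key + 1)) & 1;
        return !wd || !(user_access || cr0_wp);
    }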


5.7 PAGE-FAULT EXCEPTIONS


Accesses using linear addresses may cause page-fault exceptions (#PF; exception 14). An access to a linear
address may cause a page-fault exception for either of two reasons: (1) there is no translation for the linear
address; or (2) there is a translation for the linear address, but its access rights do not permit the access.
As noted in Section 5.3, Section 5.4.2, and Section 5.5, there is no translation for a linear address if the translation
process for that address would use a paging-structure entry in which the P flag (bit 0) is 0 or one that sets a
reserved bit.1 If there is a translation for a linear address, its access rights are determined as specified in Section
5.6.
When Intel® Software Guard Extensions (Intel® SGX) are enabled, the processor may deliver exception 14 for
reasons unrelated to paging. See Section 36.3, “Access-control Requirements,” and Section 36.20, “Enclave Page
Cache Map (EPCM),” in Chapter 36, “Enclave Access Control and Data Structures.” Such an exception is called an
SGX-induced page fault. The processor uses the error code to distinguish SGX-induced page faults from ordinary
page faults.
When a page fault occurs, the processor loads the CR2 register with the linear address that generated the excep-
tion. If linear-address masking had been in effect (Section 4.4), the address recorded reflects the result of that
masking and does not contain any masked metadata. If the page-fault exception occurred during execution of an
instruction in enclave mode (and not during delivery of an event incident to enclave mode), bits 11:0 of the address
are cleared.
Figure 5-12 illustrates the error code that the processor provides on delivery of a page-fault exception. The
following items explain how the bits in the error code describe the nature of the page-fault exception:

[Figure: the page-fault error code uses bit 0 (P), bit 1 (W/R), bit 2 (U/S), bit 3 (RSVD), bit 4 (I/D), bit 5 (PK), bit 6 (SS), bit 7 (HLAT), and bit 15 (SGX); the remaining bits are reserved. The flags have the following meanings:]

P     0  The fault was caused by a non-present page.
      1  The fault was caused by a page-level protection violation.

W/R   0  The access causing the fault was a read.
      1  The access causing the fault was a write.

U/S   0  A supervisor-mode access caused the fault.
      1  A user-mode access caused the fault.

RSVD  0  The fault was not caused by a reserved-bit violation.
      1  The fault was caused by a reserved bit set to 1 in some paging-structure entry.

I/D   0  The fault was not caused by an instruction fetch.
      1  The fault was caused by an instruction fetch.

PK    0  The fault was not caused by protection keys.
      1  There was a protection-key violation.

SS    0  The fault was not caused by a shadow-stack access.
      1  The fault was caused by a shadow-stack access.

HLAT  0  The fault occurred during ordinary paging or due to access rights.
      1  The fault occurred during HLAT paging.

SGX   0  The fault is not related to SGX.
      1  The fault resulted from violation of SGX-specific access-control requirements.

Figure 5-12. Page-Fault Error Code

1. If HLAT paging encounters a paging-structure entry that sets a reserved bit, there is no translation even if bit 11 of the entry
indicates a restart. In this case, there is a page fault and the translation is not restarted.


• P flag (bit 0).


This flag is 0 if there is no translation for the linear address because the P flag was 0 in one of the paging-
structure entries used to translate that address.
• W/R (bit 1).
If the access causing the page-fault exception was a write, this flag is 1; otherwise, it is 0. This flag describes
the access causing the page-fault exception, not the access rights specified by paging.
• U/S (bit 2).
If a user-mode access caused the page-fault exception, this flag is 1; it is 0 if a supervisor-mode access did so.
This flag describes the access causing the page-fault exception, not the access rights specified by paging. User-
mode and supervisor-mode accesses are defined in Section 5.6.
• RSVD flag (bit 3).
This flag is 1 if there is no translation for the linear address because a reserved bit was set in one of the paging-
structure entries used to translate that address. (Because reserved bits are not checked in a paging-structure
entry whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also set.1)
Bits reserved in the paging-structure entries are reserved for future functionality. Software developers should
be aware that such bits may be used in the future and that a paging-structure entry that causes a page-fault
exception on one processor might not do so in the future.
• I/D flag (bit 4).
This flag is 1 if (1) the access causing the page-fault exception was an instruction fetch; and (2) either
(a) CR4.SMEP = 1; or (b) both (i) CR4.PAE = 1 (either PAE paging, 4-level paging, or 5-level paging is in use);
and (ii) IA32_EFER.NXE = 1. Otherwise, the flag is 0. This flag describes the access causing the page-fault
exception, not the access rights specified by paging.
• PK flag (bit 5).
This flag is 1 only for data accesses and only with 4-level paging and 5-level paging. In these cases, the setting
depends on the mode of the address being accessed:
— For accesses to supervisor-mode addresses, the flag is set if (1) CR4.PKS = 1; (2) the linear address has
protection key i; and (3) the IA32_PKRS MSR (see Section 5.6.2) is such that either (a) ADi = 1; or (b) the
following all hold: (i) WDi = 1; (ii) the access is a write access; and (iii) either CR0.WP = 1 or the access
causing the page-fault exception was a user-mode access. (Note that this flag may be set on page faults
due to user-mode accesses to supervisor-mode addresses.)
— For accesses to user-mode addresses, the flag is set if (1) CR4.PKE = 1; (2) the linear address has
protection key i; and (3) the PKRU register (see Section 5.6.2) is such that either (a) ADi = 1; or (b) the
following all hold: (i) WDi = 1; (ii) the access is a write access; and (iii) either CR0.WP = 1 or the access
causing the page-fault exception was a user-mode access.
• SS (bit 6).
If the access causing the page-fault exception was a shadow-stack access (including shadow-stack accesses in
enclave mode), this flag is 1; otherwise, it is 0. This flag describes the access causing the page-fault exception,
not the access rights specified by paging.
• HLAT (bit 7).
This flag is 1 if there is no translation for the linear address using HLAT paging because, in one of the paging-
structure entries used to translate that address, either the P flag was 0 or a reserved bit was set. An error code
will set this flag only if it clears bit 0 or sets bit 3. This flag will not be set by a page fault resulting from a
violation of access rights, nor for one encountered during ordinary paging, including the case in which there has
been a restart of HLAT paging.
• SGX flag (bit 15).
This flag is 1 if the exception is unrelated to paging and resulted from violation of SGX-specific access-control
requirements. Because such a violation can occur only if there is no ordinary page fault, this flag is set only if
the P flag (bit 0) is 1 and the RSVD flag (bit 3) and the PK flag (bit 5) are both 0.
Page-fault exceptions occur only due to an attempt to use a linear address. Failures to load the PDPTE registers
with PAE paging (see Section 5.4.1) cause general-protection exceptions (#GP(0)) and not page-fault exceptions.
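As a concrete illustration, the following C fragment decodes an error code as delivered with a page-fault exception. It is a minimal sketch: the bit masks follow Figure 5-12, and how the error code reaches the handler is outside its scope.

    #include <stdint.h>
    #include <stdio.h>

    /* Bit positions from Figure 5-12. */
    #define PF_P    (1u << 0)   /* 0 = non-present page; 1 = protection violation */
    #define PF_WR   (1u << 1)   /* 1 = write access */
    #define PF_US   (1u << 2)   /* 1 = user-mode access */
    #define PF_RSVD (1u << 3)   /* 1 = reserved bit set in a paging-structure entry */
    #define PF_ID   (1u << 4)   /* 1 = instruction fetch */
    #define PF_PK   (1u << 5)   /* 1 = protection-key violation */
    #define PF_SS   (1u << 6)   /* 1 = shadow-stack access */
    #define PF_HLAT (1u << 7)   /* 1 = fault during HLAT paging */
    #define PF_SGX  (1u << 15)  /* 1 = SGX access-control violation */

    /* Print a short summary of a page-fault error code. */
    static void decode_pf_error(uint32_t err)
    {
        printf("#PF: %s-mode %s; %s\n",
               (err & PF_US) ? "user" : "supervisor",
               (err & PF_ID) ? "instruction fetch" :
               (err & PF_WR) ? "write" : "read",
               (err & PF_P)  ? "page-level protection violation"
                             : "page not present");
        if (err & PF_RSVD) puts("  reserved bit set in a paging-structure entry");
        if (err & PF_PK)   puts("  protection-key violation");
        if (err & PF_SS)   puts("  shadow-stack access");
        if (err & PF_HLAT) puts("  fault occurred during HLAT paging");
        if (err & PF_SGX)  puts("  SGX-specific access-control violation");
    }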

1. Some past processors had errata for some page faults that occur when there is no translation for the linear address because the P
flag was 0 in one of the paging-structure entries used to translate that address. Due to these errata, some such page faults pro-
duced error codes that cleared bit 0 (P flag) and set bit 3 (RSVD flag).


5.8 ACCESSED AND DIRTY FLAGS


For any paging-structure entry that is used during linear-address translation, bit 5 is the accessed flag.1 For
paging-structure entries that map a page (as opposed to referencing another paging structure), bit 6 is the dirty
flag. These flags are provided for use by memory-management software to manage the transfer of pages and
paging structures into and out of physical memory.
Whenever the processor uses a paging-structure entry as part of linear-address translation, it sets the accessed
flag in that entry (if it is not already set).
Whenever there is a write to a linear address, the processor sets the dirty flag (if it is not already set) in the paging-
structure entry that identifies the final physical address for the linear address (either a PTE or a paging-structure
entry in which the PS flag is 1).
The previous two paragraphs apply also to HLAT paging. If HLAT paging encounters a paging-structure entry that
sets bit 11, indicating a restart, the processor will set the accessed flag in that entry; it will not set the dirty flag
because an entry that indicates a restart does not identify the final physical address for the linear address being
translated.

NOTE
If software on one logical processor writes to a page while software on another logical processor
concurrently clears the R/W flag in the paging-structure entry that maps the page, execution on
some processors may result in the entry’s dirty flag being set (due to the write on the first logical
processor) and the entry’s R/W flag being clear (due to the update to the entry on the second
logical processor). This will never occur on a processor that supports control-flow enforcement
technology (CET). Specifically, a processor that supports CET will never set the dirty flag in a
paging-structure entry in which the R/W flag is clear.

Memory-management software may clear these flags when a page or a paging structure is initially loaded into
physical memory. These flags are “sticky,” meaning that, once set, the processor does not clear them; only soft-
ware can clear them.
A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see
Section 5.10). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the processor
might not set the corresponding bit in memory on a subsequent access using an affected linear address (see
Section 5.10.4.3). See Section 5.10.4.2 for how software can ensure that these bits are updated as desired.
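For example, memory-management software that clears an accessed flag to sample the working set must follow the clear with an invalidation so that a later access makes the processor set the flag again. The following is a minimal ring-0 sketch, assuming a hypothetical pointer pte to the paging-structure entry that maps the linear address va:

    #include <stdint.h>

    #define PTE_ACCESSED (1ull << 5)

    /* Invalidate the TLB entry and paging-structure-cache entries for one
     * linear address; INVLPG requires CPL 0. */
    static inline void invlpg(void *linear_addr)
    {
        __asm__ volatile ("invlpg (%0)" :: "r" (linear_addr) : "memory");
    }

    /* Clear the accessed flag of a mapping and invalidate the translation,
     * so a subsequent access causes the processor to set the flag again. */
    static void reset_accessed(volatile uint64_t *pte, void *va)
    {
        __atomic_fetch_and(pte, ~PTE_ACCESSED, __ATOMIC_SEQ_CST);
        invlpg(va);
    }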

NOTE
The accesses used by the processor to set these flags may or may not be exposed to the
processor’s self-modifying code detection logic. If the processor is executing code from the same
memory area that is being used for the paging structures, the setting of these flags may or may not
result in an immediate change to the executing code stream.

5.9 PAGING AND MEMORY TYPING


The memory type of a memory access refers to the type of caching used for that access. Chapter 13, “Memory
Cache Control,” provides many details regarding memory typing in the Intel-64 and IA-32 architectures. This
section describes how paging contributes to the determination of memory typing.
The way in which paging contributes to memory typing depends on whether the processor supports the Page
Attribute Table (PAT; see Section 13.12).2 Section 5.9.1 and Section 5.9.2 explain how paging contributes to
memory typing depending on whether the PAT is supported.

1. With PAE paging, the PDPTEs are not used during linear-address translation but only to load the PDPTE registers for some execu-
tions of the MOV CR instruction (see Section 5.4.1). For this reason, the PDPTEs do not contain accessed flags with PAE paging.
2. The PAT is supported on Pentium III and more recent processor families. See Section 5.1.4 for how to determine whether the PAT is
supported.


5.9.1 Paging and Memory Typing When the PAT is Not Supported (Pentium Pro and Pentium
II Processors)

NOTE
The PAT is supported on all processors that support 4-level paging or 5-level paging. Thus, this
section applies only to 32-bit paging and PAE paging.

If the PAT is not supported, paging contributes to memory typing in conjunction with the memory-type range regis-
ters (MTRRs) as specified in Table 13-6 in Section 13.5.2.1.
For any access to a physical address, the table combines the memory type specified for that physical address by the
MTRRs with a PCD value and a PWT value. The latter two values are determined as follows:
• For an access to a PDE with 32-bit paging, the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, the PCD and PWT values come from the relevant PDPTE register.
• For an access to a PTE, the PCD and PWT values come from the relevant PDE.
• For an access to the physical address that is the translation of a linear address, the PCD and PWT values come
from the relevant PTE (if the translation uses a 4-KByte page) or the relevant PDE (otherwise).
• With PAE paging, the UC memory type is used when loading the PDPTEs (see Section 5.4.1).

5.9.2 Paging and Memory Typing When the PAT is Supported (Pentium III and More Recent
Processor Families)
If the PAT is supported, paging contributes to memory typing in conjunction with the PAT and the memory-type
range registers (MTRRs) as specified in Table 13-7 in Section 13.5.2.2.
The PAT is a 64-bit MSR (IA32_PAT; MSR index 277H) comprising eight (8) 8-bit entries (entry i comprises
bits 8i+7:8i of the MSR).
For any access to a physical address, the table combines the memory type specified for that physical address by the
MTRRs with a memory type selected from the PAT. Table 13-11 in Section 13.12.3 specifies how a memory type is
selected from the PAT. Specifically, it comes from entry i of the PAT, where i is defined as follows:
• For an access to an entry in a paging structure whose address is in CR3 (e.g., the PML4 table with 4-level
paging):
— For 4-level paging or 5-level paging with CR4.PCIDE = 1, i = 0.
— Otherwise, i = 2*PCD+PWT, where the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, i = 2*PCD+PWT, where the PCD and PWT values come from the
relevant PDPTE register.
• For an access to a paging-structure entry X whose address is in another paging-structure entry Y, i =
2*PCD+PWT, where the PCD and PWT values come from Y.
• For an access to the physical address that is the translation of a linear address, i = 4*PAT+2*PCD+PWT, where
the PAT, PCD, and PWT values come from the relevant PTE (if the translation uses a 4-KByte page), the relevant
PDE (if the translation uses a 2-MByte page or a 4-MByte page), or the relevant PDPTE (if the translation uses
a 1-GByte page). (A short sketch of this computation follows this list.)
• With PAE paging, the WB memory type is used when loading the PDPTEs (see Section 5.4.1).1

1. Some older IA-32 processors used the UC memory type when loading the PDPTEs. Some processors may use the UC memory type if
CR0.CD = 1 or if the MTRRs are disabled. These behaviors are model-specific and not architectural.
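As an illustration of the index computation above, the following C sketch derives i for the final translation of a 4-KByte page and extracts the selected PAT entry. It is a hedged example: the flag positions shown are those of a PTE (for a paging-structure entry mapping a larger page, the PAT flag is bit 12 rather than bit 7), and reading the IA32_PAT MSR is assumed to happen elsewhere.

    #include <stdint.h>

    /* Page-level flag positions in a PTE mapping a 4-KByte page. */
    #define PTE_PWT (1ull << 3)
    #define PTE_PCD (1ull << 4)
    #define PTE_PAT (1ull << 7)

    /* i = 4*PAT + 2*PCD + PWT for the final translation. */
    static unsigned pat_index_for_pte(uint64_t pte)
    {
        return ((unsigned)!!(pte & PTE_PAT) << 2)
             | ((unsigned)!!(pte & PTE_PCD) << 1)
             |  (unsigned)!!(pte & PTE_PWT);
    }

    /* Entry i occupies bits 8i+7:8i of the IA32_PAT MSR (index 277H). */
    static uint8_t pat_entry(uint64_t ia32_pat, unsigned i)
    {
        return (uint8_t)(ia32_pat >> (8 * i));
    }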


5.9.3 Caching Paging-Related Information about Memory Typing


A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see
Section 5.10). These structures may include information about memory typing. The processor may use memory-
typing information from the TLBs and paging-structure caches instead of from the paging structures in memory.
This fact implies that, if software modifies a paging-structure entry to change the memory-typing bits, the
processor might not use that change for a subsequent translation using that entry or for access to an affected
linear address. See Section 5.10.4.2 for how software can ensure that the processor uses the modified memory
typing.

5.10 CACHING TRANSLATION INFORMATION


The Intel-64 and IA-32 architectures may accelerate the address-translation process by caching data from the
paging structures on the processor. Because the processor does not ensure that the data that it caches are always
consistent with the structures in memory, it is important for software developers to understand how and when the
processor may cache such data. They should also understand what actions software can take to remove cached
data that may be inconsistent and when it should do so. This section provides software developers information
about the relevant processor operation.
Section 5.10.1 introduces process-context identifiers (PCIDs), which a logical processor may use to distinguish
information cached for different linear-address spaces. Section 5.10.2 and Section 5.10.3 describe how the
processor may cache information in translation lookaside buffers (TLBs) and paging-structure caches, respectively.
Section 5.10.4 explains how software can remove inconsistent cached information by invalidating portions of the
TLBs and paging-structure caches. Section 5.10.5 describes special considerations for multiprocessor systems.

5.10.1 Process-Context Identifiers (PCIDs)


Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple
linear-address spaces. The processor may retain cached information when software switches to a different linear-
address space with a different PCID (e.g., by loading CR3; see Section 5.10.4.1 for details).
A PCID is a 12-bit identifier. Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17) of CR4. If CR4.PCIDE =
0, the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3.1 Not all proces-
sors allow CR4.PCIDE to be set to 1; see Section 5.1.4 for how to determine whether this is allowed.
The processor ensures that CR4.PCIDE can be 1 only in IA-32e mode (thus, 32-bit paging and PAE paging use only
PCID 000H). In addition, software can change CR4.PCIDE from 0 to 1 only if CR3[11:0] = 000H. These require-
ments are enforced by the following limitations on the MOV CR instruction:
• MOV to CR4 causes a general-protection exception (#GP) if it would change CR4.PCIDE from 0 to 1 and either
IA32_EFER.LMA = 0 or CR3[11:0] ≠ 000H.
• MOV to CR0 causes a general-protection exception if it would clear CR0.PG to 0 while CR4.PCIDE = 1.
When a logical processor creates entries in the TLBs (Section 5.10.2) and paging-structure caches (Section
5.10.3), it associates those entries with the current PCID. When using entries in the TLBs and paging-structure
caches to translate a linear address, a logical processor uses only those entries associated with the current PCID
(see Section 5.10.2.4 for an exception).
If CR4.PCIDE = 0, a logical processor does not cache information for any PCID other than 000H. This is because
(1) if CR4.PCIDE = 0, the logical processor will associate any newly cached information with the current PCID,
000H; and (2) if MOV to CR4 clears CR4.PCIDE, all cached information is invalidated (see Section 5.10.4.1).

1. Note that, while HLAT paging (Section 5.5.3) does not use CR3 to locate the first paging structure, it does use the PCID in CR3[11:0]
when CR4.PCIDE = 1.


NOTE
In revisions of this manual that were produced when no processors allowed CR4.PCIDE to be set to
1, Section 5.10, “Caching Translation Information,” discussed the caching of translation information
without any reference to PCIDs. While the section now refers to PCIDs in its specification of this
caching, this documentation change is not intended to imply any change to the behavior of
processors that do not allow CR4.PCIDE to be set to 1.
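As a sketch of the enabling requirements described in this section, ring-0 code running in IA-32e mode might proceed as follows. The CPUID feature bit shown (CPUID.01H:ECX.PCID[bit 17]) is architectural; the CR read/write helpers are assumptions of this example.

    #include <stdint.h>
    #include <stdbool.h>
    #include <cpuid.h>

    #define CR4_PCIDE (1ull << 17)

    static inline uint64_t read_cr3(void)
    { uint64_t v; __asm__ volatile ("mov %%cr3, %0" : "=r" (v)); return v; }
    static inline uint64_t read_cr4(void)
    { uint64_t v; __asm__ volatile ("mov %%cr4, %0" : "=r" (v)); return v; }
    static inline void write_cr4(uint64_t v)
    { __asm__ volatile ("mov %0, %%cr4" :: "r" (v) : "memory"); }

    /* Enable non-zero PCIDs. MOV to CR4 causes #GP if it would set
     * CR4.PCIDE while CR3[11:0] != 000H or IA32_EFER.LMA = 0, so the
     * caller must already be in IA-32e mode with PCID 000H current. */
    static bool enable_pcids(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 17)))
            return false;            /* PCIDs not supported */
        if (read_cr3() & 0xFFFull)
            return false;            /* current PCID must be 000H first */
        write_cr4(read_cr4() | CR4_PCIDE);
        return true;
    }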

5.10.2 Translation Lookaside Buffers (TLBs)


A processor may cache information about the translation of linear addresses in translation lookaside buffers (TLBs).
In general, TLBs contain entries that map page numbers to page frames; these terms are defined in Section
5.10.2.1. Section 5.10.2.2 describes how information may be cached in TLBs, and Section 5.10.2.3 gives details of
TLB usage. Section 5.10.2.4 explains the global-page feature, which allows software to indicate that certain trans-
lations should receive special treatment when cached in the TLBs.

5.10.2.1 Page Numbers, Page Frames, and Page Offsets


Section 5.3, Section 5.4.2, and Section 5.5 give details of how the different paging modes translate linear
addresses to physical addresses. Specifically, the upper bits of a linear address (called the page number) deter-
mine the upper bits of the physical address (called the page frame); the lower bits of the linear address (called the
page offset) determine the lower bits of the physical address. The boundary between the page number and the
page offset is determined by the page size (a small helper illustrating these boundaries follows this list). Specifically:
• 32-bit paging:
— If the translation does not use a PTE (because CR4.PSE = 1 and the PS flag is 1 in the PDE used), the page
size is 4 MBytes and the page number comprises bits 31:22 of the linear address.
— If the translation does use a PTE, the page size is 4 KBytes and the page number comprises bits 31:12 of
the linear address.
• PAE paging:
— If the translation does not use a PTE (because the PS flag is 1 in the PDE used), the page size is 2 MBytes
and the page number comprises bits 31:21 of the linear address.
— If the translation does use a PTE, the page size is 4 KBytes and the page number comprises bits 31:12 of
the linear address.
• 4-level paging and 5-level paging:
— If the translation does not use a PDE (because the PS flag is 1 in the PDPTE used), the page size is 1 GByte
and the page number comprises bits 47:30 of the linear address.
— If the translation does use a PDE but does not use a PTE (because the PS flag is 1 in the PDE used), the
page size is 2 MBytes and the page number comprises bits 47:21 of the linear address.
— If the translation does use a PTE, the page size is 4 KBytes and the page number comprises bits 47:12 of
the linear address.
— The page size identified by the preceding items may be reduced if there has been a restart of HLAT paging
(see Section 5.5.5). Restart of HLAT paging always specifies a maximum page size; this page size is
determined by the level of the paging-structure entry that caused the restart. The page size used by the
translation is the minimum of the maximum page size specified by the restart and the page size determined
by the restarted translation (as specified by the previous items).
For example, suppose that HLAT paging encounters a PDE that sets bit 11, indicating a restart. As a result,
the restart uses a maximum page size of 2 MBytes. Suppose that the restarted translation encounters a
PDPTE that sets bit 7, indicating a 1-GByte page. In this case, the translation produced will have a page size
of 2 MBytes (the smaller of the two sizes).
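The following small helper makes the boundaries above concrete for 4-level paging; the names and the 48-bit mask (only bits 47:0 participate in translation) are illustrative.

    #include <stdint.h>

    /* Page-offset widths for the page sizes used with 4-level paging. */
    enum { SHIFT_4K = 12, SHIFT_2M = 21, SHIFT_1G = 30 };

    /* Split a linear address into page number and page offset. */
    static void split_linear(uint64_t la, unsigned shift,
                             uint64_t *page_number, uint64_t *page_offset)
    {
        *page_number = (la & ((1ull << 48) - 1)) >> shift;
        *page_offset = la & ((1ull << shift) - 1);
    }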


5.10.2.2 Caching Translations in TLBs


The processor may accelerate the paging process by caching individual translations in translation lookaside
buffers (TLBs). Each entry in a TLB is an individual translation. Each translation is referenced by a page number.
It contains the following information from the paging-structure entries used to translate linear addresses with the
page number:
• The physical address corresponding to the page number (the page frame).
• The access rights from the paging-structure entries used to translate linear addresses with the page number
(see Section 5.6):
— The logical-AND of the R/W flags.
— The logical-AND of the U/S flags.
— The logical-OR of the XD flags (necessary only if IA32_EFER.NXE = 1).
— The protection key (only with 4-level paging and 5-level paging).
• Attributes from a paging-structure entry that identifies the final page frame for the page number (either a PTE
or a paging-structure entry in which the PS flag is 1):
— The dirty flag (see Section 5.8).
— The memory type (see Section 5.9).
(TLB entries may contain other information as well. A processor may implement multiple TLBs, and some of these
may be for special purposes, e.g., only for instruction fetches. Such special-purpose TLBs may not contain some of
this information if it is not necessary. For example, a TLB used only for instruction fetches need not contain infor-
mation about the R/W and dirty flags.)
As noted in Section 5.10.1, any TLB entries created by a logical processor are associated with the current PCID.
Processors need not implement any TLBs. Processors that do implement TLBs may invalidate any TLB entry at any
time. Software should not rely on the existence of TLBs or on the retention of TLB entries.
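The combination rules above can be modeled directly. The following sketch is purely illustrative (TLB entries have no architectural, software-visible format); it folds the per-level flags of a page walk into the rights that would be cached:

    #include <stdbool.h>

    /* One set of flags per paging-structure entry used by the walk. */
    struct walk_level { bool rw, us, xd; };

    struct tlb_rights { bool rw, us, xd; };

    static struct tlb_rights combine_rights(const struct walk_level *lvl, int n)
    {
        struct tlb_rights r = { .rw = true, .us = true, .xd = false };
        for (int i = 0; i < n; i++) {
            r.rw = r.rw && lvl[i].rw;   /* logical-AND of the R/W flags */
            r.us = r.us && lvl[i].us;   /* logical-AND of the U/S flags */
            r.xd = r.xd || lvl[i].xd;   /* logical-OR of the XD flags */
        }
        return r;
    }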

5.10.2.3 Details of TLB Use


Because the TLBs cache entries only for linear addresses with translations, there can be a TLB entry for a page
number only if the P flag is 1 and the reserved bits are 0 in each of the paging-structure entries used to translate
that page number. In addition, the processor does not cache a translation for a page number unless the accessed
flag is 1 in each of the paging-structure entries used during translation; before caching a translation, the processor
sets any of these accessed flags that is not already 1.
Subject to the limitations given in the previous paragraph, the processor may cache a translation for any linear
address, even if that address is not used to access memory. For example, the processor may cache translations
required for prefetches and for accesses that result from speculative execution that would never actually occur in
the executed code path.
If the page number of a linear address corresponds to a TLB entry associated with the current PCID, the processor
may use that TLB entry to determine the page frame, access rights, and other attributes for accesses to that linear
address. In this case, the processor may not actually consult the paging structures in memory. The processor may
retain a TLB entry unmodified even if software subsequently modifies the relevant paging-structure entries in
memory. See Section 5.10.4.2 for how software can ensure that the processor uses the modified paging-structure
entries.
If the paging structures specify a translation using a page larger than 4 KBytes, some processors may cache
multiple smaller-page TLB entries for that translation. Each such TLB entry would be associated with a page
number corresponding to the smaller page size (e.g., bits 47:12 of a linear address with 4-level paging), even
though part of that page number (e.g., bits 20:12) is part of the offset with respect to the page specified by the
paging structures. The upper bits of the physical address in such a TLB entry are derived from the physical address
in the PDE used to create the translation, while the lower bits come from the linear address of the access for which
the translation is created. There is no way for software to be aware that multiple translations for smaller pages
have been used for a large page. For example, an execution of INVLPG for a linear address on such a page invali-
dates any and all smaller-page TLB entries for the translation of any linear address on that page.


If software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
changes, the TLBs may subsequently contain multiple translations for the address range (one for each page size).
A reference to a linear address in the address range may use any of these translations. Which translation is used
may vary from one execution to another, and the choice may be implementation-specific.

5.10.2.4 Global Pages


The Intel-64 and IA-32 architectures also allow for global pages when the PGE flag (bit 7) is 1 in CR4. If the G flag
(bit 8) is 1 in a paging-structure entry that maps a page (either a PTE or a paging-structure entry in which the PS
flag is 1), any TLB entry cached for a linear address using that paging-structure entry is considered to be global.
Because the G flag is used only in paging-structure entries that map a page, and because information from such
entries is not cached in the paging-structure caches, the global-page feature does not affect the behavior of the
paging-structure caches.
A logical processor may use a global TLB entry to translate a linear address, even if the TLB entry is associated with
a PCID different from the current PCID.

5.10.3 Paging-Structure Caches


In addition to the TLBs, a processor may cache other information about the paging structures in memory.

5.10.3.1 Caches for Paging Structures


A processor may support any or all of the following paging-structure caches:
• PML5E cache (5-level paging only). Each PML5E-cache entry is referenced by a 9-bit value and is used for
linear addresses for which bits 56:48 have that value. The entry contains information from the PML5E used to
translate such linear addresses:
— The physical address from the PML5E (the address of the PML4 table).
— The value of the R/W flag of the PML5E.
— The value of the U/S flag of the PML5E.
— The value of the XD flag of the PML5E.
— The values of the PCD and PWT flags of the PML5E.
The following items detail how a processor may use the PML5E cache:
— If the processor has a PML5E-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E in memory).
— The processor does not create a PML5E-cache entry unless the P flag is 1 and all reserved bits are 0 in the
PML5E in memory.
— The processor does not create a PML5E-cache entry unless the accessed flag is 1 in the PML5E in memory;
before caching a translation, the processor sets the accessed flag if it is not already 1.
— The processor may create a PML5E-cache entry even if there are no translations for any linear address that
might use that entry (e.g., because the P flags are 0 in all entries in the referenced PML4 table).
— If the processor creates a PML5E-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML5E in memory.
• PML4E cache (4-level paging and 5-level paging only). The use of the PML4E cache depends on the paging
mode:
— For 4-level paging, each PML4E-cache entry is referenced by a 9-bit value and is used for linear addresses
for which bits 47:39 have that value.
— For 5-level paging, each PML4E-cache entry is referenced by an 18-bit value and is used for linear
addresses for which bits 56:39 have that value.
A PML4E-cache entry contains information from the PML5E and PML4E used to translate the relevant linear
addresses (for 4-level paging, the PML5E does not apply):


— The physical address from the PML4E (the address of the page-directory-pointer table).
— The logical-AND of the R/W flags in the PML5E and the PML4E.
— The logical-AND of the U/S flags in the PML5E and the PML4E.
— The logical-OR of the XD flags in the PML5E and the PML4E.
— The values of the PCD and PWT flags of the PML4E.
The following items detail how a processor may use the PML4E cache:
— If the processor has a PML4E-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E and PML4E in memory).
— The processor does not create a PML4E-cache entry unless the P flags are 1 and all reserved bits are 0 in
the PML5E and the PML4E in memory.
— The processor does not create a PML4E-cache entry unless the accessed flags are 1 in the PML5E and the
PML4E in memory; before caching a translation, the processor sets any accessed flags that are not already
1.
— The processor may create a PML4E-cache entry even if there are no translations for any linear address that
might use that entry (e.g., because the P flags are 0 in all entries in the referenced page-directory-pointer
table).
— If the processor creates a PML4E-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML4E in memory.
• PDPTE cache (4-level paging and 5-level paging only).1 The use of the PDPTE cache depends on the paging
mode:
— For 4-level paging, each PDPTE-cache entry is referenced by an 18-bit value and is used for linear
addresses for which bits 47:30 have that value.
— For 5-level paging, each PDPTE-cache entry is referenced by a 27-bit value and is used for linear addresses
for which bits 56:30 have that value.
A PDPTE-cache entry contains information from the PML5E, PML4E, and PDPTE used to translate the relevant linear
addresses (for 4-level paging, the PML5E does not apply):
— The physical address from the PDPTE (the address of the page directory). (No PDPTE-cache entry is created
for a PDPTE that maps a 1-GByte page.)
— The logical-AND of the R/W flags in the PML5E, PML4E, and PDPTE.
— The logical-AND of the U/S flags in the PML5E, PML4E, and PDPTE.
— The logical-OR of the XD flags in the PML5E, PML4E, and PDPTE.
— The values of the PCD and PWT flags of the PDPTE.
The following items detail how a processor may use the PDPTE cache:
— If the processor has a PDPTE-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E, PML4E, and PDPTE in memory).
— The processor does not create a PDPTE-cache entry unless the P flags are 1, the PS flags are 0, and the
reserved bits are 0 in the PML5E, PML4E, and PDPTE in memory.
— The processor does not create a PDPTE-cache entry unless the accessed flags are 1 in the PML5E, PML4E,
and PDPTE in memory; before caching a translation, the processor sets any accessed flags that are not
already 1.
— The processor may create a PDPTE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDPTE-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML5E, PML4E, or PDPTE in memory.

1. With PAE paging, the PDPTEs are stored in internal, non-architectural registers. The operation of these registers is described in Sec-
tion 5.4.1 and differs from that described here.


• PDE cache. The use of the PDE cache depends on the paging mode:
— For 32-bit paging, each PDE-cache entry is referenced by a 10-bit value and is used for linear addresses for
which bits 31:22 have that value.
— For PAE paging, each PDE-cache entry is referenced by an 11-bit value and is used for linear addresses for
which bits 31:21 have that value.
— For 4-level paging, each PDE-cache entry is referenced by a 27-bit value and is used for linear addresses for
which bits 47:21 have that value.
— For 5-level paging, each PDE-cache entry is referenced by a 36-bit value and is used for linear addresses for
which bits 56:21 have that value.
A PDE-cache entry contains information from the PML5E, PML4E, PDPTE, and PDE used to translate the relevant
linear addresses (for 32-bit paging and PAE paging, only the PDE applies; for 4-level paging, the PML5E does
not apply):
— The physical address from the PDE (the address of the page table). (No PDE-cache entry is created for a
PDE that maps a page.)
— The logical-AND of the R/W flags in the PML5E, PML4E, PDPTE, and PDE.
— The logical-AND of the U/S flags in the PML5E, PML4E, PDPTE, and PDE.
— The logical-OR of the XD flags in the PML5E, PML4E, PDPTE, and PDE.
— The values of the PCD and PWT flags of the PDE.
The following items detail how a processor may use the PDE cache (references below to PML5Es, PML4Es, and
PDPTEs apply only to 4-level paging and to 5-level paging, as appropriate):
— If the processor has a PDE-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E, PML4E, PDPTE, and PDE in memory).
— The processor does not create a PDE-cache entry unless the P flags are 1, the PS flags are 0, and the
reserved bits are 0 in the PML5E, PML4E, PDPTE, and PDE in memory.
— The processor does not create a PDE-cache entry unless the accessed flags are 1 in the PML5E, PML4E, PDPTE,
and PDE in memory; before caching a translation, the processor sets any accessed flags that are not
already 1.
— The processor may create a PDE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDE-cache entry, the processor may retain it unmodified even if software subse-
quently modifies the corresponding PML5E, PML4E, PDPTE, or PDE in memory.
Information from a paging-structure entry can be included in entries in the paging-structure caches for other
paging-structure entries referenced by the original entry. For example, if the R/W flag is 0 in a PML4E, then the R/W
flag will be 0 in any PDPTE-cache entry for a PDPTE from the page-directory-pointer table referenced by that
PML4E. This is because the R/W flag of each such PDPTE-cache entry is the logical-AND of the R/W flags in the
appropriate PML4E and PDPTE.
On processors that support HLAT paging (see Section 5.5.1), each entry in a paging-structure cache indicates
whether the entry was cached during ordinary paging or HLAT paging. When the processor commences linear-
address translation using ordinary paging (respectively, HLAT paging), it will use only entries that indicate that they
were cached during ordinary paging (respectively, HLAT paging).
Entries that were cached during HLAT paging also include the restart flag (bit 11) of the original paging-structure
entry. When the processor commences HLAT paging using such an entry, it immediately restarts (using ordinary
paging) if this cached restart flag is 1.
The paging-structure caches contain information only from paging-structure entries that reference other paging
structures (and not those that map pages). Because the G flag is not used in such paging-structure entries, the
global-page feature does not affect the behavior of the paging-structure caches.
The processor may create entries in paging-structure caches for translations required for prefetches and for
accesses that are a result of speculative execution that would never actually occur in the executed code path.


As noted in Section 5.10.1, any entries created in paging-structure caches by a logical processor are associated
with the current PCID.
A processor may or may not implement any of the paging-structure caches. Software should rely on neither their
presence nor their absence. The processor may invalidate entries in these caches at any time. Because the
processor may create the cache entries at the time of translation and not update them following subsequent modi-
fications to the paging structures in memory, software should take care to invalidate the cache entries appropri-
ately when causing such modifications. The invalidation of TLBs and the paging-structure caches is described in
Section 5.10.4.

5.10.3.2 Using the Paging-Structure Caches to Translate Linear Addresses


When a linear address is accessed, the processor uses a procedure such as the following to determine the physical
address to which it translates and whether the access should be allowed:
• If the processor finds a TLB entry that is for the page number of the linear address and that is associated with
the current PCID (or which is global), it may use the physical address, access rights, and other attributes from
that entry.
• If the processor does not find a relevant TLB entry, it may use the upper bits of the linear address to select an
entry from the PDE cache that is associated with the current PCID (Section 5.10.3.1 indicates which bits are
used in each paging mode). It can then use that entry to complete the translation process (locating a PTE, etc.)
as if it had traversed the PDE (and, for 4-level paging and 5-level paging, the PDPTE, PML4E, and PML5E, as
appropriate) corresponding to the PDE-cache entry.
• The following items apply when 4-level paging or 5-level paging is used:
— If the processor does not find a relevant TLB entry or PDE-cache entry, it may use the upper bits of the
linear address (for 4-level paging, bits 47:30; for 5-level paging, bits 56:30) to select an entry from the
PDPTE cache that is associated with the current PCID. It can then use that entry to complete the translation
process (locating a PDE, etc.) as if it had traversed the PDPTE, the PML4E, and (for 5-level paging) the
PML5E corresponding to the PDPTE-cache entry.
— If the processor does not find a relevant TLB entry, PDE-cache entry, or PDPTE-cache entry, it may use the
upper bits of the linear address (for 4-level paging, bits 47:39; for 5-level paging, bits 56:39) to select an
entry from the PML4E cache that is associated with the current PCID. It can then use that entry to complete
the translation process (locating a PDPTE, etc.) as if it had traversed the corresponding PML4E.
— With 5-level paging, if the processor does not find a relevant TLB entry, PDE-cache entry, PDPTE-cache
entry, or PML4E-cache entry, it may use bits 56:48 of the linear address to select an entry from the PML5E
cache that is associated with the current PCID. It can then use that entry to complete the translation
process (locating a PML4E, etc.) as if it had traversed the corresponding PML5E.
(Any of the above steps would be skipped if the processor does not support the cache in question.)
If the processor does not find a TLB or paging-structure-cache entry for the linear address, it uses the linear
address to traverse the entire paging-structure hierarchy, as described in Section 5.3, Section 5.4.2, and Section
5.5.

5.10.3.3 Multiple Cached Entries for a Single Paging-Structure Entry


The paging-structure caches and TLBs may contain multiple entries associated with a single PCID and with infor-
mation derived from a single paging-structure entry. The following items give some examples for 4-level paging:
• Suppose that two PML4Es contain the same physical address and thus reference the same page-directory-
pointer table. Any PDPTE in that table may result in two PDPTE-cache entries, each associated with a different
set of linear addresses. Specifically, suppose that the n1th and n2th entries in the PML4 table contain the same
physical address. This implies that the physical address in the mth PDPTE in the page-directory-pointer table
would appear in the PDPTE-cache entries associated with both p1 and p2, where (p1 » 9) = n1, (p2 » 9) = n2,
and (p1 & 1FFH) = (p2 & 1FFH) = m. This is because both PDPTE-cache entries use the same PDPTE, one
resulting from a reference from the n1th PML4E and one from the n2th PML4E.
• Suppose that the first PML4E (i.e., the one in position 0) contains the physical address X in CR3 (the physical
address of the PML4 table). This implies the following:


— Any PML4E-cache entry associated with linear addresses with 0 in bits 47:39 contains address X.
— Any PDPTE-cache entry associated with linear addresses with 0 in bits 47:30 contains address X. This is
because the translation for a linear address for which the value of bits 47:30 is 0 uses the value of
bits 47:39 (0) to locate a page-directory-pointer table at address X (the address of the PML4 table). It then
uses the value of bits 38:30 (also 0) to find address X again and to store that address in the PDPTE-cache
entry.
— Any PDE-cache entry associated with linear addresses with 0 in bits 47:21 contains address X for similar
reasons.
— Any TLB entry for page number 0 (associated with linear addresses with 0 in bits 47:12) translates to page
frame X » 12 for similar reasons.
The same PML4E contributes its address X to all these cache entries because the self-referencing nature of the
entry causes it to be used as a PML4E, a PDPTE, a PDE, and a PTE.
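The self-referencing configuration just described is the basis of the “recursive page table” technique used by some operating systems: if PML4 entry n contains the physical address of the PML4 table itself, the PTE for any mapped linear address becomes visible at a computable linear address. A sketch for 4-level paging (the slot number and helper name are illustrative):

    #include <stdint.h>

    /* Linear address at which the PTE mapping 'va' appears, given that
     * PML4 entry 'self_slot' points at the PML4 table itself. The walk
     * consumes the self-referencing slot once, so each paging-structure
     * index of 'va' shifts down one level. */
    static uint64_t recursive_pte_vaddr(uint64_t va, unsigned self_slot)
    {
        uint64_t addr = ((uint64_t)self_slot << 39)      /* PML4 index = self */
                      | ((va >> 9) & 0x7FFFFFFFF8ull);   /* indices shifted down */
        if (addr & (1ull << 47))                         /* canonical form */
            addr |= 0xFFFF000000000000ull;
        return addr;
    }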

5.10.4 Invalidation of TLBs and Paging-Structure Caches


As noted in Section 5.10.2 and Section 5.10.3, the processor may create entries in the TLBs and the paging-struc-
ture caches when linear addresses are translated, and it may retain these entries even after the paging structures
used to create them have been modified. To ensure that linear-address translation uses the modified paging struc-
tures, software should take action to invalidate any cached entries that may contain information that has since
been modified.

5.10.4.1 Operations that Invalidate TLBs and Paging-Structure Caches


The following instructions invalidate entries in the TLBs and the paging-structure caches:
• INVLPG. This instruction takes a single operand, which is a linear address. The instruction invalidates any TLB
entries that are for a page number corresponding to the linear address and that are associated with the current
PCID. It also invalidates any global TLB entries with that page number, regardless of PCID (see Section
5.10.2.4).1 INVLPG also invalidates all entries in all paging-structure caches associated with the current PCID,
regardless of the linear addresses to which they correspond.
• INVPCID. The operation of this instruction is based on instruction operands, called the INVPCID type and the
INVPCID descriptor. Four INVPCID types are currently defined:
— Individual-address. If the INVPCID type is 0, the logical processor invalidates mappings—except global
translations—associated with the PCID specified in the INVPCID descriptor and that would be used to
translate the linear address specified in the INVPCID descriptor.2 (The instruction may also invalidate global
translations, as well as mappings associated with other PCIDs and for other linear addresses.)
— Single-context. If the INVPCID type is 1, the logical processor invalidates all mappings—except global
translations—associated with the PCID specified in the INVPCID descriptor. (The instruction may also
invalidate global translations, as well as mappings associated with other PCIDs.)
— All-context, including globals. If the INVPCID type is 2, the logical processor invalidates
mappings—including global translations—associated with all PCIDs.
— All-context. If the INVPCID type is 3, the logical processor invalidates mappings—except global transla-
tions—associated with all PCIDs. (The instruction may also invalidate global translations.)
See Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A for details of the
INVPCID instruction. (A brief usage sketch follows this list.)
• MOV to CR0. The instruction invalidates all TLB entries (including global entries) and all entries in all paging-
structure caches (for all PCIDs) if it changes the value of CR0.PG from 1 to 0.
• MOV to CR3. The behavior of the instruction depends on the value of CR4.PCIDE:

1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 5.10.2.3), the instruction invalidates all of them.
2. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 5.10.2.3), the instruction invalidates all of them.


— If CR4.PCIDE = 0, the instruction invalidates all TLB entries associated with PCID 000H except those for
global pages. It also invalidates all entries in all paging-structure caches associated with PCID 000H.
— If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB
entries associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for
global pages. It also invalidates all entries in all paging-structure caches associated with that PCID. It is not
required to invalidate entries in the TLBs and paging-structure caches that are associated with other PCIDs.
— If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 1, the instruction is not required to
invalidate any TLB entries or entries in paging-structure caches.
• MOV to CR4. The behavior of the instruction depends on the bits being modified:
— The instruction invalidates all TLB entries (including global entries) and all entries in all paging-structure
caches (for all PCIDs) if (1) it changes the value of CR4.PGE;1 or (2) it changes the value of the CR4.PCIDE
from 1 to 0.
— The instruction invalidates all TLB entries and all entries in all paging-structure caches for the current PCID
if (1) it changes the value of CR4.PAE; or (2) it changes the value of CR4.SMEP from 0 to 1.
• Task switch. If a task switch changes the value of CR3, it invalidates all TLB entries associated with PCID 000H
except those for global pages. It also invalidates all entries in all paging-structure caches associated with PCID
000H.2
• VMX transitions. See Section 5.11.1.
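To make the INVPCID operands concrete, the following ring-0 sketch builds the 16-byte INVPCID descriptor (PCID in bits 11:0 of the first quadword, linear address in the second) and issues the instruction. Availability should first be confirmed via CPUID.(EAX=07H,ECX=0):EBX.INVPCID[bit 10]; the wrapper name is illustrative.

    #include <stdint.h>

    /* The four INVPCID types listed above. */
    enum { INVPCID_ADDR = 0, INVPCID_SINGLE_CTX = 1,
           INVPCID_ALL_INCL_GLOBAL = 2, INVPCID_ALL = 3 };

    struct invpcid_desc {
        uint64_t pcid;     /* bits 11:0 used; bits 63:12 must be 0 */
        uint64_t linear;   /* used only by the individual-address type */
    };

    static inline void invpcid(unsigned long type, uint16_t pcid, uint64_t la)
    {
        struct invpcid_desc d = { .pcid = pcid & 0xFFF, .linear = la };
        __asm__ volatile ("invpcid %1, %0" :: "r" (type), "m" (d) : "memory");
    }

For example, invpcid(INVPCID_SINGLE_CTX, 5, 0) invalidates mappings (except global translations) associated with PCID 5.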
The processor is always free to invalidate additional entries in the TLBs and paging-structure caches. The following
are some examples:
• INVLPG may invalidate TLB entries for pages other than the one corresponding to its linear-address operand. It
may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the current
PCID.
• INVPCID may invalidate TLB entries for pages other than the one corresponding to the specified linear address.
It may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the
specified PCID.
• MOV to CR0 may invalidate TLB entries even if CR0.PG is not changing. For example, this may occur if either
CR0.CD or CR0.NW is modified.
• MOV to CR3 may invalidate TLB entries for global pages. If CR4.PCIDE = 1 and bit 63 of the instruction’s source
operand is 0, it may invalidate TLB entries and entries in the paging-structure caches associated with PCIDs
other than the PCID it is establishing. It may invalidate entries if CR4.PCIDE = 1 and bit 63 of the instruction’s
source operand is 1.
• MOV to CR4 may invalidate TLB entries when changing CR4.PSE or when changing CR4.SMEP from 1 to 0.
• On a processor supporting Hyper-Threading Technology, invalidations performed on one logical processor may
invalidate entries in the TLBs and paging-structure caches used by other logical processors.
(Other instructions and operations may invalidate entries in the TLBs and the paging-structure caches, but the
instructions identified above are recommended.)
In addition to the instructions identified above, page faults invalidate entries in the TLBs and paging-structure
caches. In particular, a page-fault exception resulting from an attempt to use a linear address will invalidate any
TLB entries that are for a page number corresponding to that linear address and that are associated with the
current PCID. It also invalidates all entries in the paging-structure caches that would be used for that linear address
and that are associated with the current PCID.3 These invalidations ensure that the page-fault exception will not
recur (if the faulting instruction is re-executed) if it would not be caused by the contents of the paging structures

1. If CR4.PGE is changing from 0 to 1, there were no global TLB entries before the execution; if CR4.PGE is changing from 1 to 0, there
will be no global TLB entries after the execution.
2. Task switches do not occur in IA-32e mode and thus cannot occur with 4-level paging. Since CR4.PCIDE can be set only with 4-level
paging, task switches occur only with CR4.PCIDE = 0.
3. Unlike INVLPG, page faults need not invalidate all entries in the paging-structure caches, only those that would be used to translate
the faulting linear address.


in memory (and if, therefore, it resulted from cached entries that were not invalidated after the paging structures
were modified in memory).
As noted in Section 5.10.2, some processors may choose to cache multiple smaller-page TLB entries for a transla-
tion specified by the paging structures to use a page larger than 4 KBytes. There is no way for software to be aware
that multiple translations for smaller pages have been used for a large page. The INVLPG instruction and page
faults provide the same assurances that they provide when a single TLB entry is used: they invalidate all TLB
entries corresponding to the translation specified by the paging structures.

5.10.4.2 Recommended Invalidation


The following items provide some recommendations regarding when software should perform invalidations:
• If software modifies a paging-structure entry that maps a page (rather than referencing another paging
structure), it should execute INVLPG for any linear address with a page number whose translation uses that
paging-structure entry.1
(If the paging-structure entry may be used in the translation of different page numbers — see Section 5.10.3.3
— software should execute INVLPG for linear addresses with each of those page numbers; alternatively, it could
use MOV to CR3 or MOV to CR4.)
• If software modifies a paging-structure entry that references another paging structure, it may use one of the
following approaches depending upon the types and number of translations controlled by the modified entry:
— Execute INVLPG for linear addresses with each of the page numbers with translations that would use the
entry. However, if no page numbers that would use the entry have translations (e.g., because the P flags are
0 in all entries in the paging structure referenced by the modified entry), it remains necessary to execute
INVLPG at least once.
— Execute MOV to CR3 if the modified entry controls no global pages.
— Execute MOV to CR4 to modify CR4.PGE.
• If CR4.PCIDE = 1 and software modifies a paging-structure entry that does not map a page or in which the G
flag (bit 8) is 0, additional steps are required if the entry may be used for PCIDs other than the current one. Any
one of the following suffices:
— Execute MOV to CR4 to modify CR4.PGE, either immediately or before again using any of the affected
PCIDs. For example, software could use different (previously unused) PCIDs for the processes that used the
affected PCIDs.
— For each affected PCID, execute MOV to CR3 to make that PCID current (and to load the address of the
appropriate PML4 table). If the modified entry controls no global pages and bit 63 of the source operand to
MOV to CR3 was 0, no further steps are required. Otherwise, execute INVLPG for linear addresses with each
of the page numbers with translations that would use the entry; if no page numbers that would use the
entry have translations, execute INVLPG at least once.
• If software using PAE paging modifies a PDPTE, it should reload CR3 with the register’s current value to ensure
that the modified PDPTE is loaded into the corresponding PDPTE register (see Section 5.4.1).
• If the nature of the paging structures is such that a single entry may be used for multiple purposes (see Section
5.10.3.3), software should perform invalidations for all of these purposes. For example, if a single entry might
serve as both a PDE and PTE, it may be necessary to execute INVLPG with two (or more) linear addresses, one
that uses the entry as a PDE and one that uses it as a PTE. (Alternatively, software could use MOV to CR3 or
MOV to CR4.)
• As noted in Section 5.10.2, the TLBs may subsequently contain multiple translations for the address range if
software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
changes. A reference to a linear address in the address range may use any of these translations.
Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would
change, for any linear address, both the page size and either the page frame, access rights, or other attributes.
It can instead use the following algorithm: first clear the P flag in the relevant paging-structure entry (e.g.,

1. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes.


PDE); then invalidate any translations for the affected linear addresses (see above); and then modify the
relevant paging-structure entry to set the P flag and establish modified translation(s) for the new page size.
(This algorithm is sketched in code after this list.)
• Software should clear bit 63 of the source operand to a MOV to CR3 instruction that establishes a PCID that had
been used earlier for a different linear-address space (e.g., with a different value in bits 51:12 of CR3). This
ensures invalidation of any information that may have been cached for the previous linear-address space.
This assumes that both linear-address spaces use the same global pages and that it is thus not necessary to
invalidate any global TLB entries. If that is not the case, software should invalidate those entries by executing
MOV to CR4 to modify CR4.PGE.
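The page-size-change algorithm above can be sketched as follows for a single logical processor. The entry pointer and the single affected linear address are assumptions of the example; an entry whose old translations covered many page numbers needs an INVLPG per page number, and a multiprocessor system also needs the shootdown described in Section 5.10.5.

    #include <stdint.h>

    #define PTE_P (1ull << 0)

    static inline void invlpg(void *linear_addr)
    {
        __asm__ volatile ("invlpg (%0)" :: "r" (linear_addr) : "memory");
    }

    /* Replace a paging-structure entry in a way that may change the page
     * size: clear P first, invalidate, then install the new entry. */
    static void replace_entry(volatile uint64_t *entry, uint64_t new_value,
                              void *covered_la)
    {
        __atomic_store_n(entry, *entry & ~PTE_P, __ATOMIC_SEQ_CST);
        invlpg(covered_la);
        __atomic_store_n(entry, new_value, __ATOMIC_SEQ_CST);
    }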

5.10.4.3 Optional Invalidation


The following items describe cases in which software may choose not to invalidate and the potential consequences
of that choice:
• If a paging-structure entry is modified to change the P flag from 0 to 1, no invalidation is necessary. This is
because no TLB entry or paging-structure cache entry is created with information from a paging-structure
entry in which the P flag is 0.1
• If a paging-structure entry is modified to change the accessed flag from 0 to 1, no invalidation is necessary
(assuming that an invalidation was performed the last time the accessed flag was changed from 1 to 0). This is
because no TLB entry or paging-structure cache entry is created with information from a paging-structure
entry in which the accessed flag is 0.
• If a paging-structure entry is modified to change the R/W flag from 0 to 1, failure to perform an invalidation
may result in a “spurious” page-fault exception (e.g., in response to an attempted write access) but no other
adverse behavior. Such an exception will occur at most once for each affected linear address (see Section
5.10.4.1).
• If CR4.SMEP = 0 and a paging-structure entry is modified to change the U/S flag from 0 to 1, failure to perform
an invalidation may result in a “spurious” page-fault exception (e.g., in response to an attempted user-mode
access) but no other adverse behavior. Such an exception will occur at most once for each affected linear
address (see Section 5.10.4.1).
• If a paging-structure entry is modified to change the XD flag from 1 to 0, failure to perform an invalidation may
result in a “spurious” page-fault exception (e.g., in response to an attempted instruction fetch) but no other
adverse behavior. Such an exception will occur at most once for each affected linear address (see Section
5.10.4.1).
• If a paging-structure entry is modified to change the accessed flag from 1 to 0, failure to perform an invali-
dation may result in the processor not setting that bit in response to a subsequent access to a linear address
whose translation uses the entry. Software cannot interpret the bit being clear as an indication that such an
access has not occurred.
• If software modifies a paging-structure entry that identifies the final physical address for a linear address
(either a PTE or a paging-structure entry in which the PS flag is 1) to change the dirty flag from 1 to 0, failure
to perform an invalidation may result in the processor not setting that bit in response to a subsequent write to
a linear address whose translation uses the entry. Software cannot interpret the bit being clear as an indication
that such a write has not occurred.
• The read of a paging-structure entry in translating an address being used to fetch an instruction may appear to
execute before an earlier write to that paging-structure entry if there is no serializing instruction between the
write and the instruction fetch. Note that the invalidating instructions identified in Section 5.10.4.1 are all
serializing instructions.
• Section 5.10.3.3 describes situations in which a single paging-structure entry may contain information cached
in multiple entries in the paging-structure caches. Because all entries in these caches are invalidated by any
execution of INVLPG, it is not necessary to follow the modification of such a paging-structure entry by
executing INVLPG multiple times solely for the purpose of invalidating these multiple cached entries. (It may be
necessary to do so to invalidate multiple TLB entries.)

1. If it is also the case that no invalidation was performed the last time the P flag was changed from 1 to 0, the processor may use a
TLB entry or paging-structure cache entry that was created when the P flag had earlier been 1.


5.10.4.4 Delayed Invalidation


Required invalidations may be delayed under some circumstances. Software developers should understand that,
between the modification of a paging-structure entry and execution of the invalidation instruction recommended in
Section 5.10.4.2, the processor may use translations based on either the old value or the new value of the paging-
structure entry. The following items describe some of the potential consequences of delayed invalidation:
• If a paging-structure entry is modified to change the P flag from 1 to 0, an access to a linear address whose
translation is controlled by this entry may or may not cause a page-fault exception.
• If a paging-structure entry is modified to change the R/W flag from 0 to 1, write accesses to linear addresses
whose translation is controlled by this entry may or may not cause a page-fault exception.
• If a paging-structure entry is modified to change the U/S flag from 0 to 1, user-mode accesses to linear
addresses whose translation is controlled by this entry may or may not cause a page-fault exception.
• If a paging-structure entry is modified to change the XD flag from 1 to 0, instruction fetches from linear
addresses whose translation is controlled by this entry may or may not cause a page-fault exception.
As noted in Section 10.1.1, an x87 instruction or an SSE instruction that accesses data larger than a quadword may
be implemented using multiple memory accesses. If such an instruction stores to memory and invalidation has
been delayed, some of the accesses may complete (writing to memory) while another causes a page-fault excep-
tion.1 In this case, the effects of the completed accesses may be visible to software even though the overall instruc-
tion caused a fault.
In some cases, the consequences of delayed invalidation may not affect software adversely. For example, when
freeing a portion of the linear-address space (by marking paging-structure entries “not present”), invalidation
using INVLPG may be delayed if software does not re-allocate that portion of the linear-address space or the
memory that had been associated with it. However, because of speculative execution (or errant software), there may be accesses to the freed portion of the linear-address space before the invalidations occur. In this case, the following can happen (a code sketch follows this list):
• Reads can occur to the freed portion of the linear-address space. Therefore, invalidation should not be delayed
for an address range that has read side effects.
• The processor may retain entries in the TLBs and paging-structure caches for an extended period of time.
Software should not assume that the processor will not use entries associated with a linear address simply
because time has passed.
• As noted in Section 5.10.3.1, the processor may create an entry in a paging-structure cache even if there are
no translations for any linear address that might use that entry. Thus, if software has marked “not present” all
entries in a page table, the processor may subsequently create a PDE-cache entry for the PDE that references
that page table (assuming that the PDE itself is marked “present”).
• If software attempts to write to the freed portion of the linear-address space, the processor might not generate
a page fault. (Such an attempt would likely be the result of a software error.) For that reason, the page frames
previously associated with the freed portion of the linear-address space should not be reallocated for another
purpose until the appropriate invalidations have been performed.
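The ordering constraint in the last item can be made concrete. The following is a minimal sketch, not code defined by this manual: pte_for() (returns a pointer to the PTE mapping a linear address) and free_frame() (returns a frame to a physical allocator) are hypothetical helpers, and 4-KByte pages are assumed. The point is that freed frames are handed back only after INVLPG has executed for every freed page.

#include <stdint.h>

/* Hypothetical helpers (assumptions, not defined by this manual):
 * pte_for() returns a pointer to the PTE that maps linear address va;
 * free_frame() returns a 4-KByte page frame to the physical allocator. */
extern uint64_t *pte_for(void *va);
extern void free_frame(uint64_t phys);

#define PTE_P (1ULL << 0)           /* present flag */

static inline void invlpg(void *va)
{
    __asm__ volatile("invlpg (%0)" : : "r"(va) : "memory");
}

/* Unmap npages 4-KByte pages starting at va, then free their frames.
 * The frames are returned to the allocator only after all required
 * invalidations, so a stale TLB entry cannot be used to read or write
 * memory that has been reallocated for another purpose. */
void free_range(void *va, unsigned npages)
{
    uint64_t frames[64];            /* assume npages <= 64 for the sketch */

    for (unsigned i = 0; i < npages; i++) {
        uint64_t *pte = pte_for((char *)va + i * 4096);
        frames[i] = *pte & 0x000FFFFFFFFFF000ULL;  /* address field */
        *pte &= ~PTE_P;                            /* mark not present */
    }
    for (unsigned i = 0; i < npages; i++)
        invlpg((char *)va + i * 4096);             /* invalidate */
    for (unsigned i = 0; i < npages; i++)
        free_frame(frames[i]);                     /* now safe to reuse */
}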

5.10.5 Propagation of Paging-Structure Changes to Multiple Processors


As noted in Section 5.10.4, software that modifies a paging-structure entry may need to invalidate entries in the
TLBs and paging-structure caches that were derived from the modified entry before it was modified. In a system
containing more than one logical processor, software must account for the fact that there may be entries in the
TLBs and paging-structure caches of logical processors other than the one used to modify the paging-structure
entry. The process of propagating the changes to a paging-structure entry is commonly referred to as “TLB shoot-
down.”
TLB shootdown can be done using memory-based semaphores and/or interprocessor interrupts (IPI). The following items describe a simple but inefficient example of a TLB shootdown algorithm for processors supporting the Intel-64 and IA-32 architectures (a code sketch follows the list):

1. If the accesses are to different pages, this may occur even if invalidation has not been delayed.

1. Begin barrier: Stop all but one logical processor; that is, cause all but one to execute the HLT instruction or to
enter a spin loop.
2. Allow the active logical processor to change the necessary paging-structure entries.
3. Allow all logical processors to perform invalidations appropriate to the modifications to the paging-structure
entries.
4. Allow all logical processors to resume normal operation.
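A minimal C sketch of these four steps follows. It is illustrative only: ncpus() and send_ipi_all_but_self() are hypothetical platform hooks, the barrier is a simple spin, and the invalidation reloads CR3 (which does not flush global translations; a real kernel would use INVLPG, INVPCID, or CR4.PGE toggling as appropriate).

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical platform hooks (assumptions, not defined by this manual). */
extern int  ncpus(void);
extern void send_ipi_all_but_self(void (*fn)(void));

static atomic_int  parked;       /* processors waiting in the barrier   */
static atomic_bool modify_done;  /* set once the entries are modified   */
static atomic_int  invalidated;  /* processors that have invalidated    */

static inline void flush_tlb(void)
{
    /* Reloading CR3 invalidates TLB entries and paging-structure-cache
     * entries for the current address space (global translations excepted). */
    unsigned long cr3;
    __asm__ volatile("mov %%cr3, %0; mov %0, %%cr3"
                     : "=r"(cr3) : : "memory");
}

/* Steps 1, 3, and 4, run by every other logical processor via IPI. */
static void shootdown_ipi(void)
{
    atomic_fetch_add(&parked, 1);               /* step 1: enter barrier */
    while (!atomic_load(&modify_done))
        ;                                       /* spin until step 2 done */
    flush_tlb();                                /* step 3                */
    atomic_fetch_add(&invalidated, 1);          /* step 4: resume        */
}

/* Step 2, run by the initiating logical processor. */
void tlb_shootdown(void (*modify_entries)(void))
{
    atomic_store(&parked, 0);
    atomic_store(&modify_done, false);
    atomic_store(&invalidated, 0);

    send_ipi_all_but_self(shootdown_ipi);
    while (atomic_load(&parked) != ncpus() - 1)
        ;                                       /* wait for the barrier  */
    modify_entries();                           /* step 2                */
    flush_tlb();                                /* local invalidation    */
    atomic_store(&modify_done, true);
    while (atomic_load(&invalidated) != ncpus() - 1)
        ;                                       /* wait for step 3       */
}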
Alternative, performance-optimized, TLB shootdown algorithms may be developed; however, software developers
must take care to ensure that the following conditions are met:
• All logical processors that are using the paging structures that are being modified must participate and perform
appropriate invalidations after the modifications are made.
• If the modifications to the paging-structure entries are made before the barrier or if there is no barrier, the
operating system must ensure one of the following: (1) that the affected linear-address range is not used
between the time of modification and the time of invalidation; or (2) that it is prepared to deal with the conse-
quences of the affected linear-address range being used during that period. For example, if the operating
system does not allow pages being freed to be reallocated for another purpose until after the required invalida-
tions, writes to those pages by errant software will not unexpectedly modify memory that is in use.
• Software must be prepared to deal with reads, instruction fetches, and prefetch requests to the affected linear-
address range that are a result of speculative execution that would never actually occur in the executed code
path.
When multiple logical processors are using the same linear-address space at the same time, they must coordinate
before any request to modify the paging-structure entries that control that linear-address space. In these cases,
the barrier in the TLB shootdown routine may not be required. For example, when freeing a range of linear
addresses, some other mechanism can assure no logical processor is using that range before the request to free it
is made. In this case, a logical processor freeing the range can clear the P flags in the PTEs associated with the
range, free the physical page frames associated with the range, and then signal the other logical processors using
that linear-address space to perform the necessary invalidations. All the affected logical processors must complete
their invalidations before the linear-address range and the physical page frames previously associated with that
range can be reallocated.

5.11 INTERACTIONS WITH VIRTUAL-MACHINE EXTENSIONS (VMX)


The architecture for virtual-machine extensions (VMX) includes features that interact with paging. Section 5.11.1 discusses ways in which VMX-specific control transfers, called VMX transitions, affect paging. Section 5.11.2 gives an overview of VMX features specifically designed to support address translation.

5.11.1 VMX Transitions


The VMX architecture defines two control transfers called VM entries and VM exits; collectively, these are called
VMX transitions. VM entries and VM exits are described in detail in Chapter 27 and Chapter 28, respectively, in
the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C. The following items identify
paging-related details:
• VMX transitions modify the CR0 and CR4 registers and the IA32_EFER MSR concurrently. For this reason, they
allow transitions between paging modes that would not otherwise be possible:
— VM entries allow transitions from 4-level paging directly to either 32-bit paging or PAE paging.
— VM exits allow transitions from either 32-bit paging or PAE paging directly to 4-level paging or 5-level
paging.
• VMX transitions that result in PAE paging load the PDPTE registers (see Section 5.4.1) as follows:
— VM entries load the PDPTE registers either from the physical address being loaded into CR3 or from the
virtual-machine control structure (VMCS); see Section 28.3.2.4.
— VM exits load the PDPTE registers from the physical address being loaded into CR3; see Section 29.5.4.

• VMX transitions invalidate the TLBs and paging-structure caches based on certain control settings. See Section
28.3.2.5 and Section 29.5.5 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C.

5.11.2 VMX Support for Address Translation


Chapter 30, “VMX Support for Address Translation,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3C, describes two features of the virtual-machine extensions (VMX) that interact directly with
paging. These are virtual-processor identifiers (VPIDs) and the extended page table mechanism (EPT).
VPIDs provide a way for software to identify to the processor the address spaces for different “virtual processors.” The processor may use this identification to concurrently maintain information for multiple address spaces in its TLBs and paging-structure caches, even when non-zero PCIDs are not being used. See Section 30.1 for details.
When EPT is in use, the addresses in the paging-structures are not used as physical addresses to access memory
and memory-mapped I/O. Instead, they are treated as guest-physical addresses and are translated through a set
of EPT paging structures to produce physical addresses. EPT can also specify its own access rights and memory
typing; these are used in conjunction with those specified in this chapter. See Section 30.3 for more information.
Both VPIDs and EPT may change the way that a processor maintains information in its TLBs and paging-structure caches and the ways in which software can manage that information. Some of the behaviors documented in Section 5.10 may change. See Section 30.4 for details.

5.12 USING PAGING FOR VIRTUAL MEMORY


With paging, portions of the linear-address space need not be mapped to the physical-address space; data for the
unmapped addresses can be stored externally (e.g., on disk). This method of mapping the linear-address space is
referred to as virtual memory or demand-paged virtual memory.
Paging divides the linear address space into fixed-size pages that can be mapped into the physical-address space
and/or external storage. When a program (or task) references a linear address, the processor uses paging to trans-
late the linear address into a corresponding physical address if such an address is defined.
If the page containing the linear address is not currently mapped into the physical-address space, the processor
generates a page-fault exception as described in Section 5.7. The handler for page-fault exceptions typically directs
the operating system or executive to load data for the unmapped page from external storage into physical memory
(perhaps writing a different page from physical memory out to external storage in the process) and to map it using
paging (by updating the paging structures). When the page has been loaded into physical memory, a return from
the exception handler causes the instruction that generated the exception to be restarted.
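The flow just described can be sketched in C. The helpers pte_for(), alloc_frame(), and read_from_backing_store() are hypothetical, and the sketch handles only the simple not-present case with 4-KByte pages.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical OS services (assumptions, not defined by this manual). */
extern uint64_t *pte_for(uint64_t linear);
extern uint64_t  alloc_frame(void);   /* may page another frame out */
extern void      read_from_backing_store(uint64_t linear, uint64_t frame);

#define PTE_P (1ULL << 0)             /* present flag */

/* Called with the faulting linear address (reported in CR2). Returning
 * from the page-fault handler restarts the faulting instruction, which
 * now finds the page present. */
bool handle_demand_fault(uint64_t linear)
{
    uint64_t *pte = pte_for(linear);
    if (*pte & PTE_P)
        return false;                 /* not a not-present fault */

    uint64_t frame = alloc_frame();   /* perhaps writing a page out */
    read_from_backing_store(linear & ~0xFFFULL, frame);
    *pte = frame | PTE_P;             /* map it; set R/W, U/S as needed */
    /* No INVLPG is required here: the processor does not cache
     * translations created from a not-present entry. */
    return true;
}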
Paging differs from segmentation through its use of fixed-size pages. Unlike segments, which usually are the same
size as the code or data structures they hold, pages have a fixed size. If segmentation is the only form of address
translation used, a data structure present in physical memory will have all of its parts in memory. If paging is used,
a data structure can be partly in memory and partly in disk storage.

5.13 MAPPING SEGMENTS TO PAGES


The segmentation and paging mechanisms provide support for a wide variety of approaches to memory manage-
ment. When segmentation and paging are combined, segments can be mapped to pages in several ways. To imple-
ment a flat (unsegmented) addressing environment, for example, all the code, data, and stack modules can be
mapped to one or more large segments (up to 4-GBytes) that share same range of linear addresses (see Figure 3-2
in Section 3.2.2). Here, segments are essentially invisible to applications and the operating-system or executive. If
paging is used, the paging mechanism can map a single linear-address space (contained in a single segment) into
virtual memory. Alternatively, each program (or task) can have its own large linear-address space (contained in its
own segment), which is mapped into virtual memory through its own paging structures.
Segments can be smaller than the size of a page. If one of these segments is placed in a page which is not shared
with another segment, the extra memory is wasted. For example, a small data structure, such as a 1-Byte sema-
phore, occupies 4 KBytes if it is placed in a page by itself. If many semaphores are used, it is more efficient to pack
them into a single page.

The Intel-64 and IA-32 architectures do not enforce correspondence between the boundaries of pages and
segments. A page can contain the end of one segment and the beginning of another. Similarly, a segment can
contain the end of one page and the beginning of another.
Memory-management software may be simpler and more efficient if it enforces some alignment between page and
segment boundaries. For example, if a segment which can fit in one page is placed in two pages, there may be
twice as much paging overhead to support access to that segment.
One approach to combining paging and segmentation that simplifies memory-management software is to give
each segment its own page table, as shown in Figure 5-13. This convention gives the segment a single entry in the
page directory, and this entry provides the access control information for paging the entire segment.

[Figure omitted: an LDT whose segment descriptors select page-directory entries (PDEs); each PDE points to the page table for one segment, and that page table’s PTEs map the segment’s page frames.]

Figure 5-13. Memory Management Convention That Assigns a Page Table to Each Segment

15. Updates to Chapter 7, Volume 3A
Change bars and violet text show changes to Chapter 7 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Minor updates, mainly concerning the saving of CR2 by page faults.



CHAPTER 7
INTERRUPT AND EXCEPTION HANDLING

This chapter describes the interrupt and exception-handling mechanism when operating in protected mode on an Intel 64 or IA-32 processor. Most of the information provided here also applies to the interrupt and exception mechanisms used in real-address mode, virtual-8086 mode, and 64-bit mode.
Chapter 22, “8086 Emulation,” describes information specific to interrupt and exception mechanisms in real-
address and virtual-8086 mode. Section 7.14, “Exception and Interrupt Handling in 64-bit Mode,” describes infor-
mation specific to interrupt and exception mechanisms in IA-32e mode and 64-bit sub-mode.

7.1 INTERRUPT AND EXCEPTION OVERVIEW


Interrupts and exceptions are events that indicate that a condition exists somewhere in the system, the processor,
or within the currently executing program or task that requires the attention of a processor. They typically result in
a forced transfer of execution from the currently running program or task to a special software routine or task
called an interrupt handler or an exception handler. The action taken by a processor in response to an interrupt or
exception is referred to as servicing or handling the interrupt or exception.
Interrupts occur at random times during the execution of a program, in response to signals from hardware. System
hardware uses interrupts to handle events external to the processor, such as requests to service peripheral devices.
Software can also generate interrupts by executing the INT n instruction.
Exceptions occur when the processor detects an error condition while executing an instruction, such as division by
zero. The processor detects a variety of error conditions including protection violations, page faults, and internal
machine faults. The machine-check architecture of the Pentium 4, Intel Xeon, P6 family, and Pentium processors
also permits a machine-check exception to be generated when internal hardware errors and bus errors are
detected.
When an interrupt is received or an exception is detected, the currently running procedure or task is suspended
while the processor executes an interrupt or exception handler. When execution of the handler is complete, the
processor resumes execution of the interrupted procedure or task. The resumption of the interrupted procedure or
task happens without loss of program continuity, unless recovery from an exception was not possible or an inter-
rupt caused the currently running program to be terminated.
This chapter describes the processor’s interrupt and exception-handling mechanism, when operating in protected
mode. A description of the exceptions and the conditions that cause them to be generated is given at the end of this
chapter.

7.2 EXCEPTION AND INTERRUPT VECTORS


To aid in handling exceptions and interrupts, each architecturally defined exception and each interrupt condition
requiring special handling by the processor is assigned a unique identification number, called a vector number. The
processor uses the vector number assigned to an exception or interrupt as an index into the interrupt descriptor
table (IDT). The table provides the entry point to an exception or interrupt handler (see Section 7.10, “Interrupt
Descriptor Table (IDT)”).
The allowable range for vector numbers is 0 to 255. Vector numbers in the range 0 through 31 are reserved by the
Intel 64 and IA-32 architectures for architecture-defined exceptions and interrupts. Not all of the vector numbers
in this range have a currently defined function. The unassigned vector numbers in this range are reserved. Do not
use the reserved vector numbers.
Vector numbers in the range 32 to 255 are designated as user-defined interrupts and are not reserved by the Intel 64 and IA-32 architectures. These interrupts are generally assigned to external I/O devices to enable those devices
to send interrupts to the processor through one of the external hardware interrupt mechanisms (see Section 7.3,
“Sources of Interrupts”).

Table 7-1 shows vector number assignments for architecturally defined exceptions and for the NMI interrupt. This
table gives the exception type (see Section 7.5, “Exception Classifications”) and indicates whether an error code is
saved on the stack for the exception. The source of each predefined exception and the NMI interrupt is also given.

7.3 SOURCES OF INTERRUPTS


The processor receives interrupts from two sources:
• External (hardware generated) interrupts.
• Software-generated interrupts.

7.3.1 External Interrupts


External interrupts are received through pins on the processor or through the local APIC. The primary interrupt pins
on Pentium 4, Intel Xeon, P6 family, and Pentium processors are the LINT[1:0] pins, which are connected to the
local APIC (see Chapter 12, “Advanced Programmable Interrupt Controller (APIC)”). When the local APIC is
enabled, the LINT[1:0] pins can be programmed through the APIC’s local vector table (LVT) to be associated with
any of the processor’s exception or interrupt vectors.
When the local APIC is global/hardware disabled, these pins are configured as INTR and NMI pins, respectively.
Asserting the INTR pin signals the processor that an external interrupt has occurred. The processor reads from the
system bus the interrupt vector number provided by an external interrupt controller, such as an 8259A (see Section
7.2, “Exception and Interrupt Vectors”). Asserting the NMI pin signals a non-maskable interrupt (NMI), which is
assigned to interrupt vector 2.

Table 7-1. Protected-Mode Exceptions and Interrupts

Vector | Mnemonic | Description                                | Type       | Error Code | Source
0      | #DE      | Divide Error                               | Fault      | No         | DIV and IDIV instructions.
1      | #DB      | Debug Exception                            | Fault/Trap | No         | Instruction, data, and I/O breakpoints; single-step; and others.
2      | —        | NMI Interrupt                              | Interrupt  | No         | Nonmaskable external interrupt.
3      | #BP      | Breakpoint                                 | Trap       | No         | INT3 instruction.
4      | #OF      | Overflow                                   | Trap       | No         | INTO instruction.
5      | #BR      | BOUND Range Exceeded                       | Fault      | No         | BOUND instruction.
6      | #UD      | Invalid Opcode (Undefined Opcode)          | Fault      | No         | UD instruction or reserved opcode.
7      | #NM      | Device Not Available (No Math Coprocessor) | Fault      | No         | Floating-point or WAIT/FWAIT instruction.
8      | #DF      | Double Fault                               | Abort      | Yes (zero) | Any instruction that can generate an exception, an NMI, or an INTR.
9      | —        | Coprocessor Segment Overrun (reserved)     | Fault      | No         | Floating-point instruction.1
10     | #TS      | Invalid TSS                                | Fault      | Yes        | Task switch or TSS access.
11     | #NP      | Segment Not Present                        | Fault      | Yes        | Loading segment registers or accessing system segments.
12     | #SS      | Stack-Segment Fault                        | Fault      | Yes        | Stack operations and SS register loads.
13     | #GP      | General Protection                         | Fault      | Yes        | Any memory reference and other protection checks.
14     | #PF      | Page Fault                                 | Fault      | Yes        | Any memory reference.
15     | —        | (Intel reserved. Do not use.)              |            | No         |
16     | #MF      | x87 FPU Floating-Point Error (Math Fault)  | Fault      | No         | x87 FPU floating-point or WAIT/FWAIT instruction.
17     | #AC      | Alignment Check                            | Fault      | Yes (zero) | Any data reference in memory.2
18     | #MC      | Machine Check                              | Abort      | No         | Error codes (if any) and source are model dependent.3
19     | #XM      | SIMD Floating-Point Exception              | Fault      | No         | SSE/SSE2/SSE3 floating-point instructions.4
20     | #VE      | Virtualization Exception                   | Fault      | No         | EPT violations.5
21     | #CP      | Control Protection Exception               | Fault      | Yes        | RET, IRET, RSTORSSP, and SETSSBSY instructions can generate this exception. When CET indirect branch tracking is enabled, this exception can be generated due to a missing ENDBRANCH instruction at the target of an indirect call or jump.
22-31  | —        | Intel reserved. Do not use.                |            |            |
32-255 | —        | User Defined (Non-reserved) Interrupts     | Interrupt  |            | External interrupt or INT n instruction.

NOTES:
1. Processors after the Intel386 processor do not generate this exception.
2. This exception was introduced in the Intel486 processor.
3. This exception was introduced in the Pentium processor and enhanced in the P6 family processors.
4. This exception was introduced in the Pentium III processor.
5. This exception can occur only on processors that support the 1-setting of the “EPT-violation #VE” VM-execution control.

The processor’s local APIC is normally connected to a system-based I/O APIC. Here, external interrupts received at
the I/O APIC’s pins can be directed to the local APIC through the system bus (Pentium 4, Intel Core Duo, Intel Core
2, Intel Atom, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium processors). The I/O APIC
determines the vector number of the interrupt and sends this number to the local APIC. When a system contains
multiple processors, processors can also send interrupts to one another by means of the system bus (Pentium 4,
Intel Core Duo, Intel Core 2, Intel Atom, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium
processors).
The LINT[1:0] pins are not available on the Intel486 processor and earlier Pentium processors that do not contain
an on-chip local APIC. These processors have dedicated NMI and INTR pins. With these processors, external inter-
rupts are typically generated by a system-based interrupt controller (8259A), with the interrupts being signaled
through the INTR pin.
Note that several other pins on the processor can cause a processor interrupt to occur. However, these interrupts
are not handled by the interrupt and exception mechanism described in this chapter. These pins include the
RESET#, FLUSH#, STPCLK#, SMI#, R/S#, and INIT# pins. Whether they are included on a particular processor is
implementation dependent. Pin functions are described in the data books for the individual processors. The SMI#
pin is described in Chapter 33, “System Management Mode.”

7.3.2 Maskable Hardware Interrupts


Any external interrupt that is delivered to the processor by means of the INTR pin or through the local APIC is called
a maskable hardware interrupt. Maskable hardware interrupts that can be delivered through the INTR pin include

all IA-32 architecture defined interrupt vectors from 0 through 255; those that can be delivered through the local
APIC include interrupt vectors 16 through 255.
The IF flag in the EFLAGS register permits all maskable hardware interrupts to be masked as a group (see Section
7.8.1, “Masking Maskable Hardware Interrupts”). Note that when interrupts 0 through 15 are delivered through the
local APIC, the APIC indicates the receipt of an illegal vector.

7.3.3 Software-Generated Interrupts


The INT n instruction permits interrupts to be generated from within software by supplying an interrupt vector
number as an operand. For example, the INT 35 instruction forces an implicit call to the interrupt handler for inter-
rupt 35.
Any of the interrupt vectors from 0 to 255 can be used as a parameter in this instruction. If the processor’s
predefined NMI vector is used, however, the response of the processor will not be the same as it would be from an
NMI interrupt generated in the normal manner. If vector number 2 (the NMI vector) is used in this instruction, the
NMI interrupt handler is called, but the processor’s NMI-handling hardware is not activated.
Interrupts generated in software with the INT n instruction cannot be masked by the IF flag in the EFLAGS register.

7.4 SOURCES OF EXCEPTIONS


The processor receives exceptions from three sources:
• Processor-detected program-error exceptions.
• Software-generated exceptions.
• Machine-check exceptions.

7.4.1 Program-Error Exceptions


The processor generates one or more exceptions when it detects program errors during the execution of an application program or the operating system or executive. Intel 64 and IA-32 architectures define a vector number for
each processor-detectable exception. Exceptions are classified as faults, traps, and aborts (see Section 7.5,
“Exception Classifications”).

7.4.2 Software-Generated Exceptions


The INTO, INT1, INT3, and BOUND instructions permit exceptions to be generated in software. These instructions
allow checks for exception conditions to be performed at points in the instruction stream. For example, INT3 causes
a breakpoint exception to be generated.
The INT n instruction can be used to emulate exceptions in software; but there is a limitation.1 If INT n provides a
vector for one of the architecturally-defined exceptions, the processor generates an interrupt to the correct vector
(to access the exception handler) but does not push an error code on the stack. This is true even if the associated
hardware-generated exception normally produces an error code. The exception handler will still attempt to pop an
error code from the stack while handling the exception. Because no error code was pushed, the handler will pop off
and discard the EIP instead (in place of the missing error code). This sends the return to the wrong location.
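For context, handlers are commonly written against a uniform stack frame: an assembly entry stub for each vector that does not push an error code pushes a dummy one, so a single common routine can always expect the same layout. A handler reached by INT n for an error-code vector breaks exactly this layout, as described above. A minimal 32-bit sketch (GNU assembler syntax; common_handler is a hypothetical C routine):

/* Entry stubs establishing a uniform frame: [error code][vector]. */
void common_handler(void);            /* hypothetical C-level handler */

__asm__(
    "stub_divide_error:          \n"  /* vector 0: no error code      */
    "    pushl $0                \n"  /* push a dummy error code      */
    "    pushl $0                \n"  /* push the vector number       */
    "    jmp   common_entry      \n"
    "stub_general_protection:    \n"  /* vector 13: error code pushed */
    "    pushl $13               \n"  /* push the vector number only  */
    "    jmp   common_entry      \n"
    "common_entry:               \n"
    "    pusha                   \n"  /* save general registers       */
    "    call  common_handler    \n"
    "    popa                    \n"
    "    addl  $8, %esp          \n"  /* drop vector and error code   */
    "    iret                    \n"
);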

7.4.3 Machine-Check Exceptions


The P6 family and Pentium processors provide both internal and external machine-check mechanisms for checking
the operation of the internal chip hardware and bus transactions. These mechanisms are implementation dependent.

1. The INT n instruction has opcode CD followed by an immediate byte encoding the value of n. In contrast, INT1 has opcode F1 and
INT3 has opcode CC.

When a machine-check error is detected, the processor signals a machine-check exception (vector 18) and
returns an error code.
See Chapter 7, “Interrupt 18—Machine-Check Exception (#MC),” and Chapter 17, “Machine-Check Architecture,”
for more information about the machine-check mechanism.

7.5 EXCEPTION CLASSIFICATIONS


Exceptions are classified as faults, traps, or aborts depending on the way they are reported and whether the
instruction that caused the exception can be restarted without loss of program or task continuity.
• Faults — A fault is an exception that can generally be corrected and that, once corrected, allows the program
to be restarted with no loss of continuity. When a fault is reported, the processor restores the machine state to
the state prior to the beginning of execution of the faulting instruction. The return address (saved contents of
the CS and EIP registers) for the fault handler points to the faulting instruction, rather than to the instruction
following the faulting instruction.
• Traps — A trap is an exception that is reported immediately following the execution of the trapping instruction.
Traps allow execution of a program or task to be continued without loss of program continuity. The return
address for the trap handler points to the instruction to be executed after the trapping instruction.
• Aborts — An abort is an exception that does not always report the precise location of the instruction causing
the exception and does not allow a restart of the program or task that caused the exception. Aborts are used to
report severe errors, such as hardware errors and inconsistent or illegal values in system tables.

NOTE
One exception subset normally reported as a fault is not restartable. Such exceptions result in loss
of some processor state. For example, executing a POPAD instruction where the stack frame
crosses over the end of the stack segment causes a fault to be reported. In this situation, the
exception handler sees that the instruction pointer (CS:EIP) has been restored as if the POPAD
instruction had not been executed. However, internal processor state (the general-purpose
registers) will have been modified. Such cases are considered programming errors. An application
causing this class of exceptions should be terminated by the operating system.

7.6 PROGRAM OR TASK RESTART


To allow the restarting of a program or task following the handling of an exception or an interrupt, all exceptions (except aborts) are guaranteed to be reported on an instruction boundary. All interrupts are guaranteed to be taken on an instruction boundary.
For fault-class exceptions, the return instruction pointer (saved when the processor generates an exception) points
to the faulting instruction. So, when a program or task is restarted following the handling of a fault, the faulting
instruction is restarted (re-executed). Restarting the faulting instruction is commonly used to handle exceptions
that are generated when access to an operand is blocked. The most common example of this type of fault is a page-
fault exception (#PF) that occurs when a program or task references an operand located on a page that is not in
memory. When a page-fault exception occurs, the exception handler can load the page into memory and resume
execution of the program or task by restarting the faulting instruction. To ensure that the restart is handled trans-
parently to the currently executing program or task, the processor saves the necessary registers and stack pointers
to allow a restart to the state prior to the execution of the faulting instruction.
For trap-class exceptions, the return instruction pointer points to the instruction following the trapping instruction.
If a trap is detected during an instruction which transfers execution, the return instruction pointer reflects the
transfer. For example, if a trap is detected while executing a JMP instruction, the return instruction pointer points
to the destination of the JMP instruction, not to the next address past the JMP instruction. All trap exceptions allow
program or task restart with no loss of continuity. For example, the overflow exception is a trap exception. Here,
the return instruction pointer points to the instruction following the INTO instruction that tested EFLAGS.OF (over-
flow) flag. The trap handler for this exception resolves the overflow condition. Upon return from the trap handler,
program or task execution continues at the instruction following the INTO instruction.

The abort-class exceptions do not support reliable restarting of the program or task. Abort handlers are designed
to collect diagnostic information about the state of the processor when the abort exception occurred and then shut
down the application and system as gracefully as possible.
Interrupts rigorously support restarting of interrupted programs and tasks without loss of continuity. The return
instruction pointer saved for an interrupt points to the next instruction to be executed at the instruction boundary
where the processor took the interrupt. If the instruction just executed has a repeat prefix, the interrupt is taken
at the end of the current iteration with the registers set to execute the next iteration.
The ability of a P6 family processor to speculatively execute instructions does not affect the taking of interrupts by
the processor. Interrupts are taken at instruction boundaries located during the retirement phase of instruction
execution; so they are always taken in the “in-order” instruction stream. See Chapter 2, “Intel® 64 and IA-32
Architectures,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for more informa-
tion about the P6 family processors’ microarchitecture and its support for out-of-order instruction execution.
Note that the Pentium processor and earlier IA-32 processors also perform varying amounts of prefetching and
preliminary decoding. With these processors as well, exceptions and interrupts are not signaled until actual “in-
order” execution of the instructions. For a given code sample, the signaling of exceptions occurs uniformly when
the code is executed on any family of IA-32 processors (except where new exceptions or new opcodes have been
defined).

7.7 NONMASKABLE INTERRUPT (NMI)


The nonmaskable interrupt (NMI) can be generated in either of two ways:
• External hardware asserts the NMI pin.
• The processor receives a message on the system bus (Pentium 4, Intel Core Duo, Intel Core 2, Intel Atom, and
Intel Xeon processors) or the APIC serial bus (P6 family and Pentium processors) with a delivery mode NMI.
When the processor receives an NMI from either of these sources, the processor handles it immediately by calling
the NMI handler pointed to by interrupt vector number 2. The processor also invokes certain hardware conditions
to ensure that no other interrupts, including NMI interrupts, are received until the NMI handler has completed
executing (see Section 7.7.1, “Handling Multiple NMIs”).
Also, when an NMI is received from either of the above sources, it cannot be masked by the IF flag in the EFLAGS
register.
It is possible to issue a maskable hardware interrupt (through the INTR pin) to vector 2 to invoke the NMI interrupt
handler; however, this interrupt will not truly be an NMI interrupt. A true NMI interrupt that activates the
processor’s NMI-handling hardware can only be delivered through one of the mechanisms listed above.

7.7.1 Handling Multiple NMIs


While an NMI interrupt handler is executing, the processor blocks delivery of subsequent NMIs until the next execu-
tion of the IRET instruction. This blocking of NMIs prevents nested execution of the NMI handler. It is recommended
that the NMI interrupt handler be accessed through an interrupt gate to disable maskable hardware interrupts (see
Section 7.8.1, “Masking Maskable Hardware Interrupts”).
An execution of the IRET instruction unblocks NMIs even if the instruction causes a fault. For example, if the IRET
instruction executes with EFLAGS.VM = 1 and IOPL of less than 3, a general-protection exception is generated (see
Section 22.2.7, “Sensitive Instructions”). In such a case, NMIs are unmasked before the exception handler is
invoked.

7.8 ENABLING AND DISABLING INTERRUPTS


The processor inhibits the generation of some interrupts, depending on the state of the processor and of the IF and
RF flags in the EFLAGS register, as described in the following sections.

7.8.1 Masking Maskable Hardware Interrupts


The IF flag can disable the servicing of maskable hardware interrupts received on the processor’s INTR pin or
through the local APIC (see Section 7.3.2, “Maskable Hardware Interrupts”). When the IF flag is clear, the
processor inhibits interrupts delivered to the INTR pin or through the local APIC from generating an internal inter-
rupt request; when the IF flag is set, interrupts delivered to the INTR pin or through the local APIC are processed
as normal external interrupts.
The IF flag does not affect non-maskable interrupts (NMIs) delivered to the NMI pin or delivery mode NMI
messages delivered through the local APIC, nor does it affect processor generated exceptions. As with the other
flags in the EFLAGS register, the processor clears the IF flag in response to a hardware reset.
The fact that the group of maskable hardware interrupts includes the reserved interrupt and exception vectors 0 through 31 can potentially cause confusion. Architecturally, when the IF flag is set, an interrupt for any of the vectors from 0 through 31 can be delivered to the processor through the INTR pin and any of the vectors from 16 through 31 can be delivered through the local APIC. The processor will then generate an interrupt and call the
interrupt or exception handler pointed to by the vector number. So for example, it is possible to invoke the page-
fault handler through the INTR pin (by means of vector 14); however, this is not a true page-fault exception. It is
an interrupt. As with the INT n instruction (see Section 7.4.2, “Software-Generated Exceptions”), when an inter-
rupt is generated through the INTR pin to an exception vector, the processor does not push an error code on the
stack, so the exception handler may not operate correctly.
The IF flag can be set or cleared with the STI (set interrupt-enable flag) and CLI (clear interrupt-enable flag)
instructions, respectively. These instructions may be executed only if the CPL is equal to or less than the IOPL. A
general-protection exception (#GP) is generated if they are executed when the CPL is greater than the IOPL.1 If
IF = 0, maskable hardware interrupts remain inhibited on the instruction boundary following an execution of STI.2
The inhibition ends after delivery of another event (e.g., exception) or the execution of the next instruction.
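This one-instruction inhibition is what makes the classic idle sequence race-free; a sketch (GNU C inline assembly):

/* An interrupt that becomes pending between STI and HLT is delivered
 * only after HLT has begun executing, so it wakes the processor rather
 * than being serviced in the gap and leaving HLT waiting indefinitely. */
static inline void idle_with_interrupts_enabled(void)
{
    __asm__ volatile("sti; hlt" : : : "memory");
}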
The IF flag is also affected by the following operations:
• The PUSHF instruction stores all flags on the stack, where they can be examined and modified. The POPF
instruction can be used to load the modified flags back into the EFLAGS register.
• Task switches and the POPF and IRET instructions load the EFLAGS register; therefore, they can be used to
modify the setting of the IF flag.
• When an interrupt is handled through an interrupt gate, the IF flag is automatically cleared, which disables
maskable hardware interrupts. (If an interrupt is handled through a trap gate, the IF flag is not cleared.)
See the descriptions of the CLI, STI, PUSHF, POPF, and IRET instructions in Chapter 3, “Instruction Set Reference,
A-L,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Chapter 4, “Instruc-
tion Set Reference, M-U,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, for a
detailed description of the operations these instructions are allowed to perform on the IF flag.

7.8.2 Masking Instruction Breakpoints


The RF (resume) flag in the EFLAGS register controls the response of the processor to instruction-breakpoint condi-
tions (see the description of the RF flag in Section 2.3, “System Flags and Fields in the EFLAGS Register”).
When set, it prevents an instruction breakpoint from generating a debug exception (#DB); when clear, instruction
breakpoints will generate debug exceptions. The primary function of the RF flag is to prevent the processor from
going into a debug exception loop on an instruction-breakpoint. See Section 19.3.1.1, “Instruction-Breakpoint
Exception Condition,” for more information on the use of this flag.
As noted in Section 7.8.3, execution of the MOV or POP instruction to load the SS register suppresses any instruc-
tion breakpoint on the next instruction (just as if EFLAGS.RF were 1).

1. The effect of the IOPL on these instructions is modified slightly when the virtual mode extension is enabled by setting the VME flag
in control register CR4: see Section 22.3, “Interrupt and Exception Handling in Virtual-8086 Mode.” Behavior is also impacted by the
PVI flag: see Section 22.4, “Protected-Mode Virtual Interrupts.”
2. Nonmaskable interrupts and system-management interrupts may also be inhibited on the instruction boundary following such an
execution of STI.

7.8.3 Masking Exceptions and Interrupts When Switching Stacks


To switch to a different stack segment, software often uses a pair of instructions, for example:
MOV SS, AX
MOV ESP, StackTop
(Software might also use the POP instruction to load SS and ESP.)
If an interrupt or exception occurs after the new SS segment descriptor has been loaded but before the ESP register
has been loaded, these two parts of the logical address into the stack space are inconsistent for the duration of the
interrupt or exception handler (assuming that delivery of the interrupt or exception does not itself load a new stack
pointer).
To account for this situation, the processor prevents certain events from being delivered after execution of a MOV
to SS instruction or a POP to SS instruction. The following items provide details:
• Any instruction breakpoint on the next instruction is suppressed (as if EFLAGS.RF were 1).
• Any data breakpoint on the MOV to SS instruction or POP to SS instruction is inhibited until the instruction
boundary following the next instruction.
• Any single-step trap that would be delivered following the MOV to SS instruction or POP to SS instruction
(because EFLAGS.TF is 1) is suppressed.
• The suppression and inhibition ends after delivery of an exception or the execution of the next instruction.
• If a sequence of consecutive instructions each loads the SS register (using MOV or POP), only the first is
guaranteed to inhibit or suppress events in this way.
Intel recommends that software use the LSS instruction to load the SS register and ESP together. The problem
identified earlier does not apply to LSS, and the LSS instruction does not inhibit events as detailed above.
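A minimal sketch of the recommended approach (GNU C inline assembly, 32-bit). The far-pointer operand layout, offset followed by selector, is what LSS expects; note that a routine like this can be used only where nothing further is needed from the old stack, since the return path changes stacks:

#include <stdint.h>

/* Memory operand for LSS: 32-bit offset followed by 16-bit selector. */
struct __attribute__((packed)) far_ptr32 {
    uint32_t offset;     /* new ESP */
    uint16_t selector;   /* new SS  */
};

/* Load SS and ESP in one instruction; there is no window in which the
 * two halves of the logical stack address are inconsistent. */
static inline void switch_stack(uint16_t ss, uint32_t esp)
{
    struct far_ptr32 p = { .offset = esp, .selector = ss };
    __asm__ volatile("lss %0, %%esp" : : "m"(p) : "memory");
}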

7.9 PRIORITIZATION OF CONCURRENT EVENTS


If more than one event is pending at an instruction boundary (between execution of instructions), the processor
services them in a predictable order. Table 7-2 shows the priority among classes of event sources.

Table 7-2. Priority Among Concurrent Events

Priority    | Description
1 (Highest) | Hardware Reset and Machine Checks: RESET; Machine Check (#MC)
2           | Trap on Task Switch: T flag in TSS is set (#DB)
3           | External Hardware Interventions: FLUSH; STOPCLK; SMI; INIT
4           | Traps on the Previous Instruction: trap-class debug exceptions (#DB due to TF flag set or data/I-O breakpoint)
5           | Nonmaskable Interrupts (NMI)1
6           | Maskable Hardware Interrupts1
7           | Fault-class Debug Exceptions (#DB due to instruction breakpoint)
8           | Faults from Fetching Next Instruction: code-segment limit violation (#GP); code page fault (#PF)
9 (Lowest)  | Faults from Decoding the Next Instruction: control protection exception due to missing ENDBRANCH at target of an indirect call or jump (#CP); instruction length > 15 bytes (#GP); invalid opcode (#UD); coprocessor not available (#NM)

NOTE:
1. The Intel® 486 processor and earlier processors group nonmaskable and maskable interrupts in the same priority class.

The processor first services a pending event from the class which has the highest priority, transferring execution to
the first instruction of the handler. Lower priority exceptions are discarded; lower priority interrupts are held
pending. Discarded exceptions may be re-generated when the event handler returns execution to the point in the
program or task where the original event occurred. While the priority among the classes listed in Table 7-2 is
consistent across processor implementations, the priority of events within a class is implementation-dependent
and may vary from processor to processor.
Table 7-2 specifies the prioritization of events that may be pending at an instruction boundary. It does not specify
the prioritization of faults that arise during instruction execution or event delivery (these include #BR, #TS, #NP,
#SS, #GP, #PF, #AC, #MF, #XM, #VE, or #CP). It also does not apply to the events generated by the “Call to Inter-
rupt Procedure” instructions (INT n, INTO, INT3, and INT1), as these events are integral to the execution of those
instructions and do not occur between instructions.

7.10 INTERRUPT DESCRIPTOR TABLE (IDT)


The interrupt descriptor table (IDT) associates each exception or interrupt vector with a gate descriptor for the
procedure or task used to service the associated exception or interrupt. Like the GDT and LDTs, the IDT is an array
of 8-byte descriptors (in protected mode). Unlike the GDT, the first entry of the IDT may contain a descriptor. To
form an index into the IDT, the processor scales the exception or interrupt vector by eight (the number of bytes in
a gate descriptor). Because there are only 256 interrupt or exception vectors, the IDT need not contain more than
256 descriptors. It can contain fewer than 256 descriptors, because descriptors are required only for the interrupt
and exception vectors that may occur. All empty descriptor slots in the IDT should have the present flag for the
descriptor set to 0.
The base addresses of the IDT should be aligned on an 8-byte boundary to maximize performance of cache line
fills. The limit value is expressed in bytes and is added to the base address to get the address of the last valid byte.
A limit value of 0 results in exactly 1 valid byte. Because IDT entries are always eight bytes long, the limit should
always be one less than an integral multiple of eight (that is, 8N – 1).
The IDT may reside anywhere in the linear address space. As shown in Figure 7-1, the processor locates the IDT
using the IDTR register. This register holds both a 32-bit base address and 16-bit limit for the IDT.
The LIDT (load IDT register) and SIDT (store IDT register) instructions load and store the contents of the IDTR
register, respectively. The LIDT instruction loads the IDTR register with the base address and limit held in a
memory operand. This instruction can be executed only when the CPL is 0. It normally is used by the initialization
code of an operating system when creating an IDT. An operating system also may use it to change from one IDT to
another. The SIDT instruction copies the base and limit value stored in IDTR to memory. This instruction can be
executed at any privilege level.
If a vector references a descriptor beyond the limit of the IDT, a general-protection exception (#GP) is generated.
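A minimal sketch of creating and loading an IDT per the rules above (GNU C, 32-bit; the gate descriptors themselves are filled in elsewhere). The limit is sizeof(idt) − 1, i.e., 8N − 1:

#include <stdint.h>

/* Memory operand for LIDT/SIDT: 16-bit limit, then 32-bit base. */
struct __attribute__((packed)) idtr32 {
    uint16_t limit;   /* 8N - 1 for an IDT with N gate descriptors */
    uint32_t base;    /* linear address of the IDT                 */
};

/* 256 8-byte gates; 8-byte alignment helps cache-line fills. */
static uint64_t idt[256] __attribute__((aligned(8)));

static void load_idt(void)   /* must run at CPL 0 */
{
    struct idtr32 idtr = { sizeof(idt) - 1, (uint32_t)(uintptr_t)idt };
    __asm__ volatile("lidt %0" : : "m"(idtr));
}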

NOTE
Because interrupts are delivered to the processor core only once, an incorrectly configured IDT
could result in incomplete interrupt handling and/or the blocking of interrupt delivery.
IA-32 architecture rules need to be followed for setting up IDTR base/limit/access fields and each
field in the gate descriptors. The same rules apply to the Intel 64 architecture. This includes implicit
referencing of the destination code segment through the GDT or LDT and accessing the stack.

[Figure omitted: the IDTR register holds a 16-bit IDT limit (bits 15:0) and a 32-bit IDT base address (bits 47:16). The base address locates the IDT, an array of gates at offsets 0, 8, 16, …, (n−1)∗8 for interrupts #1 through #n.]

Figure 7-1. Relationship of the IDTR and IDT

7.11 IDT DESCRIPTORS


The IDT may contain any of three kinds of gate descriptors:
• Task-gate descriptor
• Interrupt-gate descriptor
• Trap-gate descriptor
Figure 7-2 shows the formats for the task-gate, interrupt-gate, and trap-gate descriptors. The format of a task
gate used in an IDT is the same as that of a task gate used in the GDT or an LDT (see Section 9.2.5, “Task-Gate
Descriptor”). The task gate contains the segment selector for a TSS for an exception and/or interrupt handler task.
Interrupt and trap gates are very similar to call gates (see Section 6.8.3, “Call Gates”). They contain a far pointer
(segment selector and offset) that the processor uses to transfer program execution to a handler procedure in an
exception- or interrupt-handler code segment. These gates differ in the way the processor handles the IF flag in the
EFLAGS register (see Section 7.12.1.3, “Flag Usage By Exception- or Interrupt-Handler Procedure”).

[Figure omitted: formats of the three 8-byte IDT gate descriptors. In each, the high doubleword (bytes 4–7) holds the P (segment-present) flag, the DPL (descriptor privilege level), and the type field: task gate 00101, interrupt gate 0D110, trap gate 0D111, where D is the gate size (1 = 32 bits; 0 = 16 bits). A task gate holds a TSS segment selector in the low doubleword. Interrupt and trap gates hold the destination code-segment selector and offset bits 15..0 in the low doubleword and offset bits 31..16 in the high doubleword; the offset gives the procedure entry point.]

Figure 7-2. IDT Gate Descriptors
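A C rendering of the 32-bit interrupt- and trap-gate layout may be helpful; this is a sketch matching Figure 7-2, with the type/attribute byte spelled out in the comments:

#include <stdint.h>

/* 8-byte interrupt/trap gate (32-bit), per Figure 7-2. */
struct __attribute__((packed)) idt_gate32 {
    uint16_t offset_lo;  /* handler offset, bits 15..0             */
    uint16_t selector;   /* destination code-segment selector      */
    uint8_t  zero;       /* bits 7..5 must be 0                    */
    uint8_t  type_attr;  /* P (bit 7), DPL (bits 6..5), 0, D, type */
    uint16_t offset_hi;  /* handler offset, bits 31..16            */
};

/* type_attr = 0x8E: present, DPL 0, 32-bit interrupt gate (0D110, D=1).
 * type_attr = 0x8F: present, DPL 0, 32-bit trap gate (0D111, D=1).     */
static struct idt_gate32 make_gate(uint32_t offset, uint16_t selector,
                                   uint8_t type_attr)
{
    return (struct idt_gate32){
        .offset_lo = (uint16_t)(offset & 0xFFFF),
        .selector  = selector,
        .zero      = 0,
        .type_attr = type_attr,
        .offset_hi = (uint16_t)(offset >> 16),
    };
}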

7.12 EXCEPTION AND INTERRUPT HANDLING


The processor handles calls to exception- and interrupt-handlers similar to the way it handles calls with a CALL
instruction to a procedure or a task. When responding to an exception or interrupt, the processor uses the excep-
tion or interrupt vector as an index to a descriptor in the IDT. If the index points to an interrupt gate or trap gate,
the processor calls the exception or interrupt handler in a manner similar to a CALL to a call gate (see Section
6.8.2, “Gate Descriptors,” through Section 6.8.6, “Returning from a Called Procedure”). If the index points to a task
gate, the processor executes a task switch to the exception- or interrupt-handler task in a manner similar to a CALL
to a task gate (see Section 9.3, “Task Switching”).

7.12.1 Exception- or Interrupt-Handler Procedures


An interrupt gate or trap gate references an exception- or interrupt-handler procedure that runs in the context of
the currently executing task (see Figure 7-3). The segment selector for the gate points to a segment descriptor for
an executable code segment in either the GDT or the current LDT. The offset field of the gate descriptor points to
the beginning of the exception- or interrupt-handling procedure.

[Figure omitted: the interrupt vector indexes an interrupt or trap gate in the IDT. The gate’s segment selector locates a segment descriptor in the GDT or current LDT; the descriptor’s base address plus the gate’s offset yields the entry point of the interrupt procedure in the destination code segment.]

Figure 7-3. Interrupt Procedure Call

When the processor performs a call to the exception- or interrupt-handler procedure:


• If the handler procedure is going to be executed at a numerically lower privilege level, a stack switch occurs.
When the stack switch occurs:
a. The segment selector and stack pointer for the stack to be used by the handler are obtained from the TSS
for the currently executing task. On this new stack, the processor pushes the stack segment selector and
stack pointer of the interrupted procedure.
b. The processor then saves the current state of the EFLAGS, CS, and EIP registers on the new stack (see
Figure 7-4).
c. If an exception causes an error code to be saved, it is pushed on the new stack after the EIP value.
• If the handler procedure is going to be executed at the same privilege level as the interrupted procedure:
a. The processor saves the current state of the EFLAGS, CS, and EIP registers on the current stack (see Figure
7-4).
b. If an exception causes an error code to be saved, it is pushed on the current stack after the EIP value.

[Figure omitted: with no privilege-level change, EFLAGS, CS, EIP, and (if applicable) the error code are pushed on the interrupted procedure’s stack, which the handler also uses. With a privilege-level change, the handler’s stack receives SS and ESP of the interrupted procedure, followed by EFLAGS, CS, EIP, and (if applicable) the error code.]

Figure 7-4. Stack Usage on Transfers to Interrupt and Exception-Handling Routines
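The frames in Figure 7-4 can be expressed as a C overlay; this is a sketch (32-bit), where the error-code slot exists only for the exceptions marked in Table 7-1 and the SS:ESP slots exist only when a stack switch occurred:

#include <stdint.h>

/* Handler's view of its stack at entry, lowest address first. */
struct __attribute__((packed)) intr_frame32 {
    uint32_t error_code;  /* present only for error-code exceptions */
    uint32_t eip;         /* return instruction pointer             */
    uint32_t cs;          /* selector in the low 16 bits            */
    uint32_t eflags;      /* saved EFLAGS                           */
    /* Present only on an inter-privilege transfer:                 */
    uint32_t esp;         /* interrupted procedure's stack pointer  */
    uint32_t ss;          /* interrupted procedure's stack segment  */
};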

To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
The IRET instruction is similar to the RET instruction except that it restores the saved flags into the EFLAGS
register. The IOPL field of the EFLAGS register is restored only if the CPL is 0. The IF flag is changed only if the CPL
is less than or equal to the IOPL. See Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 2A, for a description of the complete operation performed by the
IRET instruction.
If a stack switch occurred when calling the handler procedure, the IRET instruction switches back to the interrupted
procedure’s stack on the return.

7.12.1.1 Shadow Stack Usage on Transfers to Interrupt and Exception Handling Routines
When the processor performs a call to the exception- or interrupt-handler procedure:
• If the handler procedure is going to be executed at a numerically lower privilege level, a shadow stack switch
occurs. When the shadow stack switch occurs:
a. On a transfer from privilege level 3, if shadow stacks are enabled at privilege level 3 then the SSP is saved
to the IA32_PL3_SSP MSR.
b. If shadow stacks are enabled at the privilege level where the handler will execute, then the shadow stack for
the handler is obtained from one of the following MSRs based on the privilege level at which the handler
executes.
• IA32_PL2_SSP if handler executes at privilege level 2.
• IA32_PL1_SSP if handler executes at privilege level 1.
• IA32_PL0_SSP if handler executes at privilege level 0.
c. The SSP obtained is then verified to ensure it points to a valid supervisory shadow stack that is not currently
active by verifying a supervisor shadow stack token at the address pointed to by the SSP. The operations
performed to verify and acquire the supervisor shadow stack token by making it busy are as described in
Section 18.2.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
d. On this new shadow stack, the processor pushes the CS, LIP (CS.base + EIP), and SSP of the interrupted
procedure if the interrupted procedure was executing at privilege level less than 3; see Figure 7-5.1
• If the handler procedure is going to be executed at the same privilege level as the interrupted procedure and
shadow stacks are enabled at current privilege level:
a. The processor saves the current state of the CS, LIP (CS.base + EIP), and SSP registers on the current
shadow stack; see Figure 7-5.

1. If any of these pushes leads to an exception or a VM exit, the supervisor shadow-stack token remains busy.

[Figure omitted: with no privilege-level change, CS, LIP, and SSP are pushed on the interrupted procedure’s shadow stack, which the handler also uses. On a privilege-level change from level 3, the handler’s shadow stack holds only the supervisor shadow-stack token. On a privilege-level change from level 2 or 1, the handler’s shadow stack holds the supervisor shadow-stack token followed by CS, LIP, and SSP.]

Figure 7-5. Shadow Stack Usage on Transfers to Interrupt and Exception-Handling Routines

To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
When executing a return from an interrupt or exception handler from the same privilege level as the interrupted
procedure, the processor performs these actions to enforce return address protection:
• Restores the CS and EIP registers to their values prior to the interrupt or exception.

If shadow stack is enabled:


— Compares the values on shadow stack at address SSP+8 (the LIP) and SSP+16 (the CS) to the CS and
(CS.base + EIP) popped from the stack and causes a control protection exception (#CP(FAR-RET/IRET)) if
they do not match.
— Pops the top-of-stack value (the SSP prior to the interrupt or exception) from shadow stack into SSP
register.
When executing a return from an interrupt or exception handler from a different privilege level than the interrupted
procedure, the processor performs the actions below.
• If shadow stack is enabled at current privilege level:
— If SSP is not aligned to 8 bytes then causes a control protection exception (#CP(FAR-RET/IRET)).
— If privilege level of the procedure being returned to is less than 3 (returning to supervisor mode):
• Compares the values on shadow stack at address SSP+8 (the LIP) and SSP+16 (the CS) to the CS and
(CS.base + EIP) popped from the stack and causes a control protection exception (#CP(FAR-RET/IRET))
if they do not match.
• Temporarily saves the top-of-stack value (the SSP of the procedure being returned to) internally.
— If a busy supervisor shadow stack token is present at address SSP+24, then marks the token free using
operations described in Section 18.2.3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
— If the privilege level of the procedure being returned to is less than 3 (returning to supervisor mode),
restores the SSP register from the internally saved value.
— If the privilege level of the procedure being returned to is 3 (returning to user mode) and shadow stack is
enabled at privilege level 3, then restores the SSP register with value of IA32_PL3_SSP MSR.
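
As an illustration, the same-privilege IRET checks described above can be sketched in C-like pseudocode. This is
a descriptive sketch of the architectural behavior, not software an operating system would write; the helper names
(shadow_stack_read, control_protection_fault) are hypothetical.

    typedef unsigned long long u64;

    /* Hypothetical helpers modeling processor-internal operations. */
    extern u64  shadow_stack_read(u64 addr);                 /* 8-byte shadow-stack load */
    extern void control_protection_fault(const char *code);  /* raises #CP */

    /* Same-privilege IRET: verify the shadow-stack copy of the return
     * context, then pop the saved SSP. */
    u64 iret_same_privilege(u64 ssp, u64 cs, u64 cs_base, u64 eip)
    {
        u64 lip = cs_base + eip;                  /* return linear address from the stack */
        if (shadow_stack_read(ssp + 8)  != lip ||
            shadow_stack_read(ssp + 16) != cs)
            control_protection_fault("FAR-RET/IRET");
        return shadow_stack_read(ssp);            /* new SSP = saved top-of-stack value */
    }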

7.12.1.2 Protection of Exception- and Interrupt-Handler Procedures


The privilege-level protection for exception- and interrupt-handler procedures is similar to that used for ordinary
procedure calls when called through a call gate (see Section 6.8.4, “Accessing a Code Segment Through a Call
Gate”). The processor does not permit transfer of execution to an exception- or interrupt-handler procedure in a
less privileged code segment (numerically greater privilege level) than the CPL.
An attempt to violate this rule results in a general-protection exception (#GP). The protection mechanism for
exception- and interrupt-handler procedures is different in the following ways:
• Because interrupt and exception vectors have no RPL, the RPL is not checked on implicit calls to exception and
interrupt handlers.
• The processor checks the DPL of the interrupt or trap gate only if an exception or interrupt is generated with an
INT n, INT3, or INTO instruction.1 Here, the CPL must be less than or equal to the DPL of the gate. This
restriction prevents application programs or procedures running at privilege level 3 from using a software
interrupt to access critical exception handlers, such as the page-fault handler, providing that those handlers are
placed in more privileged code segments (numerically lower privilege level). For hardware-generated interrupts
and processor-detected exceptions, the processor ignores the DPL of interrupt and trap gates.
Because exceptions and interrupts generally do not occur at predictable times, these privilege rules effectively
impose restrictions on the privilege levels at which exception and interrupt- handling procedures can run. Either of
the following techniques can be used to avoid privilege-level violations.
• The exception or interrupt handler can be placed in a conforming code segment. This technique can be used for
handlers that only need to access data available on the stack (for example, divide error exceptions). If the
handler needs data from a data segment, the data segment needs to be accessible from privilege level 3, which
would make it unprotected.
• The handler can be placed in a nonconforming code segment with privilege level 0. This handler would always
run, regardless of the CPL that the interrupted program or task is running at.

1. This check is not performed by execution of the INT1 instruction (opcode F1); it would be performed by execution of INT 1 (opcode
CD 01).


7.12.1.3 Flag Usage By Exception- or Interrupt-Handler Procedure


When accessing an exception or interrupt handler through either an interrupt gate or a trap gate, the processor
clears the TF flag in the EFLAGS register after it saves the contents of the EFLAGS register on the stack. (On calls
to exception and interrupt handlers, the processor also clears the VM, RF, and NT flags in the EFLAGS register, after
they are saved on the stack.) Clearing the TF flag prevents instruction tracing from affecting interrupt response and
ensures that no single-step exception will be delivered after delivery to the handler. A subsequent IRET instruction
restores the TF (and VM, RF, and NT) flags to the values in the saved contents of the EFLAGS register on the stack.
The only difference between an interrupt gate and a trap gate is the way the processor handles the IF flag in the
EFLAGS register. When accessing an exception- or interrupt-handling procedure through an interrupt gate, the
processor clears the IF flag to prevent other interrupts from interfering with the current interrupt handler. A subse-
quent IRET instruction restores the IF flag to its value in the saved contents of the EFLAGS register on the stack.
Accessing a handler procedure through a trap gate does not affect the IF flag.

7.12.2 Interrupt Tasks


When an exception or interrupt handler is accessed through a task gate in the IDT, a task switch results. Handling
an exception or interrupt with a separate task offers several advantages:
• The entire context of the interrupted program or task is saved automatically.
• A new TSS permits the handler to use a new privilege level 0 stack when handling the exception or interrupt. If
an exception or interrupt occurs when the current privilege level 0 stack is corrupted, accessing the handler
through a task gate can prevent a system crash by providing the handler with a new privilege level 0 stack.
• The handler can be further isolated from other tasks by giving it a separate address space. This is done by
giving it a separate LDT.
The disadvantage of handling an interrupt with a separate task is that the amount of machine state that must be
saved on a task switch makes it slower than using an interrupt gate, resulting in increased interrupt latency.
A task gate in the IDT references a TSS descriptor in the GDT (see Figure 7-6). A switch to the handler task is
handled in the same manner as an ordinary task switch (see Section 9.3, “Task Switching”). The link back to the
interrupted task is stored in the previous task link field of the handler task’s TSS. If an exception caused an error
code to be generated, this error code is copied to the stack of the new task.
When exception- or interrupt-handler tasks are used in an operating system, there are actually two mechanisms
that can be used to dispatch tasks: the software scheduler (part of the operating system) and the hardware sched-
uler (part of the processor's interrupt mechanism). The software scheduler needs to accommodate interrupt tasks
that may be dispatched when interrupts are enabled.


NOTE
Because IA-32 architecture tasks are not re-entrant, an interrupt-handler task must disable
interrupts between the time it completes handling the interrupt and the time it executes the IRET
instruction. This action prevents another interrupt from occurring while the interrupt task’s TSS is
still marked busy, which would cause a general-protection (#GP) exception.

[Figure 7-6 shows the interrupt task switch: the interrupt vector selects a task gate in the IDT; the TSS selector in
the task gate references a TSS descriptor in the GDT, and the base address in that descriptor locates the TSS of the
interrupt-handling task.]

Figure 7-6. Interrupt Task Switch

7.13 ERROR CODE


When an exception condition is related to a specific segment selector or IDT vector, the processor pushes an error
code onto the stack of the exception handler (whether it is a procedure or task). The error code has the format
shown in Figure 7-7. The error code resembles a segment selector; however, instead of a TI flag and RPL field, the
error code contains 3 flags:
EXT External event (bit 0) — When set, indicates that the exception occurred during delivery of an
event external to the program, such as an interrupt or an earlier exception.1 The bit is cleared if the
exception occurred during delivery of a software interrupt (INT n, INT3, or INTO).
IDT Descriptor location (bit 1) — When set, indicates that the index portion of the error code refers
to a gate descriptor in the IDT; when clear, indicates that the index refers to a descriptor in the GDT
or the current LDT.
TI GDT/LDT (bit 2) — Only used when the IDT flag is clear. When set, the TI flag indicates that the
index portion of the error code refers to a segment or gate descriptor in the LDT; when clear, it indi-
cates that the index refers to a descriptor in the current GDT.

1. The bit is also set if the exception occurred during delivery of INT1.


[Figure 7-7 shows the error-code format: bit 0 is the EXT flag, bit 1 the IDT flag, bit 2 the TI flag, bits 15:3 hold
the segment selector index, and the upper bits are reserved.]

Figure 7-7. Error Code

The segment selector index field provides an index into the IDT, GDT, or current LDT to the segment or gate
selector being referenced by the error code. In some cases the error code is null (all bits are clear except possibly
EXT). A null error code indicates that the error was not caused by a reference to a specific segment or that a null
segment selector was referenced in an operation.
The format of the error code is different for page-fault exceptions (#PF). See the “Interrupt 14—Page-Fault Excep-
tion (#PF)” section in this chapter.
The format of the error code is different for control protection exceptions (#CP). See the “Interrupt 21—Control
Protection Exception (#CP)” section in this chapter.
The error code is pushed on the stack as a doubleword or word (depending on the default interrupt, trap, or task
gate size). To keep the stack aligned for doubleword pushes, the upper half of the error code is reserved. Note that
the error code is not popped when the IRET instruction is executed to return from an exception handler, so the
handler must remove the error code before executing a return.
Error codes are not pushed on the stack for exceptions that are generated externally (with the INTR or LINT[1:0]
pins) or the INT n instruction, even if an error code is normally produced for those exceptions.
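
For software that inspects these error codes, the fields can be extracted with simple masking. The sketch below is
an illustrative helper, not part of any defined API.

    #include <stdbool.h>
    #include <stdint.h>

    /* Decode the error-code fields shown in Figure 7-7. */
    struct err_code {
        bool     ext;    /* bit 0: set for delivery of external events (and INT1) */
        bool     idt;    /* bit 1: index refers to a gate descriptor in the IDT    */
        bool     ti;     /* bit 2: index refers to the LDT (used only if idt == 0) */
        uint16_t index;  /* bits 15:3: segment selector index                      */
    };

    static struct err_code decode_error_code(uint32_t ec)
    {
        struct err_code d;
        d.ext   = (ec & 0x1) != 0;
        d.idt   = (ec & 0x2) != 0;
        d.ti    = (ec & 0x4) != 0;
        d.index = (uint16_t)((ec >> 3) & 0x1FFF);  /* 13-bit index, as in a selector */
        return d;
    }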

7.14 EXCEPTION AND INTERRUPT HANDLING IN 64-BIT MODE


In 64-bit mode, interrupt and exception handling is similar to what has been described for non-64-bit modes. The
following are the exceptions:
• All interrupt handlers pointed to by the IDT are in 64-bit code (this does not apply to the SMI handler).
• The size of interrupt-stack pushes is fixed at 64 bits, and the processor uses 8-byte, zero-extended stores.
• The stack pointer (SS:RSP) is pushed unconditionally on interrupts. In legacy modes, this push is conditional
and based on a change in current privilege level (CPL).
• The new SS is set to NULL if there is a change in CPL.
• IRET behavior changes.
• There is a new interrupt stack-switch mechanism and a new interrupt shadow stack-switch mechanism.
• The alignment of interrupt stack frame is different.

7.14.1 64-Bit Mode IDT


Interrupt and trap gates are 16 bytes in length to provide a 64-bit offset for the instruction pointer (RIP). The 64-
bit RIP referenced by interrupt-gate descriptors allows an interrupt service routine to be located anywhere in the
linear-address space. See Figure 7-8.


[Figure 7-8 shows the 16-byte 64-bit interrupt/trap gate: bytes 3:0 hold the segment selector and offset bits
15:0; bytes 7:4 hold offset bits 31:16, the P flag, the DPL, the type field, and the IST field; bytes 11:8 hold offset
bits 63:32; bytes 15:12 are reserved.

DPL — Descriptor Privilege Level
Offset — Offset to procedure entry point
P — Segment Present flag
Selector — Segment Selector for destination code segment
IST — Interrupt Stack Table]

Figure 7-8. 64-Bit IDT Gate Descriptors

In 64-bit mode, the IDT index is formed by scaling the interrupt vector by 16. The first eight bytes (bytes 7:0) of a
64-bit mode interrupt gate are similar but not identical to legacy 32-bit interrupt gates. The type field (bits 11:8 in
bytes 7:4) is described in Table 3-2. The Interrupt Stack Table (IST) field (bits 2:0 in bytes 7:4) is used by the stack
switching mechanisms described in Section 7.14.5, “Interrupt Stack Table.” Bytes 11:8 hold the upper 32 bits of
the target RIP (interrupt segment offset) in canonical form. A general-protection exception (#GP) is generated if
software attempts to reference an interrupt gate with a target RIP that is not in canonical form.
The target code segment referenced by the interrupt gate must be a 64-bit code segment (CS.L = 1, CS.D = 0). If
the target is not a 64-bit code segment, a general-protection exception (#GP) is generated with the IDT vector
number reported as the error code.
Only 64-bit interrupt and trap gates can be referenced in IA-32e mode (64-bit mode and compatibility mode).
Legacy 32-bit interrupt or trap gate types (0EH or 0FH) are redefined in IA-32e mode as 64-bit interrupt and trap
gate types. No 32-bit interrupt or trap gate type exists in IA-32e mode. If a reference is made to a 16-bit interrupt
or trap gate (06H or 07H), a general-protection exception (#GP(0)) is generated.
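
The gate layout in Figure 7-8 maps naturally onto a packed C structure. The following is a minimal sketch (GCC-
style attributes assumed); the field names and the helper are illustrative, not a defined interface.

    #include <stdint.h>

    /* 16-byte IA-32e interrupt/trap gate (see Figure 7-8). */
    struct idt_gate64 {
        uint16_t offset_15_0;   /* target RIP bits 15:0                              */
        uint16_t selector;      /* destination 64-bit code-segment selector          */
        uint8_t  ist;           /* bits 2:0: IST index (0 = modified legacy switch)  */
        uint8_t  type_attr;     /* P (bit 7), DPL (bits 6:5), type (0xE int, 0xF trap) */
        uint16_t offset_31_16;  /* target RIP bits 31:16                             */
        uint32_t offset_63_32;  /* target RIP bits 63:32 (must be canonical)         */
        uint32_t reserved;      /* bytes 15:12                                       */
    } __attribute__((packed));

    /* Build a present interrupt gate; DPL 3 would permit INT n from user mode. */
    static struct idt_gate64 make_int_gate(uint64_t handler, uint16_t cs,
                                           uint8_t ist, uint8_t dpl)
    {
        struct idt_gate64 g = {0};
        g.offset_15_0  = (uint16_t)handler;
        g.selector     = cs;
        g.ist          = ist & 0x7;
        g.type_attr    = (uint8_t)(0x80 | (dpl << 5) | 0xE);  /* P=1, DPL, type=0xE */
        g.offset_31_16 = (uint16_t)(handler >> 16);
        g.offset_63_32 = (uint32_t)(handler >> 32);
        return g;
    }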

7.14.2 64-Bit Mode Stack Frame


In legacy mode, the size of an IDT entry (16 bits or 32 bits) determines the size of interrupt-stack-frame pushes.
SS:ESP is pushed only on a CPL change. In 64-bit mode, the size of interrupt stack-frame pushes is fixed at eight
bytes. This is because only 64-bit mode gates can be referenced. 64-bit mode also pushes SS:RSP unconditionally,
rather than only on a CPL change.
When shadow stacks are enabled at the interrupt handler’s privilege level and the interrupted procedure was not
executing at a privilege level 3, then the processor pushes the CS:LIP:SSP of the interrupted procedure on the
shadow stack of the interrupt handler (where LIP is the linear address of the return address).
Aside from error codes, pushing SS:RSP unconditionally presents operating systems with a consistent interrupt-
stack-frame size across all interrupts. Interrupt service-routine entry points that handle interrupts generated by the
INT n instruction or external INTR# signal can push an additional error code place-holder to maintain consistency.
In legacy mode, the stack pointer may be at any alignment when an interrupt or exception causes a stack frame to
be pushed. This causes the stack frame and succeeding pushes done by an interrupt handler to be at arbitrary
alignments. In IA-32e mode, the RSP is aligned to a 16-byte boundary before pushing the stack frame. The stack
frame itself is aligned on a 16-byte boundary when the interrupt handler is called. The processor can arbitrarily
realign the new RSP on interrupts because the previous (possibly unaligned) RSP is unconditionally saved on the
newly aligned stack. The previous RSP will be automatically restored by a subsequent IRET.


Aligning the stack permits exception and interrupt frames to be aligned on a 16-byte boundary before interrupts
are re-enabled. This allows the stack to be formatted for optimal storage of 16-byte XMM registers, which enables
the interrupt handler to use faster 16-byte aligned loads and stores (MOVAPS rather than MOVUPS) to save and
restore XMM registers.
Although the RSP alignment is always performed when LMA = 1, it is only of consequence for the kernel-mode case
where there is no stack switch or IST used. For a stack switch or IST, the OS would have presumably put suitably
aligned RSP values in the TSS.
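
For reference, the fixed 64-bit interrupt stack frame can be described by the following C structure. This is a sketch;
the names are illustrative, and the error code is present only for exceptions that push one.

    #include <stdint.h>

    /* 64-bit interrupt stack frame as seen at handler entry; RSP points to
     * the lowest field present. */
    struct interrupt_frame64 {
        uint64_t error_code;  /* pushed only for certain exceptions          */
        uint64_t rip;         /* return instruction pointer                  */
        uint64_t cs;          /* return code segment, zero-extended          */
        uint64_t rflags;      /* saved RFLAGS                                */
        uint64_t rsp;         /* previous RSP — pushed unconditionally       */
        uint64_t ss;          /* previous SS  — pushed unconditionally       */
    };

Because the processor aligns RSP to 16 bytes before pushing this frame, a handler may rely on the frame's
alignment for 16-byte-aligned XMM saves, as described above.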

7.14.3 IRET in IA-32e Mode


In IA-32e mode, IRET executes with an 8-byte operand size. There is nothing that forces this requirement. The
stack is formatted in such a way that for actions where IRET is required, the 8-byte IRET operand size works
correctly.
Because interrupt stack-frame pushes are always eight bytes in IA-32e mode, an IRET must pop eight byte items
off the stack. This is accomplished by preceding the IRET with a 64-bit operand-size prefix. The size of the pop is
determined by the address size of the instruction. The SS/ESP/RSP size adjustment is determined by the stack
size.
IRET pops SS:RSP unconditionally off the interrupt stack frame only when it is executed in 64-bit mode. In compat-
ibility mode, IRET pops SS:RSP off the stack only if there is a CPL change. This allows legacy applications to
execute properly in compatibility mode when using the IRET instruction. 64-bit interrupt service routines that exit
with an IRET unconditionally pop SS:RSP off of the interrupt stack frame, even if the target code segment is
running in 64-bit mode or at CPL = 0. This is because the original interrupt always pushes SS:RSP.
When shadow stacks are enabled and the target privilege level is not 3, the CS:LIP from the shadow stack frame is
compared to the return linear address formed by CS:EIP from the stack. If they do not match, the processor
causes a control protection exception (#CP(FAR-RET/IRET)); else the processor pops the SSP of the interrupted
procedure from the shadow stack. If the target privilege level is 3 and shadow stacks are enabled at privilege level
3, then the SSP for the interrupted procedure is restored from the IA32_PL3_SSP MSR.
In IA-32e mode, IRET is allowed to load a NULL SS under certain conditions. If the target mode is 64-bit mode and
the target CPL ≠ 3, IRET allows SS to be loaded with a NULL selector. As part of the stack switch mechanism, an
interrupt or exception sets the new SS to NULL, instead of fetching a new SS selector from the TSS and loading the
corresponding descriptor from the GDT or LDT. The new SS selector is set to NULL in order to properly handle
returns from subsequent nested far transfers. If the called procedure itself is interrupted, the NULL SS is pushed
on the stack frame. On the subsequent IRET, the NULL SS on the stack acts as a flag to tell the processor not to
load a new SS descriptor.

7.14.4 Stack Switching in IA-32e Mode


The IA-32 architecture provides a mechanism to automatically switch stack frames in response to an interrupt. The
64-bit extensions of Intel 64 architecture implement a modified version of the legacy stack-switching mechanism
and an alternative stack-switching mechanism called the interrupt stack table (IST).
In IA-32 modes, the legacy IA-32 stack-switch mechanism is unchanged. In IA-32e mode, the legacy stack-switch
mechanism is modified. When stacks are switched as part of a 64-bit mode privilege-level change (resulting from
an interrupt), a new SS descriptor is not loaded. IA-32e mode loads only an inner-level RSP from the TSS. The new
SS selector is forced to NULL and the SS selector’s RPL field is set to the new CPL. The new SS is set to NULL in
order to handle nested far transfers (far CALL, INT, interrupts, and exceptions). The old SS and RSP are saved on
the new stack (Figure 7-9). On the subsequent IRET, the old SS is popped from the stack and loaded into the SS
register.
In summary, a stack switch in IA-32e mode works like the legacy stack switch, except that a new SS selector is not
loaded from the TSS. Instead, the new SS is forced to NULL.


[Figure 7-9 contrasts the handler's stack after a privilege-level change. Legacy mode pushes SS (+20), ESP (+16),
EFLAGS (+12), CS (+8), EIP (+4), and the error code (0). IA-32e mode pushes SS (+40), RSP (+32), RFLAGS (+24),
CS (+16), RIP (+8), and the error code (0); the stack pointer after the transfer points to the error code.]

Figure 7-9. IA-32e Mode Stack Usage After Privilege Level Change

7.14.5 Interrupt Stack Table


In IA-32e mode, a new interrupt stack table (IST) mechanism is available as an alternative to the modified legacy
stack-switching mechanism described above. This mechanism unconditionally switches stacks when it is enabled.
It can be enabled on an individual interrupt-vector basis using a field in the IDT entry. This means that some inter-
rupt vectors can use the modified legacy mechanism and others can use the IST mechanism.
The IST mechanism is only available in IA-32e mode. It is part of the 64-bit mode TSS. The motivation for the IST
mechanism is to provide a method for specific interrupts (such as NMI, double-fault, and machine-check) to always
execute on a known good stack. In legacy mode, interrupts can use the task-switch mechanism to set up a known-
good stack by accessing the interrupt service routine through a task gate located in the IDT. However, the legacy
task-switch mechanism is not supported in IA-32e mode.
The IST mechanism provides up to seven IST pointers in the TSS. The pointers are referenced by an interrupt-gate
descriptor in the interrupt-descriptor table (IDT); see Figure 7-8. The gate descriptor contains a 3-bit IST index
field that provides an offset into the IST section of the TSS. Using the IST mechanism, the processor loads the value
pointed by an IST pointer into the RSP.
When an interrupt occurs, the new SS selector is forced to NULL and the SS selector’s RPL field is set to the new
CPL. The old SS, RSP, RFLAGS, CS, and RIP are pushed onto the new stack. Interrupt processing then proceeds as
normal. If the IST index is zero, the modified legacy stack-switching mechanism described above is used.
To support this stack-switching mechanism with shadow stacks enabled, the processor provides an MSR,
IA32_INTERRUPT_SSP_TABLE, to program the linear address of a table of seven shadow stack pointers that are
selected using the IST index from the gate descriptor. To switch to a shadow stack selected from the interrupt
shadow stack table pointed to by the IA32_INTERRUPT_SSP_TABLE, the processor requires that the shadow stack
addresses programmed into this table point to a supervisor shadow stack token; see Figure 7-10.


[Figure 7-10 shows the interrupt shadow stack table located by the IA32_INTERRUPT_SSP_TABLE MSR: the 8-byte
entry at offset 0 is not used (available), and the entries at offsets 8 through 56 hold the shadow stack pointers for
IST1 through IST7.]

Figure 7-10. Interrupt Shadow Stack Table
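
A kernel enables these mechanisms by writing an IST pointer into the 64-bit TSS and, when shadow stacks are in
use, programming the interrupt shadow stack table. The sketch below assumes a GCC-style environment; the
partial TSS layout, the wrmsr helper, and the MSR index (0x6A8 for IA32_INTERRUPT_SSP_TABLE) should be
verified against Volume 4.

    #include <stdint.h>

    #define IA32_INTERRUPT_SSP_TABLE_MSR 0x6A8  /* verify against Volume 4 */

    struct tss64 {
        uint32_t reserved0;
        uint64_t rsp[3];   /* RSP0–RSP2 for the modified legacy stack switch */
        uint64_t reserved1;
        uint64_t ist[7];   /* IST1–IST7 stack pointers */
        /* ... remaining TSS fields omitted in this sketch ... */
    } __attribute__((packed));

    static inline void wrmsr(uint32_t msr, uint64_t v)
    {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)v), "d"((uint32_t)(v >> 32)));
    }

    /* Entry 0 of the table is unused; entries 1–7 must point to supervisor
     * shadow stack tokens. */
    static uint64_t issp_table[8] __attribute__((aligned(8)));

    void setup_ist1(struct tss64 *tss, uint64_t stack_top, uint64_t shadow_stack)
    {
        tss->ist[0] = stack_top;      /* IST1: e.g., for an NMI or #DF gate */
        issp_table[1] = shadow_stack; /* must hold a supervisor shadow stack token */
        wrmsr(IA32_INTERRUPT_SSP_TABLE_MSR, (uint64_t)issp_table);
    }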

7.15 EXCEPTION AND INTERRUPT REFERENCE


The following sections describe conditions which generate exceptions and interrupts. They are arranged in the
order of vector numbers. The information contained in these sections are as follows:
• Exception Class — Indicates whether the exception class is a fault, trap, or abort type. Some exceptions can
be either a fault or trap type, depending on when the error condition is detected. (This section is not applicable
to interrupts.)
• Description — Gives a general description of the purpose of the exception or interrupt type. It also describes
how the processor handles the exception or interrupt.
• Exception Error Code — Indicates whether an error code is saved for the exception. If one is saved, the
contents of the error code are described. (This section is not applicable to interrupts.)
• Saved Instruction Pointer — Describes which instruction the saved (or return) instruction pointer points to.
It also indicates whether the pointer can be used to restart a faulting instruction.
• Program State Change — Describes the effects of the exception or interrupt on the state of the currently
running program or task and the possibilities of restarting the program or task without loss of continuity.


Interrupt 0—Divide Error Exception (#DE)

Exception Class Fault.

Description
Indicates the divisor operand for a DIV or IDIV instruction is 0 or that the result cannot be represented in the
number of bits specified for the destination operand.

Exception Error Code


None.

Saved Instruction Pointer


Saved contents of CS and EIP registers point to the instruction that generated the exception.

Program State Change


A program-state change does not accompany the divide error, because the exception occurs before the faulting
instruction is executed.


Interrupt 1—Debug Exception (#DB)

Exception Class Trap or Fault. The exception handler can distinguish between traps and faults by exam-
ining the contents of DR6 and the other debug registers.

Description
Indicates that one or more of several debug-exception conditions has been detected. Whether the exception is a
fault or a trap depends on the condition (see Table 7-3). See Chapter 19, “Debug, Branch Profile, TSC, and Intel®
Resource Director Technology (Intel® RDT) Features,” for detailed information about the debug exceptions.

Table 7-3. Debug Exception Conditions and Corresponding Exception Classes


Exception Condition Exception Class
Instruction fetch breakpoint Fault
Data read or write breakpoint Trap
I/O read or write breakpoint Trap
General detect condition (in conjunction with in-circuit emulation) Fault
Single-step Trap
Task-switch Trap
Execution of INT1 (see note 1) Trap
NOTES:
1. Hardware vendors may use the INT1 instruction for hardware debug. For that reason, Intel recommends software vendors instead
use the INT3 instruction for software breakpoints.

Exception Error Code


None. An exception handler can examine the debug registers to determine which condition caused the exception.

Saved Instruction Pointer


Fault — Saved contents of CS and EIP registers point to the instruction that generated the exception.
Trap — Saved contents of CS and EIP registers point to the instruction following the instruction that generated the
exception.

Program State Change


Fault — A program-state change does not accompany the debug exception, because the exception occurs before
the faulting instruction is executed. The program can resume normal execution upon returning from the debug
exception handler.
Trap — A program-state change does accompany the debug exception, because the instruction or task switch being
executed is allowed to complete before the exception is generated. However, the new state of the program is not
corrupted and execution of the program can continue reliably.
The following items detail the treatment of debug exceptions on the instruction boundary following execution of the
MOV or the POP instruction that loads the SS register:
• If EFLAGS.TF is 1, no single-step trap is generated.
• If the instruction encounters a data breakpoint, the resulting debug exception is delivered after completion of
the instruction after the MOV or POP. This occurs even if the next instruction is INT n, INT3, or INTO.
• Any instruction breakpoint on the instruction after the MOV or POP is suppressed (as if EFLAGS.RF were 1).
Any debug exception inside an RTM region causes a transactional abort and, by default, redirects control flow to the
fallback instruction address. If advanced debugging of RTM transactional regions has been enabled, any transac-
tional abort due to a debug exception instead causes execution to roll back to just before the XBEGIN instruction


and then delivers a #DB. See Section 17.3.7, “RTM-Enabled Debugger Support,” of Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 1.


Interrupt 2—NMI Interrupt

Exception Class Not applicable.

Description
The nonmaskable interrupt (NMI) is generated externally by asserting the processor’s NMI pin or through an NMI
request set by the I/O APIC to the local APIC. This interrupt causes the NMI interrupt handler to be called.

Exception Error Code


Not applicable.

Saved Instruction Pointer


The processor always takes an NMI interrupt on an instruction boundary. The saved contents of CS and EIP regis-
ters point to the next instruction to be executed at the point the interrupt is taken. See Section 7.5, “Exception
Classifications,” for more information about when the processor takes NMI interrupts.

Program State Change


The instruction executing when an NMI interrupt is received is completed before the NMI is generated. A program
or task can thus be restarted upon returning from an interrupt handler without loss of continuity, provided the
interrupt handler saves the state of the processor before handling the interrupt and restores the processor’s state
prior to a return.


Interrupt 3—Breakpoint Exception (#BP)

Exception Class Trap.

Description
Indicates that a breakpoint instruction (INT3, opcode CC) was executed, causing a breakpoint trap to be gener-
ated. Typically, a debugger sets a breakpoint by replacing the first opcode byte of an instruction with the opcode for
the INT3 instruction. (The INT3 instruction is one byte long, which makes it easy to replace an opcode in a code
segment in RAM with the breakpoint opcode.) The operating system or a debugging tool can use a data segment
mapped to the same physical address space as the code segment to place an INT3 instruction in places where it is
desired to call the debugger.
With the P6 family, Pentium, Intel486, and Intel386 processors, it is more convenient to set breakpoints with the
debug registers. (See Section 19.3.2, “Breakpoint Exception (#BP)—Interrupt Vector 3,” for information about the
breakpoint exception.) If more breakpoints are needed beyond what the debug registers allow, the INT3 instruction
can be used.
Any breakpoint exception inside an RTM region causes a transactional abort and, by default, redirects control flow
to the fallback instruction address. If advanced debugging of RTM transactional regions has been enabled, any
transactional abort due to a break exception instead causes execution to roll back to just before the XBEGIN
instruction and then delivers a debug exception (#DB) — not a breakpoint exception. See Section 17.3.7, “RTM-
Enabled Debugger Support,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
A breakpoint exception can also be generated by executing the INT n instruction with an operand of 3. The action
of this instruction (INT 3) is slightly different than that of the INT3 instruction (see “INT n/INTO/INT3/INT1—Call to
Interrupt Procedure” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
2A).

Exception Error Code


None.

Saved Instruction Pointer


Saved contents of CS and EIP registers point to the instruction following the INT3 instruction.

Program State Change


Even though the EIP points to the instruction following the breakpoint instruction, the state of the program is
essentially unchanged because the INT3 instruction does not affect any register or memory locations. The
debugger can thus resume the suspended program by replacing the INT3 instruction that caused the breakpoint
with the original opcode and decrementing the saved contents of the EIP register. Upon returning from the
debugger, program execution resumes with the replaced instruction.
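
The patch-and-resume sequence described above can be sketched in C. This is a simplified illustration: a real
debugger would go through OS debug APIs and handle instruction-cache coherency, and the code page is assumed
writable here.

    #include <stdint.h>

    #define INT3_OPCODE 0xCC

    /* Plant a breakpoint: save the original first opcode byte, write INT3. */
    static uint8_t set_breakpoint(uint8_t *insn)
    {
        uint8_t orig = *insn;
        *insn = INT3_OPCODE;
        return orig;
    }

    /* On #BP, the saved instruction pointer references the byte after the
     * one-byte INT3. Restore the opcode and step the saved pointer back so
     * the original instruction executes on resume. */
    static void clear_breakpoint(uint8_t *insn, uint8_t orig, uint64_t *saved_ip)
    {
        *insn = orig;
        *saved_ip -= 1;
    }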


Interrupt 4—Overflow Exception (#OF)

Exception Class Trap.

Description
Indicates that an overflow trap occurred when an INTO instruction was executed. The INTO instruction checks the
state of the OF flag in the EFLAGS register. If the OF flag is set, an overflow trap is generated.
Some arithmetic instructions (such as the ADD and SUB) perform both signed and unsigned arithmetic. These
instructions set the OF and CF flags in the EFLAGS register to indicate signed overflow and unsigned overflow,
respectively. When performing arithmetic on signed operands, the OF flag can be tested directly or the INTO
instruction can be used. The benefit of using the INTO instruction is that if the overflow exception is detected, an
exception handler can be called automatically to handle the overflow condition.
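
As a concrete illustration, a 32-bit code sequence can follow a signed addition with INTO so that the #OF handler
is invoked automatically on overflow. This sketch uses GCC inline assembly and must be compiled as 32-bit code,
since INTO is not encodable in 64-bit mode.

    /* Signed add; traps to the overflow handler (#OF) if OF is set. */
    static int add_with_into(int a, int b)
    {
        int r = a;
        __asm__ volatile("addl %1, %0\n\t"
                         "into"         /* raises #OF if the addition overflowed */
                         : "+r"(r)
                         : "r"(b)
                         : "cc");
        return r;
    }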

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction following the INTO instruction.

Program State Change


Even though the EIP points to the instruction following the INTO instruction, the state of the program is essentially
unchanged because the INTO instruction does not affect any register or memory locations. The program can thus
resume normal execution upon returning from the overflow exception handler.


Interrupt 5—BOUND Range Exceeded Exception (#BR)

Exception Class Fault.

Description
Indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed. The BOUND
instruction checks that a signed array index is within the upper and lower bounds of an array located in memory. If
the array index is not within the bounds of the array, a BOUND-range-exceeded fault is generated.

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the BOUND instruction that generated the exception.

Program State Change


A program-state change does not accompany the bounds-check fault, because the operands for the BOUND
instruction are not modified. Returning from the BOUND-range-exceeded exception handler causes the BOUND
instruction to be restarted.


Interrupt 6—Invalid Opcode Exception (#UD)

Exception Class Fault.

Description
Indicates that the processor did one of the following things:
• Attempted to execute an invalid or reserved opcode.
• Attempted to execute an instruction with an operand type that is invalid for its accompanying opcode; for
example, the source operand for a LES instruction is not a memory location.
• Attempted to execute an MMX or SSE/SSE2/SSE3 instruction on an Intel 64 or IA-32 processor that does not
support the MMX technology or SSE/SSE2/SSE3/SSSE3 extensions, respectively. CPUID feature flags MMX (bit
23), SSE (bit 25), SSE2 (bit 26), SSE3 (ECX, bit 0), SSSE3 (ECX, bit 9) indicate support for these extensions.
• Attempted to execute an MMX instruction or SSE/SSE2/SSE3/SSSE3 SIMD instruction (with the exception of
the MOVNTI, PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, CLFLUSH, MONITOR, and MWAIT instructions)
when the EM flag in control register CR0 is set (1).
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction when the OSFXSR bit in control register CR4 is
clear (0). Note this does not include the following SSE/SSE2/SSE3 instructions: MASKMOVQ, MOVNTQ,
MOVNTI, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH; or the 64-bit versions of the PAVGB, PAVGW,
PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, PADDQ,
PSUBQ, PALIGNR, PABSB, PABSD, PABSW, PHADDD, PHADDSW, PHADDW, PHSUBD, PHSUBSW, PHSUBW,
PMADDUBSW, PMULHRSW, PSHUFB, PSIGNB, PSIGND, and PSIGNW.
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction on an Intel 64 or IA-32 processor that caused a
SIMD floating-point exception when the OSXMMEXCPT bit in control register CR4 is clear (0).
• Executed a UD0, UD1, or UD2 instruction. Note that even though it is the execution of the UD0, UD1, or UD2
instruction that causes the invalid opcode exception, the saved instruction pointer still points to the UD0,
UD1, or UD2 instruction.
• Detected a LOCK prefix that precedes an instruction that may not be locked or one that may be locked but the
destination operand is not a memory location.
• Attempted to execute an LLDT, SLDT, LTR, STR, LSL, LAR, VERR, VERW, or ARPL instruction while in real-
address or virtual-8086 mode.
• Attempted to execute the RSM instruction when not in SMM mode.
In Intel 64 and IA-32 processors that implement out-of-order execution microarchitectures, this exception is not
generated until an attempt is made to retire the result of executing an invalid instruction; that is, decoding and
speculatively attempting to execute an invalid opcode does not generate this exception. Likewise, in the Pentium
processor and earlier IA-32 processors, this exception is not generated as the result of prefetching and preliminary
decoding of an invalid instruction. (See Section 7.5, “Exception Classifications,” for general rules for taking of inter-
rupts and exceptions.)
The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64 and IA-32 architectures. These opcodes,
even though undefined, do not generate an invalid opcode exception.
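
Before executing such instructions, software can test the CPUID feature flags listed above. The sketch below uses
GCC's <cpuid.h>; other toolchains provide equivalent intrinsics.

    #include <cpuid.h>
    #include <stdbool.h>

    /* Return true if both SSE3 (ECX bit 0) and SSSE3 (ECX bit 9) are supported. */
    static bool has_sse3_ssse3(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;
        return (ecx & bit_SSE3) && (ecx & bit_SSSE3);
    }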

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction that generated the exception.

Program State Change


A program-state change does not accompany an invalid-opcode fault, because the invalid instruction is not
executed.


Interrupt 7—Device Not Available Exception (#NM)

Exception Class Fault.

Description
Indicates that the device-not-available exception was generated by any of three conditions:
• The processor executed an x87 FPU floating-point instruction while the EM flag in control register CR0 was set
(1). See the paragraph below for the special case of the WAIT/FWAIT instruction.
• The processor executed a WAIT/FWAIT instruction while the MP and TS flags of register CR0 were set,
regardless of the setting of the EM flag.
• The processor executed an x87 FPU, MMX, or SSE/SSE2/SSE3 instruction (with the exception of MOVNTI,
PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH) while the TS flag in control register CR0 was set
and the EM flag is clear.
The EM flag is set when the processor does not have an internal x87 FPU floating-point unit. A device-not-available
exception is then generated each time an x87 FPU floating-point instruction is encountered, allowing an exception
handler to call floating-point instruction emulation routines.
The TS flag indicates that a context switch (task switch) has occurred since the last time an x87 floating-point,
MMX, or SSE/SSE2/SSE3 instruction was executed; but that the context of the x87 FPU, XMM, and MXCSR registers
were not saved. When the TS flag is set and the EM flag is clear, the processor generates a device-not-available
exception each time an x87 floating-point, MMX, or SSE/SSE2/SSE3 instruction is encountered (with the exception
of the instructions listed above). The exception handler can then save the context of the x87 FPU, XMM, and MXCSR
registers before it executes the instruction. See Section 2.5, “Control Registers,” for more information about the TS
flag.
The MP flag in control register CR0 is used along with the TS flag to determine if WAIT or FWAIT instructions should
generate a device-not-available exception. It extends the function of the TS flag to the WAIT and FWAIT instruc-
tions, giving the exception handler an opportunity to save the context of the x87 FPU before the WAIT or FWAIT
instruction is executed. The MP flag is provided primarily for use with the Intel 286 and Intel386 DX processors. For
programs running on the Pentium 4, Intel Xeon, P6 family, Pentium, or Intel486 DX processors, or the Intel 487 SX
coprocessors, the MP flag should always be set; for programs running on the Intel486 SX processor, the MP flag
should be clear.

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the floating-point instruction or the WAIT/FWAIT instruction
that generated the exception.

Program State Change


A program-state change does not accompany a device-not-available fault, because the instruction that generated
the exception is not executed.
If the EM flag is set, the exception handler can then read the floating-point instruction pointed to by the EIP and call
the appropriate emulation routine.
If the MP and TS flags are set or the TS flag alone is set, the exception handler can save the context of the x87 FPU,
clear the TS flag, and continue execution at the interrupted floating-point or WAIT/FWAIT instruction.
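
A lazy-context-switching kernel typically implements this with CLTS followed by an FXRSTOR of the incoming
task's state. The sketch below assumes a GCC-style kernel environment and FXSAVE-format state; the
bookkeeping around fpu_state is hypothetical.

    #include <stdint.h>

    /* FXSAVE/FXRSTOR area of the task whose state should become active. */
    extern uint8_t fpu_state[512] __attribute__((aligned(16)));

    /* #NM handler body for the CR0.TS case (EM clear). */
    void device_not_available(void)
    {
        __asm__ volatile("clts");                           /* clear CR0.TS (privileged) */
        __asm__ volatile("fxrstor %0" :: "m"(*fpu_state));  /* reload x87/XMM/MXCSR */
        /* On return, the interrupted floating-point instruction restarts. */
    }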


Interrupt 8—Double Fault Exception (#DF)

Exception Class Abort.

Description
Indicates that the processor detected a second exception while calling an exception handler for a prior exception.
Normally, when the processor detects another exception while trying to call an exception handler, the two excep-
tions can be handled serially. If, however, the processor cannot handle them serially, it signals the double-fault
exception. To determine when two faults need to be signalled as a double fault, the processor divides the excep-
tions into three classes: benign exceptions, contributory exceptions, and page faults (see Table 7-4).

Table 7-4. Interrupt and Exception Classes


Class Vector Number Description
Benign Exceptions and Interrupts 1 Debug
2 NMI Interrupt
3 Breakpoint
4 Overflow
5 BOUND Range Exceeded
6 Invalid Opcode
7 Device Not Available
9 Coprocessor Segment Overrun
16 Floating-Point Error
17 Alignment Check
18 Machine Check
19 SIMD floating-point
All INT n
All INTR
Contributory Exceptions 0 Divide Error
10 Invalid TSS
11 Segment Not Present
12 Stack Fault
13 General Protection
21 Control Protection
Page Faults 14 Page Fault
20 Virtualization Exception

Table 7-5 shows the various combinations of exception classes that cause a double fault to be generated. A double-
fault exception falls in the abort class of exceptions. The program or task cannot be restarted or resumed. The
double-fault handler can be used to collect diagnostic information about the state of the machine and/or, when
possible, to shut the application and/or system down gracefully or restart the system.


A segment or page fault may be encountered while prefetching instructions; however, this behavior is outside the
domain of Table 7-5. Any further faults generated while the processor is attempting to transfer control to the appro-
priate fault handler could still lead to a double-fault sequence.

Table 7-5. Conditions for Generating a Double Fault

                   Second Exception:
First Exception    Benign                      Contributory                Page Fault
Benign             Handle Exceptions Serially  Handle Exceptions Serially  Handle Exceptions Serially
Contributory       Handle Exceptions Serially  Generate a Double Fault     Handle Exceptions Serially
Page Fault         Handle Exceptions Serially  Generate a Double Fault     Generate a Double Fault
Double Fault       Handle Exceptions Serially  Enter Shutdown Mode         Enter Shutdown Mode

If another contributory or page fault exception occurs while attempting to call the double-fault handler, the
processor enters shutdown mode. This mode is similar to the state following execution of an HLT instruction. In this
mode, the processor stops executing instructions until an NMI interrupt, SMI interrupt, hardware reset, or INIT# is
received. The processor generates a special bus cycle to indicate that it has entered shutdown mode. Software
designers may need to be aware of the response of hardware when it goes into shutdown mode. For example, hard-
ware may turn on an indicator light on the front panel, generate an NMI interrupt to record diagnostic information,
invoke reset initialization, generate an INIT initialization, or generate an SMI. If any events are pending during
shutdown, they will be handled after a wake event from shutdown is processed (for example, A20M# interrupts).
If a shutdown occurs while the processor is executing an NMI interrupt handler, then only a hardware reset can
restart the processor. Likewise, if the shutdown occurs while executing in SMM, a hardware reset must be used to
restart the processor.
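
The decision matrix of Table 7-5 can be expressed compactly in C; the sketch below is illustrative only.

    /* Classification of a second exception per Table 7-5. */
    enum ex_class  { BENIGN, CONTRIBUTORY, PAGE_FAULT, DOUBLE_FAULT };
    enum df_action { HANDLE_SERIALLY, GENERATE_DOUBLE_FAULT, ENTER_SHUTDOWN };

    static enum df_action classify(enum ex_class first, enum ex_class second)
    {
        if (second == BENIGN)
            return HANDLE_SERIALLY;   /* benign second exceptions never escalate */
        if (first == DOUBLE_FAULT)
            return ENTER_SHUTDOWN;    /* contributory or #PF while calling the #DF handler */
        if (second == CONTRIBUTORY)
            return (first == BENIGN) ? HANDLE_SERIALLY : GENERATE_DOUBLE_FAULT;
        /* second == PAGE_FAULT */
        return (first == PAGE_FAULT) ? GENERATE_DOUBLE_FAULT : HANDLE_SERIALLY;
    }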

Exception Error Code


Zero. The processor always pushes an error code of 0 onto the stack of the double-fault handler.

Saved Instruction Pointer


The saved contents of CS and EIP registers are undefined.

Program State Change


A program-state following a double-fault exception is undefined. The program or task cannot be resumed or
restarted. The only available action of the double-fault exception handler is to collect all possible context informa-
tion for use in diagnostics and then close the application and/or shut down or reset the processor.
If the double fault occurs when any portion of the exception handling machine state is corrupted, the handler
cannot be invoked and the processor must be reset.


Interrupt 9—Coprocessor Segment Overrun

Exception Class Abort. (Intel reserved; do not use. Recent IA-32 processors do not generate this
exception.)

Description
Indicates that an Intel386 CPU-based system with an Intel 387 math coprocessor detected a page or segment
violation while transferring the middle portion of an Intel 387 math coprocessor operand. The P6 family, Pentium,
and Intel486 processors do not generate this exception; instead, this condition is detected with a general protec-
tion exception (#GP), interrupt 13.

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction that generated the exception.

Program State Change


A program-state following a coprocessor segment-overrun exception is undefined. The program or task cannot
be resumed or restarted. The only available action of the exception handler is to save the instruction pointer and
reinitialize the x87 FPU using the FNINIT instruction.


Interrupt 10—Invalid TSS Exception (#TS)

Exception Class Fault.

Description
Indicates that there was an error related to a TSS. Such an error might be detected during a task switch or during
the execution of instructions that use information from a TSS. Table 7-6 shows the conditions that cause an invalid
TSS exception to be generated.

Table 7-6. Invalid TSS Conditions


Error Code Index Invalid Condition
TSS segment selector index The TSS segment limit is less than 67H for 32-bit TSS or less than 2CH for 16-bit TSS.
TSS segment selector index During an IRET task switch, the TI flag in the TSS segment selector indicates the LDT.
TSS segment selector index During an IRET task switch, the TSS segment selector exceeds descriptor table limit.
TSS segment selector index During an IRET task switch, the busy flag in the TSS descriptor indicates an inactive task.
TSS segment selector index During a task switch, an attempt to access data in a TSS results in a limit violation or
canonical fault.
TSS segment selector index During an IRET task switch, the backlink is a NULL selector.
TSS segment selector index During an IRET task switch, the backlink points to a descriptor which is not a busy TSS.
TSS segment selector index The new TSS descriptor is beyond the GDT limit.
TSS segment selector index The new TSS selector is null on an attempt to lock the new TSS.
TSS segment selector index The new TSS selector has the TI bit set on an attempt to lock the new TSS.
TSS segment selector index The new TSS descriptor is not an available TSS descriptor on an attempt to lock the new
TSS.
LDT segment selector index LDT not valid or not present.
Stack segment selector index The stack segment selector exceeds descriptor table limit.
Stack segment selector index The stack segment selector is NULL.
Stack segment selector index The stack segment descriptor is a non-data segment.
Stack segment selector index The stack segment is not writable.
Stack segment selector index The stack segment DPL ≠ CPL.
Stack segment selector index The stack segment selector RPL ≠ CPL.
Code segment selector index The code segment selector exceeds descriptor table limit.
Code segment selector index The code segment selector is NULL.
Code segment selector index The code segment descriptor is not a code segment type.
Code segment selector index The nonconforming code segment DPL ≠ CPL.
Code segment selector index The conforming code segment DPL is greater than CPL.
Data segment selector index The data segment selector exceeds the descriptor table limit.
Data segment selector index The data segment descriptor is not a readable code or data type.
Data segment selector index The data segment descriptor is a nonconforming code type and RPL > DPL.
Data segment selector index The data segment descriptor is a nonconforming code type and CPL > DPL.
TSS segment selector index The TSS segment descriptor/upper descriptor is beyond the GDT segment limit.
TSS segment selector index The TSS segment descriptor is not an available TSS type.
TSS segment selector index The TSS segment descriptor is an available 286 TSS type in IA-32e mode.

TSS segment selector index The TSS segment upper descriptor is not the correct type.
TSS segment selector index The TSS segment descriptor contains a non-canonical base.

This exception can be generated either in the context of the original task or in the context of the new task (see Section
9.3, “Task Switching”). Until the processor has completely verified the presence of the new TSS, the exception is
generated in the context of the original task. Once the existence of the new TSS is verified, the task switch is
considered complete. Any invalid-TSS conditions detected after this point are handled in the context of the new
task. (A task switch is considered complete when the task register is loaded with the segment selector for the new
TSS and, if the switch is due to a procedure call or interrupt, the previous task link field of the new TSS references
the old TSS.)
The invalid-TSS handler must be a task called using a task gate. Handling this exception inside the faulting TSS
context is not recommended because the processor state may not be consistent.

Exception Error Code


An error code containing the segment selector index for the segment descriptor that caused the violation is pushed
onto the stack of the exception handler. If the EXT flag is set, it indicates that the exception was caused by an event
external to the currently running program (for example, if an external interrupt handler using a task gate
attempted a task switch to an invalid TSS).

Saved Instruction Pointer


If the exception condition was detected before the task switch was carried out, the saved contents of CS and EIP
registers point to the instruction that invoked the task switch. If the exception condition was detected after the task
switch was carried out, the saved contents of CS and EIP registers point to the first instruction of the new task.

Program State Change


The ability of the invalid-TSS handler to recover from the fault depends on the error condition that causes the fault.
See Section 9.3, “Task Switching,” for more information on the task switch process and the possible recovery
actions that can be taken.
If an invalid TSS exception occurs during a task switch, it can occur before or after the commit-to-new-task point.
If it occurs before the commit point, no program state change occurs. If it occurs after the commit point (when the
segment descriptor information for the new segment selectors has been loaded in the segment registers), the
processor will load all the state information from the new TSS before it generates the exception. During a task
switch, the processor first loads all the segment registers with segment selectors from the TSS, then checks their
contents for validity. If an invalid TSS exception is discovered, the remaining segment registers are loaded but not
checked for validity and therefore may not be usable for referencing memory. The invalid TSS handler should not
rely on being able to use the segment selectors found in the CS, SS, DS, ES, FS, and GS registers without causing
another exception. The exception handler should load all segment registers before trying to resume the new task;
otherwise, general-protection exceptions (#GP) may result later under conditions that make diagnosis more diffi-
cult. The Intel-recommended way of dealing with this situation is to use a task for the invalid TSS exception handler. The task
switch back to the interrupted task from the invalid-TSS exception-handler task will then cause the processor to
check the registers as it loads them from the TSS.


Interrupt 11—Segment Not Present (#NP)

Exception Class Fault.

Description
Indicates that the present flag of a segment or gate descriptor is clear. The processor can generate this exception
during any of the following operations:
• While attempting to load CS, DS, ES, FS, or GS registers. [Detection of a not-present segment while loading the
SS register causes a stack fault exception (#SS) to be generated.] This situation can occur while performing a
task switch.
• While attempting to load the LDTR using an LLDT instruction. Detection of a not-present LDT while loading the
LDTR during a task switch operation causes an invalid-TSS exception (#TS) to be generated.
• When executing the LTR instruction and the TSS is marked not present.
• While attempting to use a gate descriptor or TSS that is marked segment-not-present, but is otherwise valid.
An operating system typically uses the segment-not-present exception to implement virtual memory at the
segment level. If the exception handler loads the segment and returns, the interrupted program or task resumes
execution.
A not-present indication in a gate descriptor, however, does not indicate that a segment is not present (because
gates do not correspond to segments). The operating system may use the present flag for gate descriptors to
trigger exceptions of special significance to the operating system.
A contributory exception or page fault that subsequently referenced a not-present segment would cause a double
fault (#DF) to be generated instead of #NP.

Exception Error Code


An error code containing the segment selector index for the segment descriptor that caused the violation is pushed
onto the stack of the exception handler. If the EXT flag is set, it indicates that the exception resulted from either:
• an external event (NMI or INTR) that caused an interrupt, which subsequently referenced a not-present
segment
• a benign exception that subsequently referenced a not-present segment
The IDT flag is set if the error code refers to an IDT entry. This occurs when the IDT entry for an interrupt being
serviced references a not-present gate. Such an event could be generated by an INT instruction or a hardware
interrupt.

Saved Instruction Pointer


The saved contents of CS and EIP registers normally point to the instruction that generated the exception. If the
exception occurred while loading segment descriptors for the segment selectors in a new TSS, the CS and EIP
registers point to the first instruction in the new task. If the exception occurred while accessing a gate descriptor,
the CS and EIP registers point to the instruction that invoked the access (for example a CALL instruction that refer-
ences a call gate).

Program State Change


If the segment-not-present exception occurs as the result of loading a register (CS, DS, SS, ES, FS, GS, or LDTR),
a program-state change does accompany the exception because the register is not loaded. Recovery from this
exception is possible by simply loading the missing segment into memory and setting the present flag in the
segment descriptor.
If the segment-not-present exception occurs while accessing a gate descriptor, a program-state change does not
accompany the exception. Recovery from this exception is possible merely by setting the present flag in the gate
descriptor.
If a segment-not-present exception occurs during a task switch, it can occur before or after the commit-to-new-
task point (see Section 9.3, “Task Switching”). If it occurs before the commit point, no program state change


occurs. If it occurs after the commit point, the processor will load all the state information from the new TSS
(without performing any additional limit, present, or type checks) before it generates the exception. The segment-
not-present exception handler should not rely on being able to use the segment selectors found in the CS, SS, DS,
ES, FS, and GS registers without causing another exception. (See the Program State Change description for “Inter-
rupt 10—Invalid TSS Exception (#TS)” in this chapter for additional information on how to handle this situation.)


Interrupt 12—Stack Fault Exception (#SS)

Exception Class Fault.

Description
Indicates that one of the following stack related conditions was detected:
• A limit violation is detected during an operation that refers to the SS register. Operations that can cause a limit
violation include stack-oriented instructions such as POP, PUSH, CALL, RET, IRET, ENTER, and LEAVE, as well as
other memory references which implicitly or explicitly use the SS register (for example, MOV AX, [BP+6] or
MOV AX, SS:[EAX+6]). The ENTER instruction generates this exception when there is not enough stack space
for allocating local variables.
• A not-present stack segment is detected when attempting to load the SS register. This violation can occur
during the execution of a task switch, a CALL instruction to a different privilege level, a return to a different
privilege level, an LSS instruction, or a MOV or POP instruction to the SS register.
• A canonical violation is detected in 64-bit mode during an operation that references memory using a stack
pointer register containing a non-canonical memory address.
Recovery from this fault is possible by either extending the limit of the stack segment (in the case of a limit viola-
tion) or loading the missing stack segment into memory (in the case of a not-present violation).
In the case of a canonical violation that was caused intentionally by software, recovery is possible by loading the
correct canonical value into RSP. Otherwise, a canonical violation of the address in RSP likely reflects some register
corruption in the software.

Exception Error Code


If the exception is caused by a not-present stack segment or by overflow of the new stack during an inter-privilege-
level call, the error code contains a segment selector for the segment that caused the exception. Here, the excep-
tion handler can test the present flag in the segment descriptor pointed to by the segment selector to determine
the cause of the exception. For a normal limit violation (on a stack segment already in use) the error code is set to
0.

Saved Instruction Pointer


The saved contents of CS and EIP registers generally point to the instruction that generated the exception.
However, when the exception results from attempting to load a not-present stack segment during a task switch, the
CS and EIP registers point to the first instruction of the new task.

Program State Change


A program-state change does not generally accompany a stack-fault exception, because the instruction that gener-
ated the fault is not executed. Here, the instruction can be restarted after the exception handler has corrected the
stack fault condition.
If a stack fault occurs during a task switch, it occurs after the commit-to-new-task point (see Section 9.3, “Task
Switching”). Here, the processor loads all the state information from the new TSS (without performing any addi-
tional limit, present, or type checks) before it generates the exception. The stack fault handler should thus not rely
on being able to use the segment selectors found in the CS, SS, DS, ES, FS, and GS registers without causing
another exception. The exception handler should check all segment registers before trying to resume the new
task; otherwise, general protection faults may result later under conditions that are more difficult to diagnose. (See
the Program State Change description for “Interrupt 10—Invalid TSS Exception (#TS)” in this chapter for additional
information on how to handle this situation.)


Interrupt 13—General Protection Exception (#GP)

Exception Class Fault.

Description
Indicates that the processor detected one of a class of protection violations called “general-protection violations.”
The conditions that cause this exception to be generated comprise all the protection violations that do not cause
other exceptions to be generated (such as invalid-TSS, segment-not-present, stack-fault, or page-fault excep-
tions). The following conditions cause general-protection exceptions to be generated:
• Exceeding the segment limit when accessing the CS, DS, ES, FS, or GS segments.
• Exceeding the segment limit when referencing a descriptor table (except during a task switch or a stack
switch).
• Transferring execution to a segment that is not executable.
• Writing to a code segment or a read-only data segment.
• Reading from an execute-only code segment.
• Loading the SS register with a segment selector for a read-only segment (unless the selector comes from a TSS
during a task switch, in which case an invalid-TSS exception occurs).
• Loading the SS, DS, ES, FS, or GS register with a segment selector for a system segment.
• Loading the DS, ES, FS, or GS register with a segment selector for an execute-only code segment.
• Loading the SS register with the segment selector of an executable segment or a null segment selector.
• Loading the CS register with a segment selector for a data segment or a null segment selector.
• Accessing memory using the DS, ES, FS, or GS register when it contains a null segment selector.
• Switching to a busy task during a call or jump to a TSS.
• Using a segment selector on a non-IRET task switch that points to a TSS descriptor in the current LDT. TSS
descriptors can only reside in the GDT. This condition causes a #TS exception during an IRET task switch.
• Violating any of the privilege rules described in Chapter 6, “Protection.”
• Exceeding the instruction length limit of 15 bytes (this can occur only when redundant prefixes are placed
before an instruction).
• Loading the CR0 register with a set PG flag (paging enabled) and a clear PE flag (protection disabled).
• Loading the CR0 register with a set NW flag and a clear CD flag.
• Referencing an entry in the IDT (following an interrupt or exception) that is not an interrupt, trap, or task gate.
• Attempting to access an interrupt or exception handler through an interrupt or trap gate from virtual-8086
mode when the handler’s code segment DPL is greater than 0.
• Attempting to write a 1 into a reserved bit of CR4.
• Attempting to execute a privileged instruction when the CPL is not equal to 0 (see Section 6.9, “Privileged
Instructions,” for a list of privileged instructions).
• Attempting to execute SGDT, SIDT, SLDT, SMSW, or STR when CR4.UMIP = 1 and the CPL is not equal to 0.
• Writing to a reserved bit in an MSR.
• Accessing a gate that contains a null segment selector.
• Executing the INT n instruction when the CPL is greater than the DPL of the referenced interrupt, trap, or task
gate.
• The segment selector in a call, interrupt, or trap gate does not point to a code segment.
• The segment selector operand in the LLDT instruction is a local type (TI flag is set) or does not point to a
segment descriptor of the LDT type.
• The segment selector operand in the LTR instruction is local or points to a TSS that is not available.
• The target code-segment selector for a call, jump, or return is null.
• If the PAE and/or PSE flag in control register CR4 is set and the processor detects any reserved bits in a page-
directory-pointer-table entry set to 1. These bits are checked during a write to control registers CR0, CR3, or
CR4 that causes a reloading of the page-directory-pointer-table entry.
• Attempting to write a non-zero value into the reserved bits of the MXCSR register.
• Executing an SSE/SSE2/SSE3 instruction that attempts to access a 128-bit memory location that is not aligned
on a 16-byte boundary when the instruction requires 16-byte alignment. This condition also applies to the stack
segment.
A program or task can be restarted following any general-protection exception. If the exception occurs while
attempting to call an interrupt handler, the interrupted program can be restartable, but the interrupt may be lost.

Exception Error Code


The processor pushes an error code onto the exception handler's stack. If the fault condition was detected while
loading a segment descriptor, the error code contains a segment selector for, or an IDT vector number of, the
descriptor; otherwise, the error code is 0. The source of the selector in an error code may be any of the following:
• An operand of the instruction.
• A selector from a gate which is the operand of the instruction.
• A selector from a TSS involved in a task switch.
• IDT vector number.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction that generated the exception.

Program State Change


In general, a program-state change does not accompany a general-protection exception, because the invalid
instruction or operation is not executed. An exception handler can be designed to correct all of the conditions that
cause general-protection exceptions and restart the program or task without any loss of program continuity.
If a general-protection exception occurs during a task switch, it can occur before or after the commit-to-new-task
point (see Section 9.3, “Task Switching”). If it occurs before the commit point, no program state change occurs. If
it occurs after the commit point, the processor will load all the state information from the new TSS (without
performing any additional limit, present, or type checks) before it generates the exception. The general-protection
exception handler should thus not rely on being able to use the segment selectors found in the CS, SS, DS, ES, FS,
and GS registers without causing another exception. (See the Program State Change description for “Interrupt
10—Invalid TSS Exception (#TS)” in this chapter for additional information on how to handle this situation.)

General Protection Exception in 64-bit Mode


The following conditions cause general-protection exceptions in 64-bit mode:
• If the memory address is in a non-canonical form.
• If a segment descriptor memory address is in non-canonical form.
• If the target offset in a destination operand of a CALL or JMP instruction is in a non-canonical form.
• If a code segment or 64-bit call gate overlaps non-canonical space.
• If the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit set and the
D-bit clear.
• If the EFLAGS.NT bit is set in IRET.
• If the stack segment selector of IRET is null when going back to compatibility mode.
• If the stack segment selector of IRET is null going back to CPL3 and 64-bit mode.
• If a null stack segment selector RPL of IRET is not equal to CPL going back to non-CPL3 and 64-bit mode.
• If the proposed new code segment descriptor of IRET has both the D-bit and the L-bit set.
• If the segment descriptor pointed to by the segment selector in the destination operand is a code segment and
it has both the D-bit and the L-bit set.
• If the segment descriptor from a 64-bit call gate is in non-canonical space.
• If the DPL from a 64-bit call-gate is less than the CPL or than the RPL of the 64-bit call-gate.
• If the type field of the upper 64 bits of a 64-bit call gate is not 0.
• If an attempt is made to load a null selector in the SS register in compatibility mode.
• If an attempt is made to load a null selector in the SS register in CPL3 and 64-bit mode.
• If an attempt is made to load a null selector in the SS register in non-CPL3 and 64-bit mode where RPL is not
equal to CPL.
• If an attempt is made to clear CR0.PG while IA-32e mode is enabled.
• If an attempt is made to set a reserved bit in CR3, CR4 or CR8.


Interrupt 14—Page-Fault Exception (#PF)

Exception Class Fault.

Description
Indicates that, with paging enabled (the PG flag in the CR0 register is set), the processor detected one of the
following conditions while using the page-translation mechanism to translate a linear address to a physical
address:
• The P (present) flag in a page-directory or page-table entry needed for the address translation is clear,
indicating that a page table or the page containing the operand is not present in physical memory.
• The procedure does not have sufficient privilege to access the indicated page (that is, a procedure running in
user mode attempts to access a supervisor-mode page). If the SMAP flag is set in CR4, a page fault may also
be triggered by code running in supervisor mode that tries to access data at a user-mode address. If either the
PKE flag or the PKS flag is set in CR4, the protection-key rights registers may cause page faults on data
accesses to linear addresses with certain protection keys.
• Code running in user mode attempts to write to a read-only page. If the WP flag is set in CR0, the page fault
will also be triggered by code running in supervisor mode that tries to write to a read-only page.
• An instruction fetch to a linear address that translates to a physical address in a memory page with the
execute-disable bit set (for information about the execute-disable bit, see Chapter 5, “Paging”). If the SMEP
flag is set in CR4, a page fault will also be triggered by code running in supervisor mode that tries to fetch an
instruction from a user-mode address.
• One or more reserved bits in a paging-structure entry are set to 1. See the description below of the RSVD error code flag.
• A shadow-stack access is made to a page that is not a shadow-stack page. See Section 18.2, “Shadow Stacks,”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, and Section 5.6, “Access
Rights.”
• An enclave access violates one of the specified access-control requirements. See Section 36.3, “Access-control
Requirements,” and Section 36.20, “Enclave Page Cache Map (EPCM),” in Chapter 36, “Enclave Access Control
and Data Structures.” In this case, the exception is called an SGX-induced page fault. The processor uses the
error code (below) to distinguish SGX-induced page faults from ordinary page faults.
The exception handler can recover from page-not-present conditions and restart the program or task without any
loss of program continuity. It can also restart the program or task after a privilege violation, but the problem that
caused the privilege violation may be uncorrectable.
See also: Section 5.7, “Page-Fault Exceptions.”

Exception Error Code


Yes (special format). The processor provides the page-fault handler with two items of information to aid in diag-
nosing the exception and recovering from it:
• An error code on the stack. The error code for a page fault has a format different from that for other exceptions
(see Figure 7-11). The processor establishes the bits in the error code as follows:
— P flag (bit 0).
This flag is 0 if there is no translation for the linear address because the P flag was 0 in one of the paging-
structure entries used to translate that address.
— W/R (bit 1).
If the access causing the page-fault exception was a write, this flag is 1; otherwise, it is 0. This flag
describes the access causing the page-fault exception, not the access rights specified by paging.
— U/S (bit 2).
If a user-mode access caused the page-fault exception, this flag is 1; it is 0 if a supervisor-mode access did
so. This flag describes the access causing the page-fault exception, not the access rights specified by
paging.
— RSVD flag (bit 3).
This flag is 1 if there is no translation for the linear address because a reserved bit was set in one of the
paging-structure entries used to translate that address.
— I/D flag (bit 4).
This flag is 1 if the access causing the page-fault exception was an instruction fetch. This flag describes the
access causing the page-fault exception, not the access rights specified by paging.
— PK flag (bit 5).
This flag is 1 if the access causing the page-fault exception was a data access to a linear address with a
protection key for which the protection-key rights registers disallow access.
— SS (bit 6).
If the access causing the page-fault exception was a shadow-stack access (including shadow-stack
accesses in enclave mode), this flag is 1; otherwise, it is 0. This flag describes the access causing the page-
fault exception, not the access rights specified by paging.
— HLAT (bit 7).
This flag is 1 if there is no translation for the linear address using HLAT paging because, in one of the
paging-structure entries used to translate that address, either the P flag was 0 or a reserved bit was set. An
error code will set this flag only if it clears bit 0 or sets bit 3. This flag will not be set by a page fault resulting
from a violation of access rights, nor for one encountered during ordinary paging, including the case in
which there has been a restart of HLAT paging.
— SGX flag (bit 15).
This flag is 1 if the exception is unrelated to paging and resulted from violation of SGX-specific access-
control requirements. Because such a violation can occur only if there is no ordinary page fault, this flag is
set only if the P flag (bit 0) is 1 and the RSVD flag (bit 3) and the PK flag (bit 5) are both 0.
See Section 5.6, “Access Rights,” and Section 5.7, “Page-Fault Exceptions,” for more information about page-
fault exceptions and the error codes that they produce.


The error code has the following format (bits 14:8 and 31:16 are reserved):

Bit 0    P
Bit 1    W/R
Bit 2    U/S
Bit 3    RSVD
Bit 4    I/D
Bit 5    PK
Bit 6    SS
Bit 7    HLAT
Bit 15   SGX

P     0  The fault was caused by a non-present page.
      1  The fault was caused by a page-level protection violation.
W/R   0  The access causing the fault was a read.
      1  The access causing the fault was a write.
U/S   0  A supervisor-mode access caused the fault.
      1  A user-mode access caused the fault.
RSVD  0  The fault was not caused by a reserved-bit violation.
      1  The fault was caused by a reserved bit set to 1 in some paging-structure entry.
I/D   0  The fault was not caused by an instruction fetch.
      1  The fault was caused by an instruction fetch.
PK    0  The fault was not caused by protection keys.
      1  There was a protection-key violation.
SS    0  The fault was not caused by a shadow-stack access.
      1  The fault was caused by a shadow-stack access.
HLAT  0  The fault occurred during ordinary paging or due to access rights.
      1  The fault occurred during HLAT paging.
SGX   0  The fault is not related to SGX.
      1  The fault resulted from violation of SGX-specific access-control requirements.

Figure 7-11. Page-Fault Error Code

• The contents of the CR2 register. The processor loads the CR2 register with the linear address that generated
the exception. If linear-address masking had been in effect (Section 4.4), the address recorded reflects the
result of that masking and does not contain any masked metadata. If the page-fault exception occurred during
execution of an instruction in enclave mode (and not during delivery of an event incident to enclave mode), bits
11:0 of the address are cleared.
The page-fault handler can use this address to locate the corresponding paging-structure entries. Another page
fault can potentially occur during execution of the page-fault handler; the handler should save the contents of
the CR2 register before a second page fault can occur.1 If a page fault is caused by a page-level protection
violation, the accessed flags in paging-structure entries may be set when the fault occurs (behavior is model-
specific and not architecturally defined).
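As an illustration of how a handler consumes this information, the following C sketch decodes the error-code bits of Figure 7-11 and captures CR2 first (a nested page fault would overwrite CR2, as the footnote below notes). The read_cr2(), demand_page_in(), and report_fatal_fault() helpers are hypothetical and stand in for OS-specific plumbing; this is not part of the architectural definition.

#include <stdint.h>

#define PF_P    (1ull << 0)   /* 1 = page-level protection violation */
#define PF_WR   (1ull << 1)   /* 1 = access was a write */
#define PF_US   (1ull << 2)   /* 1 = user-mode access */
#define PF_RSVD (1ull << 3)   /* 1 = reserved bit set in a paging-structure entry */
#define PF_ID   (1ull << 4)   /* 1 = instruction fetch */
#define PF_PK   (1ull << 5)   /* 1 = protection-key violation */
#define PF_SS   (1ull << 6)   /* 1 = shadow-stack access */
#define PF_HLAT (1ull << 7)   /* 1 = fault during HLAT paging */
#define PF_SGX  (1ull << 15)  /* 1 = SGX access-control violation */

extern uint64_t read_cr2(void);                           /* hypothetical helper */
extern void demand_page_in(uint64_t la, int write);       /* hypothetical helper */
extern void report_fatal_fault(uint64_t la, uint64_t ec); /* hypothetical helper */

void page_fault_handler(uint64_t error_code)
{
    uint64_t fault_la = read_cr2();   /* save CR2 before any nested fault */

    if (!(error_code & (PF_P | PF_RSVD))) {
        /* Not-present page: load it and restart the faulting instruction. */
        demand_page_in(fault_la, (error_code & PF_WR) != 0);
        return;
    }
    /* Protection, reserved-bit, protection-key, shadow-stack, and SGX
       violations generally cannot be corrected transparently. */
    report_fatal_fault(fault_la, error_code);
}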

Saved Instruction Pointer


The saved contents of CS and EIP registers generally point to the instruction that generated the exception. If the
page-fault exception occurred during a task switch, the CS and EIP registers may point to the first instruction of the
new task (as described in the following “Program State Change” section).

1. Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being deliv-
ered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These
updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault.


Program State Change


A program-state change does not normally accompany a page-fault exception, because the instruction that causes
the exception to be generated is not executed. After the page-fault exception handler has corrected the violation
(for example, loaded the missing page into memory), execution of the program or task can be resumed.
When a page-fault exception is generated during a task switch, the program state may change, as follows. During
a task switch, a page-fault exception can occur during any of the following operations:
• While writing the state of the original task into the TSS of that task.
• While reading the GDT to locate the TSS descriptor of the new task.
• While reading the TSS of the new task.
• While reading segment descriptors associated with segment selectors from the new task.
• While reading the LDT of the new task to verify the segment registers stored in the new TSS.
In the last two cases the exception occurs in the context of the new task. The instruction pointer refers to the first
instruction of the new task, not to the instruction which caused the task switch (or the last instruction to be
executed, in the case of an interrupt). If the design of the operating system permits page faults to occur during
task-switches, the page-fault handler should be called through a task gate.
If a page fault occurs during a task switch, the processor will load all the state information from the new TSS
(without performing any additional limit, present, or type checks) before it generates the exception. The page-fault
handler should thus not rely on being able to use the segment selectors found in the CS, SS, DS, ES, FS, and GS
registers without causing another exception. (See the Program State Change description for “Interrupt 10—Invalid
TSS Exception (#TS)” in this chapter for additional information on how to handle this situation.)

Additional Exception-Handling Information


Special care should be taken to ensure that an exception that occurs during an explicit stack switch does not cause
the processor to use an invalid stack pointer (SS:ESP). Software written for 16-bit IA-32 processors often uses a
pair of instructions to change to a new stack, for example:
MOV SS, AX
MOV SP, StackTop
When executing this code on one of the 32-bit IA-32 processors, it is possible to get a page fault, general-protec-
tion fault (#GP), or alignment check fault (#AC) after the segment selector has been loaded into the SS register
but before the ESP register has been loaded. At this point, the two parts of the stack pointer (SS and ESP) are
inconsistent. The new stack segment is being used with the old stack pointer.
The processor does not use the inconsistent stack pointer if the exception handler switches to a well defined stack
(that is, the handler is a task or a more privileged procedure). However, if the exception handler is called at the
same privilege level and from the same task, the processor will attempt to use the inconsistent stack pointer.
In systems that handle page-fault, general-protection, or alignment check exceptions within the faulting task (with
trap or interrupt gates), software executing at the same privilege level as the exception handler should initialize a
new stack by using the LSS instruction rather than a pair of MOV instructions, as described earlier in this note.
When the exception handler is running at privilege level 0 (the normal case), the problem is limited to procedures
or tasks that run at privilege level 0, typically the kernel of the operating system.
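For example, the two MOV instructions shown above can be replaced with a single LSS instruction, which loads SS and ESP together so that no fault can observe an inconsistent pair. The NewStackPointer symbol here is illustrative; it names a 48-bit far pointer in memory (a 32-bit offset followed by a 16-bit stack-segment selector):

LSS ESP, NewStackPointer ; SS and ESP are loaded by one instruction, so there
                         ; is no window in which the two are inconsistent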


Interrupt 16—x87 FPU Floating-Point Error (#MF)

Exception Class Fault.

Description
Indicates that the x87 FPU has detected a floating-point error. The NE flag in the register CR0 must be set for an
interrupt 16 (floating-point error exception) to be generated. (See Section 2.5, “Control Registers,” for a detailed
description of the NE flag.)

NOTE
SIMD floating-point exceptions (#XM) are signaled through interrupt 19.

While executing x87 FPU instructions, the x87 FPU detects and reports six types of floating-point error conditions:
• Invalid operation (#I)
— Stack overflow or underflow (#IS)
— Invalid arithmetic operation (#IA)
• Divide-by-zero (#Z)
• Denormalized operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (precision) (#P)
Each of these error conditions represents an x87 FPU exception type, and for each exception type, the x87 FPU
provides a flag in the x87 FPU status register and a mask bit in the x87 FPU control register. If the x87 FPU detects
a floating-point error and the mask bit for the exception type is set, the x87 FPU handles the exception automati-
cally by generating a predefined (default) response and continuing program execution. The default responses have
been designed to provide a reasonable result for most floating-point applications.
If the mask for the exception is clear and the NE flag in register CR0 is set, the x87 FPU does the following:
1. Sets the necessary flag in the FPU status register.
2. Waits until the next “waiting” x87 FPU instruction or WAIT/FWAIT instruction is encountered in the program’s
instruction stream.
3. Generates an internal error signal that causes the processor to generate a floating-point exception (#MF).
Prior to executing a waiting x87 FPU instruction or the WAIT/FWAIT instruction, the x87 FPU checks for pending x87
FPU floating-point exceptions (as described in step 2 above). Pending x87 FPU floating-point exceptions are
ignored for “non-waiting” x87 FPU instructions, which include the FNINIT, FNCLEX, FNSTSW, FNSTSW AX, FNSTCW,
FNSTENV, and FNSAVE instructions. Pending x87 FPU exceptions are also ignored when executing the state
management instructions FXSAVE and FXRSTOR.
All of the x87 FPU floating-point error conditions can be recovered from. The x87 FPU floating-point-error exception
handler can determine the error condition that caused the exception from the settings of the flags in the x87 FPU
status word. See “Software Exception Handling” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for more information on handling x87 FPU floating-point exceptions.

Exception Error Code


None. The x87 FPU provides its own error information.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the floating-point or WAIT/FWAIT instruction that was about to
be executed when the floating-point-error exception was generated. This is not the faulting instruction in which the
error condition was detected. The address of the faulting instruction is contained in the x87 FPU instruction pointer
register. See Section 8.1.8, “x87 FPU Instruction and Data (Operand) Pointers,” in Chapter 8 of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1, for more information about information the FPU saves
for use in handling floating-point-error exceptions.

Program State Change


A program-state change generally accompanies an x87 FPU floating-point exception because the handling of the
exception is delayed until the next waiting x87 FPU floating-point or WAIT/FWAIT instruction following the faulting
instruction. The x87 FPU, however, saves sufficient information about the error condition to allow recovery from the
error and re-execution of the faulting instruction if needed.
In situations where non-x87 FPU floating-point instructions depend on the results of an x87 FPU floating-point
instruction, a WAIT or FWAIT instruction can be inserted in front of a dependent instruction to force a pending x87
FPU floating-point exception to be handled before the dependent instruction is executed. See “x87 FPU Exception
Synchronization” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for
more information about synchronization of x87 floating-point-error exceptions.


Interrupt 17—Alignment Check Exception (#AC)

Exception Class Fault.

Description
Indicates that the processor detected an unaligned memory operand when alignment checking was enabled. Align-
ment checks are only carried out in data (or stack) accesses (not in code fetches or system segment accesses). An
example of an alignment-check violation is a word stored at an odd byte address, or a doubleword stored at an
address that is not an integer multiple of 4. Table 7-7 lists the alignment requirements for the various data types
recognized by the processor.

Table 7-7. Alignment Requirements by Data Type


Data Type Address Must Be Divisible By
Word 2
Doubleword 4
Single precision floating-point (32-bits) 4
Double precision floating-point (64-bits) 8
Double extended precision floating-point (80-bits) 8
Quadword 8
Double quadword 16
Segment Selector 2
32-bit Far Pointer 2
48-bit Far Pointer 4
32-bit Pointer 4
GDTR, IDTR, LDTR, or Task Register Contents 4
FSTENV/FLDENV Save Area 4 or 2, depending on operand size
FSAVE/FRSTOR Save Area 4 or 2, depending on operand size
Bit String 2 or 4 depending on the operand-size attribute.

Note that the alignment check exception (#AC) is generated only for data types that must be aligned on word,
doubleword, and quadword boundaries. A general-protection exception (#GP) is generated for 128-bit data types
that are not aligned on a 16-byte boundary.
To enable alignment checking, the following conditions must be true:
• AM flag in CR0 register is set.
• AC flag in the EFLAGS register is set.
• The CPL is 3 (including virtual-8086 mode).
Alignment-check exceptions (#AC) are generated only when operating at privilege level 3 (user mode). Memory
references that default to privilege level 0, such as segment descriptor loads, do not generate alignment-check
exceptions, even when caused by a memory reference made from privilege level 3.
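For illustration, code running at privilege level 3 can set the AC flag itself once the operating system has set CR0.AM. This is a minimal sketch assuming a GCC-style compiler in 64-bit mode; the function name is illustrative:

#include <stdint.h>

/* Set EFLAGS.AC (bit 18) so that unaligned data accesses made at CPL 3
   raise #AC, assuming the OS has already set CR0.AM. */
static inline void enable_alignment_checking(void)
{
    uint64_t rflags;
    __asm__ volatile ("pushfq\n\tpopq %0" : "=r"(rflags));
    rflags |= 1ull << 18;                   /* AC is EFLAGS bit 18 */
    __asm__ volatile ("pushq %0\n\tpopfq" : : "r"(rflags) : "cc");
}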
Storing the contents of the GDTR, IDTR, LDTR, or task register in memory while at privilege level 3 can generate
an alignment-check exception. Although application programs do not normally store these registers, the fault can
be avoided by aligning the information stored on an even word-address.
The FXSAVE/XSAVE and FXRSTOR/XRSTOR instructions save and restore a 512-byte data structure, the first byte
of which must be aligned on a 16-byte boundary. If the alignment-check exception (#AC) is enabled when
executing these instructions (and CPL is 3), a misaligned memory operand can cause either an alignment-check
exception or a general-protection exception (#GP) depending on the processor implementation (see “FXSAVE-Save
x87 FPU, MMX, SSE, and SSE2 State” and “FXRSTOR-Restore x87 FPU, MMX, SSE, and SSE2 State” in Chapter 3
of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A; see “XSAVE—Save Processor
Extended States” and “XRSTOR—Restore Processor Extended States” in Chapter 6 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 2D).
The MOVDQU, MOVUPS, and MOVUPD instructions perform 128-bit unaligned loads or stores. The LDDQU instruc-
tion loads 128-bit unaligned data. They do not generate general-protection exceptions (#GP) when operands are
not aligned on a 16-byte boundary. If alignment checking is enabled, alignment-check exceptions (#AC) may or
may not be generated depending on processor implementation when data addresses are not aligned on an 8-byte
boundary.
FSAVE and FRSTOR instructions can generate unaligned references, which can cause alignment-check faults.
These instructions are rarely needed by application programs.

Exception Error Code


Yes. The error code is null; all bits are clear except possibly bit 0 — EXT; see Section 7.13. EXT is set if the #AC is
recognized during delivery of an event other than a software interrupt (see “INT n/INTO/INT3/INT1—Call to Inter-
rupt Procedure” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction that generated the exception.

Program State Change


A program-state change does not accompany an alignment-check fault, because the instruction is not executed.


Interrupt 18—Machine-Check Exception (#MC)

Exception Class Abort.

Description
Indicates that the processor detected an internal machine error or a bus error, or that an external agent detected
a bus error. The machine-check exception is model-specific, available on the Pentium and later generations of
processors. The implementation of the machine-check exception is different between different processor families,
and these implementations may not be compatible with future Intel 64 or IA-32 processors. (Use the CPUID
instruction to determine whether this feature is present.)
Bus errors detected by external agents are signaled to the processor on dedicated pins: the BINIT# and MCERR#
pins on the Pentium 4, Intel Xeon, and P6 family processors and the BUSCHK# pin on the Pentium processor. When
one of these pins is enabled, asserting the pin causes error information to be loaded into machine-check registers
and a machine-check exception is generated.
The machine-check exception and machine-check architecture are discussed in detail in Chapter 17, “Machine-
Check Architecture.” Also, see the data books for the individual processors for processor-specific hardware infor-
mation.
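For example, the relevant enumeration bits can be checked with CPUID leaf 01H: EDX bit 7 enumerates the machine-check exception (MCE), and EDX bit 14 the machine-check architecture (MCA). The sketch below uses the __get_cpuid helper from GCC-compatible compilers' <cpuid.h>:

#include <cpuid.h>

/* Returns nonzero if CPUID.01H:EDX enumerates both MCE (bit 7) and
   the machine-check architecture, MCA (bit 14). */
int cpu_has_machine_check(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((edx >> 7) & 1) && ((edx >> 14) & 1);
}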

Exception Error Code


None. Error information is provided by machine-check MSRs.

Saved Instruction Pointer


For the Pentium 4 and Intel Xeon processors, the saved contents of extended machine-check state registers are
directly associated with the error that caused the machine-check exception to be generated (see Section 17.3.1.2,
“IA32_MCG_STATUS MSR,” and Section 17.3.2.6, “IA32_MCG Extended Machine Check State MSRs”).
For the P6 family processors, if the EIPV flag in the MCG_STATUS MSR is set, the saved contents of CS and EIP
registers are directly associated with the error that caused the machine-check exception to be generated; if the flag
is clear, the saved instruction pointer may not be associated with the error (see Section 17.3.1.2, “IA32_MC-
G_STATUS MSR”).
For the Pentium processor, contents of the CS and EIP registers may not be associated with the error.

Program State Change


The machine-check mechanism is enabled by setting the MCE flag in control register CR4.
For the Pentium 4, Intel Xeon, P6 family, and Pentium processors, a program-state change always accompanies a
machine-check exception, and an abort class exception is generated. For abort exceptions, information about the
exception can be collected from the machine-check MSRs, but the program cannot generally be restarted.
If the machine-check mechanism is not enabled (the MCE flag in control register CR4 is clear), a machine-check
exception causes the processor to enter the shutdown state.


Interrupt 19—SIMD Floating-Point Exception (#XM)

Exception Class Fault.

Description
Indicates the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. The appropriate status
flag in the MXCSR register must be set and the particular exception unmasked for this interrupt to be generated.
There are six classes of numeric exception conditions that can occur while executing an SSE/SSE2/SSE3 SIMD
floating-point instruction:
• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormal operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)
The invalid operation, divide-by-zero, and denormal-operand exceptions are pre-computation exceptions; that is,
they are detected before any arithmetic operation occurs. The numeric underflow, numeric overflow, and inexact
result exceptions are post-computational exceptions.
See “SIMD Floating-Point Exceptions” in Chapter 11 of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1, for additional information about the SIMD floating-point exception classes.
When a SIMD floating-point exception occurs, the processor does either of the following things:
• It handles the exception automatically by producing the most reasonable result and allowing program
execution to continue undisturbed. This is the response to masked exceptions.
• It generates a SIMD floating-point exception, which in turn invokes a software exception handler. This is the
response to unmasked exceptions.
Each of the six SIMD floating-point exception conditions has a corresponding flag bit and mask bit in the MXCSR
register. If an exception is masked (the corresponding mask bit in the MXCSR register is set), the processor takes
an appropriate automatic default action and continues with the computation. If the exception is unmasked (the
corresponding mask bit is clear) and the operating system supports SIMD floating-point exceptions (the OSXM-
MEXCPT flag in control register CR4 is set), a software exception handler is invoked through a SIMD floating-point
exception. If the exception is unmasked and the OSXMMEXCPT bit is clear (indicating that the operating system
does not support unmasked SIMD floating-point exceptions), an invalid opcode exception (#UD) is signaled instead
of a SIMD floating-point exception.
Note that because SIMD floating-point exceptions are precise and occur immediately, the situation does not arise
where an x87 FPU instruction, a WAIT/FWAIT instruction, or another SSE/SSE2/SSE3 instruction will catch a
pending unmasked SIMD floating-point exception.
If a SIMD floating-point exception occurred while the SIMD floating-point exceptions were masked (causing the
corresponding exception flag to be set) and the exception was subsequently unmasked, no exception is generated
when the exception is unmasked.
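As a short illustration of these masking rules, the following C sketch uses the MXCSR intrinsics from <xmmintrin.h> to unmask the divide-by-zero exception. Clearing the stale ZE status flag first matters because, as just described, a flag set while the exception was masked does not generate #XM once the exception is unmasked. The function name is illustrative:

#include <xmmintrin.h>

void unmask_simd_divide_by_zero(void)
{
    unsigned int mxcsr = _mm_getcsr();
    mxcsr &= ~(1u << 2);   /* clear the stale ZE status flag (MXCSR bit 2) */
    mxcsr &= ~(1u << 9);   /* clear the ZM mask bit (MXCSR bit 9) to unmask */
    _mm_setcsr(mxcsr);     /* #XM delivery also requires CR4.OSXMMEXCPT = 1 */
}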
When SSE/SSE2/SSE3 SIMD floating-point instructions operate on packed operands (made up of two or four sub-
operands), multiple SIMD floating-point exception conditions may be detected. If no more than one exception
condition is detected for one or more sets of sub-operands, the exception flags are set for each exception condition
detected. For example, an invalid exception detected for one sub-operand will not prevent the reporting of a divide-
by-zero exception for another sub-operand. However, when two or more exception conditions are generated for
one sub-operand, only one exception condition is reported, according to the precedences shown in Table 7-8. This
exception precedence sometimes results in the higher priority exception condition being reported and the lower
priority exception conditions being ignored.


Table 7-8. SIMD Floating-Point Exceptions Priority


Priority Description
1 (Highest) Invalid operation exception due to SNaN operand (or any NaN operand for maximum, minimum, or certain compare and
convert operations).
2 QNaN operand1.
3 Any other invalid operation exception not mentioned above or a divide-by-zero exception2.
4 Denormal operand exception2.
5 Numeric overflow and underflow exceptions possibly in conjunction with the inexact result exception2.
6 (Lowest) Inexact result exception.
NOTES:
1. Though a QNaN operand is not itself an exception, the handling of a QNaN operand has precedence over lower priority exceptions. For
example, a QNaN divided by zero results in a QNaN, not a divide-by-zero exception.
2. If masked, then instruction execution continues, and a lower priority exception can occur as well.

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the SSE/SSE2/SSE3 instruction that was executed when the
SIMD floating-point exception was generated. This is the faulting instruction in which the error condition was
detected.

Program State Change


A program-state change does not accompany a SIMD floating-point exception because the handling of the excep-
tion is immediate unless the particular exception is masked. The available state information is often sufficient to
allow recovery from the error and re-execution of the faulting instruction if needed.


Interrupt 20—Virtualization Exception (#VE)

Exception Class Fault.

Description
Indicates that the processor detected an EPT violation in VMX non-root operation. Not all EPT violations cause
virtualization exceptions. See Section 27.5.7.2 for details.
The exception handler can recover from EPT violations and restart the program or task without any loss of program
continuity. In some cases, however, the problem that caused the EPT violation may be uncorrectable.

Exception Error Code


None.

Saved Instruction Pointer


The saved contents of CS and EIP registers generally point to the instruction that generated the exception.

Program State Change


A program-state change does not normally accompany a virtualization exception, because the instruction that
causes the exception to be generated is not executed. After the virtualization exception handler has corrected the
violation (for example, by executing the EPTP-switching VM function), execution of the program or task can be
resumed.

Additional Exception-Handling Information


The processor saves information about virtualization exceptions in the virtualization-exception information area.
See Section 27.5.7.2 for details.


Interrupt 21—Control Protection Exception (#CP)

Exception Class Fault.

Description
Indicates that a control flow transfer attempt violated the control flow enforcement technology constraints.

Exception Error Code


Yes (special format). The processor provides the control protection exception handler with the following information
through the error code on the stack.

The error code has the following format (bits 31:16 are reserved):

Bits 14:0   CPEC
Bit 15      ENCL

Figure 7-12. Exception Error Code Information

• Bits 14:0 - CPEC:
— 1 - NEAR-RET: Indicates the #CP was caused by a near RET instruction.
— 2 - FAR-RET/IRET: Indicates the #CP was caused by a FAR RET or IRET instruction.
— 3 - ENDBRANCH: Indicates the #CP was due to a missing ENDBRANCH at the target of an indirect call or jump
instruction.
— 4 - RSTORSSP: Indicates the #CP was caused by a shadow-stack-restore token check failure in the
RSTORSSP instruction.
— 5 - SETSSBSY: Indicates the #CP was caused by a supervisor shadow-stack token check failure in the SETSSBSY
instruction.
• Bit 15 (ENCL) of the error code, if set to 1, indicates the #CP occurred during enclave execution.
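For illustration, a #CP handler might split the error code of Figure 7-12 as follows; the function and the case stubs are hypothetical:

#include <stdint.h>

void decode_cp_error_code(uint64_t error_code)
{
    uint16_t cpec = error_code & 0x7FFF;        /* bits 14:0 (Figure 7-12) */
    int in_enclave = (error_code >> 15) & 1;    /* bit 15: ENCL */

    switch (cpec) {
    case 1: /* NEAR-RET */       break;
    case 2: /* FAR-RET/IRET */   break;
    case 3: /* ENDBRANCH */      break;
    case 4: /* RSTORSSP */       break;
    case 5: /* SETSSBSY */       break;
    }
    (void)in_enclave;
}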

Saved Instruction Pointer


The saved contents of the CS and EIP registers generally point to the instruction that generated the exception.

Program State Change


A program-state change does not normally accompany a control protection exception, because the instruction that
causes the exception to be generated is not executed.
When a control protection exception is generated during a task switch, the program state may change as follows.
During a task switch, a control protection exception can occur during any of the following operations:
• If task switch is initiated by IRET, CS and LIP stored on old task shadow stack do not match CS and LIP of new
task (where LIP is the linear address of the return address).
• If task switch is initiated by IRET and SSP of new task loaded from shadow stack of old task (if new task CPL is
< 3), OR the SSP from IA32_PL3_SSP (if new task CPL = 3) is not aligned to 4 bytes or is a value beyond 4GB.
In these cases the exception occurs in the context of the new task. The instruction pointer refers to the first instruc-
tion of the new task, not to the instruction which caused the task switch (or the last instruction to be executed, in
the case of an interrupt). If the design of the operating system permits control protection faults to occur during
task-switches, the control protection fault handler should be called through a task gate.


Interrupts 32 to 255—User Defined Interrupts

Exception Class Not applicable.

Description
Indicates that the processor did one of the following things:
• Executed an INT n instruction where the instruction operand is one of the vector numbers from 32 through 255.
• Responded to an interrupt request at the INTR pin or from the local APIC when the interrupt vector number
associated with the request is from 32 through 255.

Exception Error Code


Not applicable.

Saved Instruction Pointer


The saved contents of CS and EIP registers point to the instruction that follows the INT n instruction or to the
instruction following the instruction on which the INTR signal occurred.

Program State Change


A program-state change does not accompany interrupts generated by the INT n instruction or the INTR signal. The
INT n instruction generates the interrupt within the instruction stream. When the processor receives an INTR
signal, it commits all state changes for all previous instructions before it responds to the interrupt; so, program
execution can resume upon returning from the interrupt handler.

16. Updates to Chapter 8, Volume 3A
Change bars and violet text show changes to Chapter 8 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Added the new Section 8.7, “Flexible Updates of UIF.”


CHAPTER 8
USER INTERRUPTS

8.1 INTRODUCTION
This chapter provides details of an architectural feature called user interrupts.
This feature defines user interrupts as new events in the architecture. User interrupts are delivered to software
operating in 64-bit mode with CPL = 3 without any change to segmentation state. An individual user interrupt is
identified by a 6-bit user-interrupt vector, which is pushed on the stack as part of user-interrupt delivery. The
UIRET (user-interrupt return) instruction reverses user-interrupt delivery.
System software configures the user-interrupt architecture with MSRs. An operating system (OS) may update the
content of some of these MSRs when switching between OS-managed threads.
One of these MSRs references a data structure called the user posted-interrupt descriptor (UPID). User inter-
rupts for an OS-managed thread can be posted in the UPID associated with that thread. Such user interrupts will
be delivered after receipt of an ordinary interrupt (identified in the UPID) called a user-interrupt notification.1
System software can define operations to post user interrupts and to send user-interrupt notifications. In addition,
the user-interrupt feature defines the SENDUIPI instruction, by which application software can send interprocessor
user interrupts (user IPIs). An execution of SENDUIPI posts a user interrupt in a UPID and may send a user-inter-
rupt notification.
(Platforms may include mechanisms to process external interrupts as either ordinary interrupts or user interrupts.
Those processed as user interrupts would be posted in UPIDs and may result in user-interrupt notifications.
Specifics of such mechanisms are outside of the scope of this manual.)
Section 8.2 explains how a processor enumerates support for user interrupts and how they are enabled by system
software. Section 8.3 identifies the new processor state defined for user interrupts. Section 8.4 explains how a
processor identifies and delivers user interrupts. Section 8.5 describes how a processor identifies and processes
user-interrupt notifications. Section 8.6 enumerates new instructions that support management of user interrupts.
Section 8.8 defines new support for user inter-processor interrupts (user IPIs).

8.2 ENUMERATION AND ENABLING


Software enables user interrupts by setting bit 25 (UINTR) in control register CR4. Setting CR4.UINTR enables
user-interrupt delivery (Section 8.4.2), user-interrupt notification identification (Section 8.5.1), and the user-inter-
rupt instructions (Section 8.6). It does not affect the accessibility of the user-interrupt MSRs (Section 8.3) by
RDMSR, WRMSR, or the XSAVE feature set.
Processor support for user interrupts is enumerated by CPUID.(EAX=7,ECX=0):EDX[5]. If this bit is set, software
can set CR4.UINTR to 1 and can access the user-interrupt MSRs using RDMSR and WRMSR (see Section 8.3).
The user-interrupt feature is XSAVE-managed (see Section 13.5). This implies that aspects of the feature are
enumerated as part of enumeration of the XSAVE feature set. See Section 13.5.11 in the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 1, for details.
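For example, with a GCC-compatible compiler, the enumeration bit can be checked as follows (the __get_cpuid_count helper comes from the compiler's <cpuid.h>). Note that enumeration alone does not mean the OS has set CR4.UINTR:

#include <cpuid.h>

/* Returns nonzero if CPUID.(EAX=7,ECX=0):EDX[5] enumerates user interrupts. */
int cpu_has_uintr(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 5) & 1;
}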

8.3 USER-INTERRUPT STATE AND USER-INTERRUPT MSRS


The user-interrupt architecture defines the following new state. Some of this state can be accessed via the RDMSR
and WRMSR instructions (through new user-interrupt MSRs detailed in Section 8.3.2) and some can be accessed
using instructions described in Section 8.6.

1. For clarity, this chapter uses the term ordinary interrupts to refer to those events in the existing interrupt architecture, which are
typically delivered to system software operating with CPL = 0.


8.3.1 User-Interrupt State


The following are the elements of the user-interrupt state (listed here independent of how they are accessed):
• UIRR: user-interrupt request register.
This value includes one bit for each of the 64 user-interrupt vectors. If UIRR[i] = 1, a user interrupt with
vector i is requesting service. The notation UIRRV is used to refer to the position of the most significant bit
set in UIRR; if UIRR = 0, UIRRV = 0.
• UIF: user-interrupt flag.
If UIF = 0, user-interrupt delivery is blocked; if UIF = 1, user interrupts may be delivered. User-interrupt
delivery clears UIF, and the new UIRET instruction sets it. Section 8.6 defines other instructions for accessing
UIF and Section 8.7 describes an enhancement that allows UIRET to maintain UIF as 0.
• UIHANDLER: user-interrupt handler.
This is the linear address of the user-interrupt handler. User-interrupt delivery loads this address into RIP.
• UISTACKADJUST: user-interrupt stack adjustment.
This value controls adjustment to the stack pointer (RSP) prior to user-interrupt delivery. It can be configured
to load RSP with an alternate stack pointer or configured to prevent user-interrupt delivery from overwriting
data above the current stack top.
The value UISTACKADJUST must be canonical. If bit 0 is 1, user-interrupt delivery loads RSP with UISTACK-
ADJUST; otherwise, it subtracts UISTACKADJUST from RSP. Either way, user-interrupt delivery then aligns
RSP to a 16-byte boundary. See Section 8.4.2 for details.
• UINV: user-interrupt notification vector.
This is the vector of the ordinary interrupts that are treated as user-interrupt notifications (Section 8.5.1).
When the logical processor receives a user-interrupt notification, it processes the user interrupts in the user
posted-interrupt descriptor (UPID) referenced by UPIDADDR (see below and Section 8.5.2).
• UPIDADDR: user posted-interrupt descriptor address.
This is the linear address of the UPID that the logical processor consults upon receiving an ordinary interrupt
with vector UINV.
• UITTADDR: user-interrupt target table address.
This is the linear address of the user-interrupt target table (UITT), which the logical processor consults when
software executes the SENDUIPI instruction (see Section 8.8).
• UITTSZ: user-interrupt target table size.
This value is the highest index of a valid entry in the UITT (see Section 8.8).

8.3.2 User-Interrupt MSRs


Some of the state elements identified in Section 8.3.1 can be accessed as user-interrupt MSRs using the RDMSR
and WRMSR instructions:
• IA32_UINTR_RR MSR (MSR address 985H). This MSR is an interface to UIRR (64 bits).
Following a WRMSR to this MSR, the logical processor recognizes a pending user interrupt if and only if some bit
is set in the MSR.
• IA32_UINTR_HANDLER MSR (MSR address 986H). This MSR is an interface to the UIHANDLER address. This is
a linear address that must be canonical relative to the maximum linear-address width supported by the
processor.1 WRMSR to this MSR causes a general-protection fault (#GP) if its source operand does not meet
this requirement.
• IA32_UINTR_STACKADJUST MSR (MSR address 987H). This MSR is an interface to the UISTACKADJUST value.
This value includes a linear address that must be canonical relative to the maximum linear-address width
supported by the processor. WRMSR to this MSR causes a #GP if its source operand does not meet this
requirement.

1. CPUID.80000008H:EAX[15:8] enumerates the maximum linear-address width supported by the processor.


Bit 0 of this MSR corresponds to UISTACKADJUST[0], which controls how user-interrupt delivery updates the
stack pointer. WRMSR may set it to either 0 or 1.
• IA32_UINTR_MISC MSR (MSR address 988H). This MSR is an interface to the UITTSZ and UINV values. The
MSR has the following format:
— Bits 31:0 are UITTSZ.
— Bits 39:32 are UINV.
— Bits 63:40 are reserved. WRMSR causes a #GP if it would set any of those bits (if
EDX[31:8] ≠ 000000H).
Because this MSR will share an 8-byte portion of the XSAVE area with UIF (see Section 13.5.11 of Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 1), bit 63 of the MSR will never be used and
will always be reserved.
• IA32_UINTR_PD MSR (MSR address 989H). This MSR is an interface to the UPIDADDR address. This is a linear
address that must be canonical relative to the maximum linear-address width supported by the processor.
WRMSR to this MSR causes a #GP if its source operand does not meet this requirement.
Bits 5:0 of this MSR are reserved. WRMSR causes a #GP if it would set any of those bits (if
EAX[5:0] ≠ 000000b).
• IA32_UINTR_TT MSR (MSR address 98AH). This MSR is an interface to the UITTADDR address (in addition, bit
0 enables SENDUIPI).
Bits 63:4 of this MSR hold the current value of UITTADDR. This is a linear address that must be canonical relative
to the maximum linear-address width supported by the processor. WRMSR to this MSR causes a #GP if its
source operand does not meet this requirement.
Bits 3:1 of this MSR are reserved. WRMSR causes a #GP if it would set any of those bits (if EAX[3:1] ≠ 000b).
Bit 0 of this MSR determines whether the SENDUIPI instruction is enabled. WRMSR may set it to either 0 or 1.
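Putting the descriptions above together, a minimal kernel-side configuration for one OS-managed thread might look like the sketch below. The wrmsr() helper and the address arguments are assumptions supplied by the OS; only the MSR indices and field layouts come from this section:

#include <stdint.h>

#define IA32_UINTR_RR           0x985
#define IA32_UINTR_HANDLER      0x986
#define IA32_UINTR_STACKADJUST  0x987
#define IA32_UINTR_MISC         0x988
#define IA32_UINTR_PD           0x989

extern void wrmsr(uint32_t msr, uint64_t value);  /* assumed kernel helper */

/* handler_la: canonical linear address of the user-interrupt handler.
   upid_la:    canonical, 64-byte-aligned linear address of the thread's
               UPID (bits 5:0 of IA32_UINTR_PD are reserved).
   uinv:       ordinary-interrupt vector used for notifications. */
void uintr_configure_thread(uint64_t handler_la, uint64_t upid_la, uint8_t uinv)
{
    wrmsr(IA32_UINTR_HANDLER, handler_la);
    wrmsr(IA32_UINTR_STACKADJUST, 128);  /* bit 0 = 0: subtract 128 bytes
                                            from RSP before delivery */
    wrmsr(IA32_UINTR_PD, upid_la);
    wrmsr(IA32_UINTR_MISC, (uint64_t)uinv << 32);  /* UINV in bits 39:32;
                                                      UITTSZ (bits 31:0) = 0 */
    wrmsr(IA32_UINTR_RR, 0);             /* no user interrupts pending yet */
}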

8.4 EVALUATION AND DELIVERY OF USER INTERRUPTS


A processor determines whether there is a user interrupt to deliver based on UIRR. Section 8.4.1 describes this
recognition of pending user interrupts. Once a logical processor has recognized a pending user interrupt, it will
deliver it on a subsequent instruction boundary by causing a control-flow change asynchronous to software execu-
tion. Section 8.4.2 details this process of user-interrupt delivery.

8.4.1 User-Interrupt Recognition


There is a user interrupt pending whenever UIRR ≠ 0.
Any instruction or operation that modifies UIRR updates the logical processor’s recognition of a pending user inter-
rupt. The following instructions and operations may do this:
• WRMSR to the IA32_UINTR_RR MSR (Section 8.3).
• XRSTORS of the user-interrupt state component.
• User-interrupt delivery (Section 8.4.2).
• User-interrupt notification processing (Section 8.5.2).
• VMX transitions that load the IA32_UINTR_RR MSR.
Each of these instructions or operations results in recognition of a pending user interrupt if it completes with
UIRR ≠ 0; if it completes with UIRR = 0, no pending user interrupt is recognized.
Once recognized, a pending user interrupt may be delivered to software; see Section 8.4.2.


8.4.2 User-Interrupt Delivery


If CR4.UINTR = 1 and a user interrupt has been recognized (see Section 8.4.1), it will be delivered at an instruction
boundary when the following conditions all hold: (1) UIF = 1; (2) there is no blocking by MOV SS or by POP SS1;
(3) CPL = 3; (4) IA32_EFER.LMA = CS.L = 1 (the logical processor is in 64-bit mode); and (5) software is not
executing inside an enclave.
User-interrupt delivery has priority just below that of ordinary interrupts. It wakes a logical processor from the
states entered using the TPAUSE and UMWAIT instructions2; it does not wake a logical processor in the shutdown
state or in the wait-for-SIPI state.
User-interrupt delivery does not change CPL (it occurs entirely with CPL = 3). The following pseudocode details the
behavior of user-interrupt delivery:

IF UIHANDLER is not canonical in current paging mode
    THEN #GP(0);
FI;
holdRSP := RSP;
IF UISTACKADJUST[0] = 1
THEN RSP := UISTACKADJUST;
ELSE RSP := RSP – UISTACKADJUST;
FI;
RSP := RSP & ~FH; // force the stack to be 16-byte aligned
Push holdRSP;
Push RFLAGS;
Push RIP;
Push UIRRV; // 64-bit push; upper 58 bits pushed as 0
IF shadow stack is enabled for CPL = 3
THEN ShadowStackPush RIP;
FI;
IF end-branch is enabled for CPL = 3
THEN IA32_U_CET.TRACKER := WAIT_FOR_ENDBRANCH;
FI;
UIRR[Vector] := 0;
IF UIRR = 0
THEN cease recognition of any pending user interrupt;
FI;
UIF := 0;
RFLAGS.TF := 0;
RFLAGS.RF := 0;
RIP := UIHANDLER;

If UISTACKADJUST[0] = 0, user-interrupt delivery decrements RSP by UISTACKADJUST; otherwise, it loads RSP
with UISTACKADJUST. In either case, user-interrupt delivery then aligns RSP to a 16-byte boundary by clearing
RSP[3:0].
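Reading the pushes in the pseudocode from the handler's point of view gives the following in-memory frame (the structure name is illustrative). At handler entry, RSP points at the vector, and the saved RFLAGS image holds the pre-delivery TF and RF values because those flags are cleared only after the push:

#include <stdint.h>

/* Frame built by user-interrupt delivery, lowest address first. */
struct uintr_frame {
    uint64_t uirrv;    /* delivered vector; upper 58 bits pushed as 0 */
    uint64_t rip;      /* return instruction pointer, part of the state
                          restored by UIRET */
    uint64_t rflags;   /* RFLAGS image from before TF and RF were cleared */
    uint64_t rsp;      /* holdRSP: the stack pointer before adjustment */
};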
User-interrupt delivery that occurs during transactional execution causes transactional execution to abort and a
transition to a non-transactional execution. The transactional abort loads EAX as it would have had the abort been due to an
ordinary interrupt. User-interrupt delivery occurs after the transactional abort.

1. Execution of the STI instruction does not block delivery of user interrupts for one instruction as it does ordinary interrupts. If a user
interrupt is delivered immediately following execution of a STI instruction, ordinary interrupts are not blocked after delivery of the
user interrupt.
2. User-interrupt delivery occurs only if CPL = 3. Since the HLT and MWAIT instructions can be executed only if CPL = 0, a user inter-
rupt can never be delivered when a logical processor is in an activity state that was entered using one of those instructions.


The stack accesses performed by user-interrupt delivery may incur faults (page faults, or stack faults due to
canonicality violations). Before such a fault is delivered, RSP is restored to its original value (memory locations
above the top of the stack may have been written). If such a fault produces an error code that uses the EXT bit,
that bit will be cleared to 0.
If a fault occurs during user-interrupt delivery, UIRR is not updated and UIF is not cleared and, as a result, the
logical processor continues to recognize that a user interrupt is pending, and user-interrupt delivery will normally
recur after the fault is handled.
If the shadow-stack feature of control-flow enforcement technology (CET) is enabled for CPL = 3, user-interrupt
delivery pushes the return instruction pointer on the shadow stack. If the indirect-branch-tracking feature of CET is
enabled, user-interrupt delivery transitions the indirect branch tracker to the WAIT_FOR_ENDBRANCH state; an
ENDBR64 instruction is expected as the first instruction of the user-interrupt handler.
User-interrupt delivery can be tracked by Architectural Last Branch Records (LBRs), Intel® Processor Trace (Intel®
PT), and Performance Monitoring. For both Intel PT and LBRs, user-interrupt delivery is recorded in precisely the
same manner as ordinary interrupt delivery. Hence for LBRs, user interrupts fall into the OTHER_BRANCH category,
which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22] must be set to record user-interrupt delivery, and that
the IA32_LBR_x_INFO.BR_TYPE field will indicate OTHER_BRANCH for any recorded user interrupt. For Intel PT,
control flow tracing must be enabled by setting IA32_RTIT_CTL.BranchEn[bit 13].
User-interrupt delivery will also increment performance counters for which counting
BR_INST_RETIRED.FAR_BRANCH is enabled. Some implementations may have dedicated events for counting
user-interrupt delivery; see processor-specific event lists at https://download.01.org/perfmon/index/.

8.5 USER-INTERRUPT NOTIFICATION IDENTIFICATION AND PROCESSING


User-interrupt posting is the process by which a platform agent (or software operating on a CPU) records user
interrupts in a user posted-interrupt descriptor (UPID) in memory. The platform agent (or software) may send
an ordinary interrupt (called a user-interrupt notification) to the logical processor on which the target of the
user interrupt is operating.
Table 8-1 gives the format of a UPID.
Table 8-1. Format of User Posted-Interrupt Descriptor — UPID
• Bit 0: Outstanding notification. If this bit is set, there is a notification outstanding for one or more user
interrupts in PIR.
• Bit 1: Suppress notification. If this bit is set, agents (including SENDUIPI) should not send notifications when
posting user interrupts in this descriptor.
• Bits 15:2: Reserved. User-interrupt notification processing ignores these bits; they must be zero for SENDUIPI.
• Bits 23:16: Notification vector. Used by agents sending user-interrupt notifications (including SENDUIPI).
• Bits 31:24: Reserved. User-interrupt notification processing ignores these bits; they must be zero for SENDUIPI.
• Bits 63:32: Notification destination. Target physical APIC ID, used by SENDUIPI. In xAPIC mode, bits 47:40 are
the 8-bit APIC ID; in x2APIC mode, the entire field forms the 32-bit APIC ID.
• Bits 127:64: Posted-interrupt requests (PIR). One bit for each user-interrupt vector; there is a user-interrupt
request for a vector if the corresponding bit is 1.

The notation PIR (posted-interrupt requests) refers to the 64 posted-interrupt requests in a UPID.
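For illustration, one possible C representation of a UPID consistent with the format above is sketched below; the
struct and macro names are assumptions for this example, not architectural definitions.

    #include <stdint.h>

    /* Sketch of a UPID layout per Table 8-1. The descriptor must be
       64-byte aligned when referenced from a UITT entry (Section 8.8). */
    struct upid {
        volatile uint64_t notification;  /* bit 0: ON; bit 1: SN; bits 23:16:
                                            notification vector; bits 63:32:
                                            notification destination */
        volatile uint64_t pir;           /* bits 127:64: one request bit per
                                            user-interrupt vector */
    } __attribute__((aligned(64)));

    #define UPID_ON  (1ULL << 0)         /* outstanding notification */
    #define UPID_SN  (1ULL << 1)         /* suppress notification */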
If an ordinary interrupt arrives while CR4.UINTR = IA32_EFER.LMA = 1, the logical processor determines whether
the interrupt is a user-interrupt notification. This process is called user-interrupt notification identification
and is described in Section 8.5.1.
Once a logical processor has identified a user-interrupt notification, it copies user interrupts in the UPID’s PIR into
UIRR. This process is called user-interrupt notification processing and is described in Section 8.5.2.


A logical processor is not interruptible during either user-interrupt notification identification or user-interrupt noti-
fication processing or between those operations (when they occur in succession).

8.5.1 User-Interrupt Notification Identification


If CR4.UINTR = IA32_EFER.LMA = 1, a logical processor performs user-interrupt notification identification when it
receives an ordinary interrupt. The following algorithm describes the response by the processor to an ordinary
maskable interrupt when CR4.UINTR = IA32_EFER.LMA = 1 (see footnote 1):
1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, V.
2. If V = UINV, the logical processor continues to the next step. Otherwise, an interrupt with vector V is delivered
normally through the IDT; the remainder of this algorithm does not apply and user-interrupt notification
processing does not occur.
3. The processor writes zero to the EOI register in the local APIC; this dismisses the interrupt with vector V = UINV
from the local APIC.
User-interrupt notification identification involves acknowledgment of the local APIC and thus occurs only when
ordinary interrupts are not masked.
If user-interrupt notification identification completes step #3, the logical processor then performs user-interrupt
notification processing as described in Section 8.5.2.
An ordinary interrupt that occurs during transactional execution causes the transactional execution to abort and
transition to a non-transactional execution. This occurs before user-interrupt notification identification.
An ordinary interrupt that occurs while software is executing inside an enclave causes an asynchronous enclave
exit (AEX). This AEX occurs before user-interrupt notification identification.

8.5.2 User-Interrupt Notification Processing


Once a logical processor has identified a user-interrupt notification, it performs user-interrupt notification
processing using the UPID at the linear address in the IA32_UINTR_PD MSR.
The following algorithm describes user-interrupt notification processing:
1. The logical processor clears the outstanding-notification bit (bit 0) in the UPID. This is done atomically so as to
leave the remainder of the descriptor unmodified.
2. The logical processor reads PIR (bits 127:64 of the UPID) into a temporary register and writes all zeros to PIR.
This is done atomically so as to ensure that each bit cleared in PIR is set in the temporary register.
3. If any bit is set in the temporary register, the logical processor sets in UIRR each bit corresponding to a bit set
in the temporary register (e.g., with a logical OR) and recognizes a pending user interrupt (if it has not already
done so).
The logical processor performs the steps above in an uninterruptible manner. Steps #1 and #2 may be combined
into a single atomic step. If step #3 leads to recognition of a user interrupt, the processor may deliver that user
interrupt on the following instruction boundary (see Section 8.4.2).
Although user-interrupt notification processing may occur at any privilege level, all of the memory accesses in
steps #1 and #2 are performed with supervisor privilege.
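A software model of these three steps, using GCC's atomic builtins and the illustrative struct upid sketched
earlier in this section, might look as follows. The real operations are performed by the processor itself, so this
is only a sketch of the required atomicity; the function name and the uirr stand-in are assumptions.

    /* Software model of user-interrupt notification processing. 'uirr'
       stands in for the processor's internal UIRR register. */
    static void process_notification(struct upid *u, volatile uint64_t *uirr)
    {
        /* Step 1: atomically clear the outstanding-notification bit,
           leaving the rest of the descriptor unmodified. */
        __atomic_fetch_and(&u->notification, ~UPID_ON, __ATOMIC_SEQ_CST);

        /* Step 2: atomically read PIR and clear it, so that every bit
           cleared in PIR is captured in the temporary value. */
        uint64_t pir = __atomic_exchange_n(&u->pir, 0, __ATOMIC_SEQ_CST);

        /* Step 3: OR the captured requests into UIRR; a nonzero UIRR
           means a pending user interrupt is recognized. */
        if (pir)
            *uirr |= pir;
    }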
Step #1 and step #2 each access the UPID using a linear address and may therefore incur faults (page faults, or
general-protection faults due to canonicality violations). If such a fault produces an error code that uses the EXT
bit, that bit will be set to 1.
If a fault occurs during user-interrupt notification processing, updates to architectural state performed by the
earlier user-interrupt notification identification (Section 8.5.1) remain committed and are not undone; if such a
fault occurs at step #2 (if it is not performed atomically with step #1), any update to architectural state performed
by step #1 also remains committed. System software is advised to prevent such faults (e.g., by ensuring that no

1. If the interrupt arrives between iterations of a REP-prefixed string instruction, the processor first updates state as follows: RIP is
loaded to reference the string instruction; RCX, RSI, and RDI are updated as appropriate to reflect the iterations completed; and
RFLAGS.RF is set to 1.


page fault occurs and that the linear address in the IA32_UINTR_PD MSR is canonical with respect to the paging
mode in use).
If the user-interrupt notification identification that precedes user-interrupt notification processing occurred due to
an ordinary interrupt that arrived while the logical processor was in the HLT state, the logical processor returns to
the HLT state following user-interrupt notification processing.

8.6 USER-INTERRUPT INSTRUCTIONS


The user-interrupt feature defines instructions for control-flow transfer and access to new state. UIRET is an
instruction to effect a return from a user-interrupt handler. CLUI, STUI, and TESTUI allow software to access UIF.
SENDUIPI sends a user IPI. See Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B,
2C, & 2D for details on the instructions’ operation.
The following items provide high-level overviews of the instructions:
• UIRET pops from the stack the state saved by user-interrupt delivery (see Section 8.4.2) and loads those
values into the corresponding registers (software should pop the user-interrupt vector from the stack before
executing UIRET). Because RIP is one of those registers, UIRET effects a return to the point from which the
user interrupt was delivered; see the handler sketch after this list.
• CLUI clears UIF.
• STUI sets UIF.
• TESTUI copies UIF to RFLAGS.CF.
• SENDUIPI is discussed in Section 8.8.
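A minimal user-level sketch tying these instructions together is shown below. It assumes a toolchain with
user-interrupt support (for example, GCC built with -muintr, which provides the interrupt attribute, the
__uintr_frame type, and the _clui/_stui intrinsics); those compiler features are assumptions here, not part of
the architecture definition.

    #include <x86gprintrin.h>

    /* A user-interrupt handler: the compiler emits code to pop the vector
       and execute UIRET on return (a toolchain assumption). */
    void __attribute__((interrupt))
    ui_handler(struct __uintr_frame *frame, unsigned long long uirrv)
    {
        /* ... handle the user interrupt with vector uirrv ... */
    }

    void critical_section(void)
    {
        _clui();    /* CLUI: UIF := 0; block user-interrupt delivery */
        /* ... code that must not be interrupted by user interrupts ... */
        _stui();    /* STUI: UIF := 1; allow delivery again */
    }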

8.7 FLEXIBLE UPDATES OF UIF BY UIRET


There may be software usages that seek to return from the handling of a user interrupt while maintaining UIF as 0.
This section describes an enhancement to UIRET that allows that.
This enhancement is supported if CPUID.(EAX=07H, ECX=01H):EDX.UIRET_UIF[bit 17] is enumerated as 1.
If the enhancement is supported, UIRET loads UIF with the value of the bit at position 1 in the RFLAGS image on
the stack. (If the enhancement is not supported, UIRET ignores that bit in the RFLAGS image and always sets UIF
to 1.)
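Software can test for the enhancement with CPUID; a minimal sketch using GCC's __get_cpuid_count helper (a
compiler convenience, not an architectural interface) follows.

    #include <cpuid.h>
    #include <stdbool.h>

    /* Returns true if UIRET loads UIF from bit 1 of the RFLAGS image:
       CPUID.(EAX=07H, ECX=01H):EDX[bit 17]. */
    static bool uiret_loads_uif(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx))
            return false;
        return (edx >> 17) & 1;
    }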
The value of RFLAGS[1] is fixed as 1. All operations that save RFLAGS to memory save bit 1 as set; all operations
that load RFLAGS leave bit 1 set.
This implies that user-interrupt delivery always saves on the stack an RFLAGS value that sets bit 1. If software does
not modify this stack value, UIRET will set UIF to 1, even with the enhancement. Thus, the enhancement should
not affect the operation of existing software, as long as that software does not modify the stack value saved for
RFLAGS[1].
If a user-interrupt handler seeks to return from the handling of a user interrupt while maintaining UIF as 0, it should
modify the RFLAGS image on the stack to clear bit 1. Subsequent execution of UIRET will load UIF with 0, as long
as the enhancement is supported.
Note that UIRET never modifies RFLAGS[1] (always leaving it with value 1) regardless of the stack value and
regardless of whether the enhancement is supported.
See Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C, & 2D for details on the
operation of the UIRET instruction.

8.8 USER IPIS


The SENDUIPI instruction sends a user interprocessor interrupt (IPI). The instruction uses a data structure called
the user-interrupt target table (UITT). This table is located at the linear address UITTADDR and it comprises


UITTSZ+1 16-byte entries (the values UITTADDR and UITTSZ are defined in Section 8.3.1). SENDUIPI uses the
UITT entry (UITTE) indexed by the instruction’s register operand. Each UITTE has the following format:
• Bit 0: V, a valid bit.
• Bits 7:1 are reserved and must be 0.
• Bits 15:8: UV, the user-interrupt vector (in the range 0–63, so bits 15:14 must be 0).
• Bits 63:16 are reserved.
• Bits 127:64: UPIDADDR, the linear address of a UPID (64-byte aligned, so bits 69:64 must be 0).
SENDUIPI sends a user interrupt by posting a user interrupt with vector UV in the UPID referenced by UPIDADDR and
then sending, as an ordinary IPI, any notification interrupt specified in that UPID. Details appear in Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C, & 2D.
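For illustration, a C sketch of a UITT entry consistent with the format above is given below; the struct and field
names are assumptions for this example.

    #include <stdint.h>

    /* Sketch of a 16-byte UITT entry (UITTE). */
    struct uitte {
        uint64_t v         : 1;    /* bit 0: valid */
        uint64_t reserved0 : 7;    /* bits 7:1: must be 0 */
        uint64_t uv        : 8;    /* bits 15:8: user-interrupt vector (0-63,
                                      so bits 15:14 must be 0) */
        uint64_t reserved1 : 48;   /* bits 63:16: reserved */
        uint64_t upidaddr;         /* bits 127:64: 64-byte-aligned linear
                                      address of the target UPID */
    };

With a toolchain that exposes the SENDUIPI intrinsic (e.g., GCC with -muintr; an assumption), sending the user
IPI described by UITT entry i is then a single call: _senduipi(i).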

17. Updates to Chapter 10, Volume 3A
Change bars and violet text show changes to Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 10.1.2.3, “Features to Disable Bus Locks,” with additional information on UC-lock disable.



CHAPTER 10
MULTIPLE-PROCESSOR MANAGEMENT

The Intel 64 and IA-32 architectures provide mechanisms for managing and improving the performance of multiple
processors connected to the same system bus. These include:
• Bus locking and/or cache coherency management for performing atomic operations on system memory.
• Serializing instructions.
• An advanced programmable interrupt controller (APIC) located on the processor chip (see Chapter 12,
“Advanced Programmable Interrupt Controller (APIC)”). This feature was introduced by the Pentium processor.
• A second-level cache (level 2, L2). For the Pentium 4, Intel Xeon, and P6 family processors, the L2 cache is
included in the processor package and is tightly coupled to the processor. For the Pentium and Intel486
processors, pins are provided to support an external L2 cache.
• A third-level cache (level 3, L3). For Intel Xeon processors, the L3 cache is included in the processor package
and is tightly coupled to the processor.
• Intel Hyper-Threading Technology. This extension to the Intel 64 and IA-32 architectures enables a single
processor core to execute two or more threads concurrently (see Section 10.5, “Intel® Hyper-Threading
Technology and Intel® Multi-Core Technology”).
These mechanisms are particularly useful in symmetric-multiprocessing (SMP) systems. However, they can also be
used when an Intel 64 or IA-32 processor and a special-purpose processor (such as a communications, graphics,
or video processor) share the system bus.
These multiprocessing mechanisms have the following characteristics:
• To maintain system memory coherency — When two or more processors are attempting simultaneously to
access the same address in system memory, some communication mechanism or memory access protocol
must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock
a memory location.
• To maintain cache consistency — When one processor accesses data cached on another processor, it must not
receive incorrect data. If it modifies data, all other processors that access that data must receive the modified
data.
• To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes
be observed externally in precisely the same order as programmed.
• To distribute interrupt handling among a group of processors — When several processors are operating in a
system in parallel, it is useful to have a centralized mechanism for receiving interrupts and distributing them to
available processors for servicing.
• To increase system performance by exploiting the multi-threaded and multi-process nature of contemporary
operating systems and applications.
The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 13. The
APIC architecture is described in Chapter 12. Bus and memory locking, serializing instructions, memory ordering,
and Intel Hyper-Threading Technology are discussed in the following sections.

10.1 LOCKED ATOMIC OPERATIONS


The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations
are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments,
or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The
processor uses three interdependent mechanisms for carrying out locked atomic operations:
• Guaranteed atomic operations.
• Bus locking, using the LOCK# signal and the LOCK instruction prefix.


• Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures
(cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors.
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading
or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the
processor guarantees that the operation will be completed before another processor or bus agent is allowed access
to the memory location. The processor also supports bus locking for performing selected memory operations (such
as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are
not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1
or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock.
Here the processor’s cache coherency protocols ensure that other processors that are caching the same memory
locations are managed properly while atomic operations are performed on cached memory locations.

NOTE
Where there are contested lock accesses, software may need to implement algorithms that ensure
fair access to resources in order to prevent lock starvation. The hardware provides no resource that
guarantees fairness to participating agents. It is the responsibility of software to manage the
fairness of semaphores and exclusive locking functions.
The mechanisms for handling locked atomic operations have evolved with the complexity of IA-32 processors. More
recent IA-32 processors (such as the Pentium 4, Intel Xeon, and P6 family processors) and Intel 64 processors
provide a more refined locking mechanism than earlier processors. These mechanisms are described in the following sections.

10.1.1 Guaranteed Atomic Operations


The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte.
• Reading or writing a word aligned on a 16-bit boundary.
• Reading or writing a doubleword aligned on a 32-bit boundary.
The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary.
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.
The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guar-
antee that the 16-byte memory operations performed by the following instructions will always be carried out atom-
ically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking
disabled).
(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
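As an illustration, on a processor that enumerates CPUID.01H:ECX.AVX[bit 28], a 16-byte aligned SSE/AVX move
gives an atomic 16-byte access. The sketch below assumes a compiler that lowers these intrinsics to
MOVDQA-class instructions (the usual behavior, but strictly a toolchain assumption).

    #include <emmintrin.h>

    /* 16-byte load/store that is atomic on AVX-enumerating processors.
       The pointer must be 16-byte aligned, or the instruction faults. */
    static __m128i load16(const __m128i *p)
    {
        return _mm_load_si128(p);     /* MOVDQA: one 16-byte access */
    }

    static void store16(__m128i *p, __m128i v)
    {
        _mm_store_si128(p, v);        /* MOVDQA store */
    }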
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be
atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and
P6 family processors provide bus control signals that permit external memory subsystems to make split accesses
atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be
avoided.


Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
5.10.4.4), such page faults may occur even if all accesses are to the same page.

10.1.2 Bus Locking


Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link. Assertion of this signal is called a bus lock. While this output
signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can
specify other occasions when the LOCK semantics are to be followed by prepending the LOCK prefix to an instruc-
tion.
In the case of the Intel386, Intel486, and Pentium processors, explicitly locked instructions will result in the asser-
tion of the LOCK# signal. It is the responsibility of the hardware designer to make the LOCK# signal available in
system hardware to control memory accesses among processors.
For the P6 and more recent processor families, if the memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted; instead, locking is only applied to the processor’s caches
(see Section 10.1.4, “Effects of a LOCK Operation on Internal Processor Caches”). These processors will assert a
bus lock for a locked access in either of the following situations: (1) the access is to multiple cache lines (a split
lock); or (2) the access uses a memory type other than WB (a UC lock; see footnote 1).

10.1.2.1 Automatic Locking


The operations on which the processor automatically follows the LOCK semantics are as follows:
• When executing an XCHG instruction that references memory.
• When switching to a task, the processor tests and sets the busy flag in the type field of the TSS descriptor. To
ensure that two processors do not switch to the same task simultaneously, the processor follows the LOCK
semantics while testing and setting this flag.
• When loading a segment descriptor, the processor sets the accessed flag in the segment descriptor if the flag is
clear. During this operation, the processor follows the LOCK semantics so that the descriptor will not be
modified by another processor while it is being updated. For this action to be effective, operating-system
procedures that update descriptors should use the following steps:
— Use a locked operation to modify the access-rights byte to indicate that the segment descriptor is not-
present, and specify a value for the type field that indicates that the descriptor is being updated.
— Update the fields of the segment descriptor. (This operation may require several memory accesses;
therefore, locked operations cannot be used.)
— Use a locked operation to modify the access-rights byte to indicate that the segment descriptor is valid and
present.
— The Intel386 processor always updates the accessed flag in the segment descriptor, whether it is clear or
not. The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors only update this flag if it is not
already set.
• The processor uses locked cycles to set the accessed and dirty flag in paging-structure entries.
• After an interrupt request, an interrupt controller may use the data bus to send the interrupt’s vector to the
processor. The processor follows the LOCK semantics during this time to ensure that no other data appears on
the data bus while the vector is being transmitted.

1. The term “UC lock” is used because the most common situation regards accesses to UC memory. Despite the name, locked accesses
to WC, WP, and WT memory also cause bus locks.


10.1.2.2 Software Controlled Bus Locking


To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they
are used to modify a memory location. An invalid-opcode exception (#UD) is generated when the LOCK prefix is
used with any other instruction or when no write operation is made to memory (that is, when the destination
operand is in a register).
• The bit test and modify instructions (BTS, BTR, and BTC).
• The exchange instructions (XADD, CMPXCHG, CMPXCHG8B, and CMPXCHG16B).
• The LOCK prefix is automatically assumed for the XCHG instruction.
• The following single-operand arithmetic and logical instructions: INC, DEC, NOT, and NEG.
• The following two-operand arithmetic and logical instructions: ADD, ADC, SUB, SBB, AND, OR, and XOR.
A locked instruction is guaranteed to lock only the area of memory defined by the destination operand, but may be
interpreted by the system as a lock for a larger memory area.
Software should access semaphores (shared memory used for signalling between multiple processors) using iden-
tical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access,
other processors should not access the semaphore using a byte access.

NOTE
Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to
a cache line containing a location used to implement a semaphore.
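In practice, software rarely writes the LOCK prefix by hand; C11 atomic read-modify-write operations compile to
LOCK-prefixed instructions on Intel 64. The sketch below keeps the semaphore naturally aligned so that the
processor can use cache locking rather than a bus lock; the instruction named in the comment is typical compiler
output, not a guarantee.

    #include <stdatomic.h>
    #include <stdint.h>

    /* A naturally aligned 64-bit semaphore; all agents should access it
       with the same operand length. */
    static _Alignas(8) _Atomic uint64_t semaphore;

    void signal_semaphore(void)
    {
        atomic_fetch_add(&semaphore, 1);   /* typically LOCK XADD or LOCK ADD */
    }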

The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed
for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses
be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
Locked operations are atomic with respect to all other memory operations and all externally visible events. Only
instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchro-
nize data written by one processor and read by another processor.
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for
them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load
operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
Locked instructions should not be used to ensure that data written can be fetched as instructions.

NOTE
The locked instructions for the current versions of the Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors allow data written to be fetched as instructions. However, Intel
recommends that developers who require the use of self-modifying code use a different synchro-
nizing mechanism, described in the following sections.

10.1.2.3 Features to Disable Bus Locks


Because bus locks may adversely affect performance in certain situations, processors may support two features
that system software can use to disable bus locking. These are called split-lock disable and UC-lock disable.
Support for split-lock disable is enumerated by IA32_CORE_CAPABILITIES[5].
Software enables split-lock disable by setting MSR_MEMORY_CTRL[29]. When this bit is set, a locked access to
multiple cache lines causes an alignment-check exception (#AC) with a zero error code (see footnote 1). The locked access does
not occur.


A processor enumerates support for UC-lock disable either by setting bit 4 of the IA32_CORE_CAPABILITIES MSR
(MSR index CFH) or by enumerating CPUID.(EAX=07H, ECX=2):EDX[bit 6] as 1. The latter form of enumeration
(CPUID) is used beginning with processors based on Sierra Forest microarchitecture; earlier processors may use
the former form (IA32_CORE_CAPABILITIES).

NOTE
No processor will both set IA32_CORE_CAPABILITIES[4] and enumerate
CPUID.(EAX=07H, ECX=2):EDX[bit 6] as 1.
If a processor enumerates support for UC-lock disable (in either way), software can enable UC-lock disable by
setting MSR_MEMORY_CTRL[28]. When this bit is set, a locked access using a memory type other than WB causes
a fault. The locked access does not occur. The specific fault that occurs depends on how UC-lock disable is enumer-
ated:
• If IA32_CORE_CAPABILITIES[4] is read as 1, the UC lock results in a general-protection exception (#GP) with
a zero error code.
• If CPUID.(EAX=07H, ECX=2):EDX[bit 6] is enumerated as 1, the UC lock results in an #AC with an error code
with value 4.
UC-lock disable does not apply to locked accesses to physical addresses specified in a VMCS. Such accesses include
updates to accessed and dirty flags for EPT and those to posted-interrupt descriptors.
UC-lock disable is not enabled if CR0.CD = 1 or if MSR_PRMRR_BASE_0[2:0] ≠ 6 (WB) when PRMRRs are enabled.
If either of those conditions hold, the processor ignores the value of MSR_MEMORY_CTRL[28].
Note that the #AC(0) due to split-lock disable or alignment check is higher priority than a #GP(0) or #AC(4) due
to UC-lock disable. If both features are enabled, a locked access to multiple cache lines causes #AC(0) regardless
of the memory type(s) being accessed.
While MSR_MEMORY_CTRL is not an architectural MSR, the behavior described above is consistent across
processor models that enumerate the support in IA32_CORE_CAPABILITIES or CPUID.
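A kernel-mode sketch of enabling both features is shown below. The rdmsr()/wrmsr() helpers and the
MSR_MEMORY_CTRL index are assumptions for illustration (the MSR is model-specific, as noted above); only
IA32_CORE_CAPABILITIES at index CFH is taken from the text, and a complete implementation would also check
the CPUID enumeration of UC-lock disable.

    #include <stdint.h>

    /* Assumed helpers wrapping the RDMSR/WRMSR instructions. */
    extern uint64_t rdmsr(uint32_t msr);
    extern void wrmsr(uint32_t msr, uint64_t value);

    #define IA32_CORE_CAPABILITIES 0xCF
    #define MSR_MEMORY_CTRL        0x33   /* assumption: model-specific index */

    void enable_bus_lock_disable(void)
    {
        uint64_t caps = rdmsr(IA32_CORE_CAPABILITIES);
        uint64_t ctrl = rdmsr(MSR_MEMORY_CTRL);

        if (caps & (1ULL << 5))     /* split-lock disable supported */
            ctrl |= 1ULL << 29;     /* split locks now cause #AC(0) */
        if (caps & (1ULL << 4))     /* UC-lock disable (MSR enumeration) */
            ctrl |= 1ULL << 28;     /* UC locks now cause #GP(0) */

        wrmsr(MSR_MEMORY_CTRL, ctrl);
    }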
In addition to these features that disable bus locks, there are features that allow software to detect when a bus lock
has occurred. See Section 19.3.1.6 for information about OS bus-lock detection and Section 27.2 for information
about the VMM bus-lock detection.

10.1.3 Handling Self- and Cross-Modifying Code


The act of a processor writing data into a currently executing code segment with the intent of executing that data
as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self-
modified code, depending upon how far ahead of the current execution pointer the code has been modified.
As processor microarchitectures become more complex and start to speculatively execute code ahead of the retire-
ment point (as in P6 and more recent processor families), the rules regarding which code should execute, pre- or
post-modification, become blurred. To write self-modifying code and ensure that it is compliant with current and
future versions of the IA-32 architectures, use one of the following coding options:

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

1. Other alignment-check exceptions occur only if CR0.AM = 1, EFLAGS.AC = 1, and CPL = 3. The alignment-check exceptions resulting
from split-lock disable may occur even if CR0.AM = 0, EFLAGS.AC = 0, or CPL < 3.


The use of one of these options is not required for programs intended to run on the Pentium or Intel486 processors,
but is recommended to ensure compatibility with the P6 and more recent processor families.
Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The
degree of the performance deterioration will depend upon the frequency of modification and specific characteristics
of the code.
The act of one processor writing data into the currently executing code segment of a second processor with the
intent of having the second processor execute that data as code is called cross-modifying code. As with self-
modifying code, IA-32 processors exhibit model-specific behavior when executing cross-modifying code,
depending upon how far ahead of the executing processor's current execution pointer the code has been modified.
To write cross-modifying code and ensure that it is compliant with current and future versions of the IA-32 archi-
tecture, the following processor synchronization algorithm must be implemented:

(* Action of Modifying Processor *)


Memory_Flag := 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag := 1;

(* Action of Executing Processor *)


WHILE (Memory_Flag ≠ 1)
Wait for code to update;
ELIHW;
Execute serializing instruction; (* For example, CPUID instruction *)
Begin executing modified code;
(The use of this option is not required for programs intended to run on the Intel486 processor, but is recommended
to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and Pentium processors.)
Like self-modifying code, cross-modifying code will execute at a lower level of performance than non-cross-modi-
fying (normal) code, depending upon the frequency of modification and specific characteristics of the code.
The restrictions on self-modifying code and cross-modifying code also apply to the Intel 64 architecture.
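Rendered in C, the cross-modifying algorithm above might look like the following sketch. The release/acquire
pair orders the code-byte stores before the flag, and the CPUID-based serialize() helper (an illustrative name)
provides the required serializing instruction.

    #include <stdatomic.h>

    static atomic_int memory_flag;

    static void serialize(void)
    {
        unsigned int a, b, c, d;
        /* CPUID is a serializing instruction; leaf 0 is arbitrary. */
        __asm__ volatile ("cpuid"
                          : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                          : "a"(0)
                          : "memory");
    }

    void modifying_processor(void)
    {
        atomic_store_explicit(&memory_flag, 0, memory_order_relaxed);
        /* store modified code (as data) into the code segment here */
        atomic_store_explicit(&memory_flag, 1, memory_order_release);
    }

    void executing_processor(void)
    {
        while (atomic_load_explicit(&memory_flag, memory_order_acquire) != 1)
            ;                       /* wait for the code update */
        serialize();
        /* begin executing the modified code */
    }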

10.1.4 Effects of a LOCK Operation on Internal Processor Caches


For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation,
even if the area of memory being locked is cached in the processor.
For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is
cached in the processor that is performing the LOCK operation as write-back memory and is completely contained
in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory loca-
tion internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This
operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more proces-
sors that have cached the same area of memory from simultaneously modifying data in that area.

10.2 MEMORY ORDERING


The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores)
through the system bus to system memory. The Intel 64 and IA-32 architectures support several memory-ordering
models depending on the implementation of the architecture. For example, the Intel386 processor enforces
program ordering (generally referred to as strong ordering), where reads and writes are issued on the system
bus in the order they occur in the instruction stream under all circumstances.
To allow performance optimization of instruction execution, the IA-32 architecture allows departures from the
strong-ordering model, called processor ordering, in the Pentium 4, Intel Xeon, and P6 family processors. These processor-
ordering variations (called here the memory-ordering model) allow performance enhancing operations such as
allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execu-
tion speeds, while maintaining memory coherency, even in multiple-processor systems.


Section 10.2.1 and Section 10.2.2 describe the memory-ordering implemented by Intel486, Pentium, Intel Core 2
Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors. Section 10.2.3 gives examples
illustrating the behavior of the memory-ordering model on IA-32 and Intel-64 processors. Section 10.2.4 considers
the special treatment of stores for string operations and Section 10.2.5 discusses how memory-ordering behavior
may be modified through the use of specific instructions.

10.2.1 Memory Ordering in the Intel® Pentium® and Intel486™ Processors


The Pentium and Intel486 processors follow the processor-ordered memory model; however, they operate as
strongly-ordered processors under most circumstances. Reads and writes always appear in programmed order at
the system bus—except for the following situation where processor ordering is exhibited. Read misses are
permitted to go ahead of buffered writes on the system bus when all the buffered writes are cache hits and, there-
fore, are not directed to the same address being accessed by the read miss.
In the case of I/O operations, both reads and writes always appear in programmed order.
Software intended to operate correctly in processor-ordered processors (such as the Pentium 4, Intel Xeon, and P6
family processors) should not depend on the relatively strong ordering of the Pentium or Intel486 processors.
Instead, it should ensure that accesses to shared variables that are intended to control concurrent execution
among processors are explicitly required to obey program ordering through the use of appropriate locking or seri-
alizing operations (see Section 10.2.5, “Strengthening or Weakening the Memory-Ordering Model”).

10.2.2 Memory Ordering in P6 and More Recent Processor Families


The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, and P6 family processors also use a processor-
ordered memory-ordering model that can be further defined as “write ordered with store-buffer forwarding.” This
model can be characterized as follows.
In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model
respects the following principles (Note the memory-ordering principles for single-processor and multiple-
processor systems are written from the perspective of software executing on the processor, where the term
“processor” refers to a logical processor. For example, a physical processor supporting multiple cores and/or Intel
Hyper-Threading Technology is treated as a multiprocessor system.):
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following exceptions:
— streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ,
MOVNTDQ, MOVNTPS, and MOVNTPD); and
— string operations (see Section 10.2.4.1).
• No write to memory may be reordered with an execution of the CLFLUSH instruction; a write may be reordered
with an execution of the CLFLUSHOPT instruction that flushes a cache line other than the one being written (see footnote 1).
Executions of the CLFLUSH instruction are not reordered with each other. Executions of CLFLUSHOPT that
access different cache lines may be reordered with each other. An execution of CLFLUSHOPT may be reordered
with an execution of CLFLUSH that accesses a different cache line.
• Reads may be reordered with older writes to different locations but not with older writes to the same location.
• Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
• Reads cannot pass earlier LFENCE and MFENCE instructions.
• Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE
instructions.
• LFENCE instructions cannot pass earlier reads.
• SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.

1. Earlier versions of this manual specified that writes to memory may be reordered with executions of the CLFLUSH instruction. No
processors implementing the CLFLUSH instruction allow such reordering.


• MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.
In a multiple-processor system, the following ordering principles apply:
• Individual processors use the same ordering principles as in a single-processor system.
• Writes by a single processor are observed in the same order by all processors.
• Writes from an individual processor are NOT ordered with respect to the writes from other processors.
• Memory ordering obeys causality (memory ordering respects transitive visibility).
• Any two stores are seen in a consistent order by processors other than those performing the stores.
• Locked instructions have a total order.
See the example in Figure 10-1. Consider three processors in a system and each processor performs three writes,
one to each of three defined locations (A, B, and C). Individually, the processors perform the writes in the same
program order, but because of bus arbitration and other memory access mechanisms, the order that the three
processors write the individual memory locations can differ each time the respective code sequences are executed
on the processors. The final values in location A, B, and C would possibly vary on each execution of the write
sequence.
The processor-ordering model described in this section is virtually identical to that used by the Pentium and
Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
• Added support for speculative reads, while still adhering to the ordering principles above.
• Store-buffer forwarding, when a read passes a write to the same memory location.
• Out of order store from long string store and string move operations (see Section 10.2.4, “Fast-String
Operation and Out-of-Order Stores,” below).

[Figure 10-1 shows three processors, each writing locations A, B, and C in program order. Writes are in order with
respect to individual processors, but the interleaved order in which the writes from all processors reach memory is
not guaranteed to occur in a particular order and may differ on each execution.]

Figure 10-1. Example of Write Ordering in Multiple-Processor Systems

NOTE
In P6 processor family, store-buffer forwarding to reads of WC memory from streaming stores to
the same address does not occur due to errata.


10.2.3 Examples Illustrating the Memory-Ordering Principles


This section provides a set of examples that illustrate the behavior of the memory-ordering principles introduced in
Section 10.2.2. They are designed to give software writers an understanding of how memory ordering may affect
the results of different sequences of instructions.
These examples are limited to accesses to memory regions defined as write-back cacheable (WB). (Section
10.2.3.1 describes other limitations on the generality of the examples.) The reader should understand that they
describe only software-visible behavior. A logical processor may reorder two accesses even if one of the examples indi-
cates that they may not be reordered. Such an example states only that software cannot detect that such a reor-
dering occurred. Similarly, a logical processor may execute a memory access more than once as long as the
behavior visible to software is consistent with a single execution of the memory access.

10.2.3.1 Assumptions, Terminology, and Notation


As noted above, the examples in this section are limited to accesses to memory regions defined as write-back
cacheable (WB). They apply only to ordinary loads and stores and to locked read-modify-write instructions. They do not
necessarily apply to any of the following: out-of-order stores for string instructions (see Section 10.2.4); accesses
with a non-temporal hint; reads from memory by the processor as part of address translation (e.g., page walks);
and updates to segmentation and paging structures by the processor (e.g., to update “accessed” bits).
The principles underlying the examples in this section apply to individual memory accesses and to locked read-
modify-write instructions. The Intel-64 memory-ordering model guarantees that, for each of the following
memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix)
appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of
alignment.
Other instructions may be implemented with multiple memory accesses. From a memory-ordering point of view,
there are no guarantees regarding the relative order in which the constituent memory accesses are made. There is
also no guarantee that the constituent operations of a store are executed in the same order as the constituent
operations of a load.
Section 10.2.3.2 through Section 10.2.3.7 give examples using the MOV instruction. The principles that underlie
these examples apply to load and store accesses in general and to other instructions that load from or store to
memory. Section 10.2.3.8 and Section 10.2.3.9 give examples using the XCHG instruction. The principles that
underlie these examples apply to other locked read-modify-write instructions.
This section uses the term “processor” to refer to a logical processor. The examples are written using Intel-64
assembly-language syntax and use the following notational conventions:
• Arguments beginning with an “r”, such as r1 or r2 refer to registers (e.g., EAX) visible only to the processor
being considered.
• Memory locations are denoted with x, y, z.
• Stores are written as mov [ _x], val, which implies that val is being stored into the memory location x.
• Loads are written as mov r, [ _x], which implies that the contents of the memory location x are being loaded
into the register r.
As noted earlier, the examples refer only to software visible behavior. When the succeeding sections make state-
ments such as “the two stores are reordered,” the implication is only that “the two stores appear to be reordered
from the point of view of software.”


10.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
The Intel-64 memory-ordering model allows neither loads nor stores to be reordered with the same kind of opera-
tion. That is, it ensures that loads are seen in program order and that stores are seen in program order. This is illus-
trated by the following example:
Example 10-1. Stores Are Not Reordered with Other Stores
Processor 0 Processor 1
mov [ _x], 1 mov r1, [ _y]
mov [ _y], 1 mov r2, [ _x]
Initially x = y = 0
r1 = 1 and r2 = 0 is not allowed

The disallowed return values could be exhibited only if processor 0’s two stores are reordered (with the two loads
occurring between them) or if processor 1’s two loads are reordered (with the two stores occurring between them).

If r1 = 1, the store to y occurs before the load from y. Because the Intel-64 memory-ordering model does not allow
stores to be reordered, the earlier store to x occurs before the load from y. Because the Intel-64 memory-ordering
model does not allow loads to be reordered, the store to x also occurs before the later load from x. Thus, r2 = 1.

10.2.3.3 Stores Are Not Reordered With Earlier Loads


The Intel-64 memory-ordering model ensures that a store by a processor may not occur before a previous load by
the same processor. This is illustrated in Example 10-2.


Example 10-2. Stores Are Not Reordered with Older Loads


Processor 0 Processor 1
mov r1, [ _x] mov r2, [ _y]
mov [ _y], 1 mov [ _x], 1
Initially x = y = 0
r1 = 1 and r2 = 1 is not allowed

Assume r1 = 1.
• Because r1 = 1, processor 1’s store to x occurs before processor 0’s load from x.
• Because the Intel-64 memory-ordering model prevents each store from being reordered with the earlier load
by the same processor, processor 1’s load from y occurs before its store to x.
• Similarly, processor 0’s load from x occurs before its store to y.
• Thus, processor 1’s load from y occurs before processor 0’s store to y, implying r2 = 0.

10.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations


The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location.
However, loads are not reordered with stores to the same location.
The fact that a load may be reordered with an earlier store to a different location is illustrated by the following
example:
Example 10-3. Loads May be Reordered with Older Stores
Processor 0 Processor 1
mov [ _x], 1 mov [ _y], 1
mov r1, [ _y] mov r2, [ _x]
Initially x = y = 0
r1 = 0 and r2 = 0 is allowed

At each processor, the load and the store are to different locations and hence may be reordered. Any interleaving
of the operations is thus allowed. One such interleaving has the two loads occurring before the two stores. This
would result in each load returning value 0.
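If the reordering in Example 10-3 is unwanted, a full fence between each processor's store and load forbids the
r1 = 0 and r2 = 0 outcome. In C11 terms, a seq_cst fence (typically compiled to MFENCE or an equivalent
LOCK-prefixed operation on Intel 64; a toolchain detail) suffices, as sketched below.

    #include <stdatomic.h>

    static atomic_int x, y;
    static int r1, r2;

    void processor0(void)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* e.g., MFENCE */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void processor1(void)
    {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* e.g., MFENCE */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
    }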
The fact that a load may not be reordered with an earlier store to the same location is illustrated by the following
example:
Example 10-4. Loads Are not Reordered with Older Stores to the Same Location
Processor 0
mov [ _x], 1
mov r1, [ _x]
Initially x = 0
r1 = 0 is not allowed

The Intel-64 memory-ordering model does not allow the load to be reordered with the earlier store because the
accesses are to the same location. Therefore, r1 = 1 must hold.


10.2.3.5 Intra-Processor Forwarding Is Allowed


The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two
processors; specifically, each processor may perceive its own store occurring before that of the other. This is illus-
trated by the following example:
Example 10-5. Intra-Processor Forwarding is Allowed
Processor 0 Processor 1
mov [ _x], 1 mov [ _y], 1
mov r1, [ _x] mov r3, [ _y]
mov r2, [ _y] mov r4, [ _x]
Initially x = y = 0
r2 = 0 and r4 = 0 is allowed

The memory-ordering model imposes no constraints on the order in which the two stores appear to execute by the
two processors. This fact allows processor 0 to see its store before seeing processor 1's, while processor 1 sees its
store before seeing processor 0's. (Each processor is self consistent.) This allows r2 = 0 and r4 = 0.
In practice, the reordering in this example can arise as a result of store-buffer forwarding. While a store is tempo-
rarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot
satisfy) loads by other processors.

10.2.3.6 Stores Are Transitively Visible


The memory-ordering model ensures transitive visibility of stores; stores that are causally related appear to all
processors to occur in an order consistent with the causality relation. This is illustrated by the following example:
Example 10-6. Stores Are Transitively Visible
Processor 0 Processor 1 Processor 2
mov [ _x], 1 mov r1, [ _x]
mov [ _y], 1 mov r2, [ _y]
mov r3, [_x]
Initially x = y = 0
r1 = 1, r2 = 1, r3 = 0 is not allowed

Assume that r1 = 1 and r2 = 1.


• Because r1 = 1, processor 0’s store occurs before processor 1’s load.
• Because the memory-ordering model prevents a store from being reordered with an earlier load (see Section
10.2.3.3), processor 1’s load occurs before its store. Thus, processor 0’s store causally precedes processor 1’s
store.
• Because processor 0’s store causally precedes processor 1’s store, the memory-ordering model ensures that
processor 0’s store appears to occur before processor 1’s store from the point of view of all processors.
• Because r2 = 1, processor 1’s store occurs before processor 2’s load.
• Because the Intel-64 memory-ordering model prevents loads from being reordered (see Section 10.2.3.2),
processor 2’s loads occur in order.
• The above items imply that processor 0’s store to x occurs before processor 2’s load from x. This implies that
r3 = 1.


10.2.3.7 Stores Are Seen in a Consistent Order by Other Processors


As noted in Section 10.2.3.5, the memory-ordering model allows stores by two processors to be seen in different
orders by those two processors. However, any two stores must appear to execute in the same order to all proces-
sors other than those performing the stores. This is illustrated by the following example:
Example 10-7. Stores Are Seen in a Consistent Order by Other Processors
Processor 0 Processor 1 Processor 2 Processor 3
mov [ _x], 1 mov [ _y], 1 mov r1, [ _x] mov r3, [_y]
mov r2, [ _y] mov r4, [_x]

Initially x = y =0
r1 = 1, r2 = 0, r3 = 1, r4 = 0 is not allowed

By the principles discussed in Section 10.2.3.2:


• Processor 2’s first and second load cannot be reordered.
• Processor 3’s first and second load cannot be reordered.
• If r1 = 1 and r2 = 0, processor 0’s store appears to precede processor 1’s store with respect to processor 2.
• Similarly, r3 = 1 and r4 = 0 imply that processor 1’s store appears to precede processor 0’s store with respect
to processor 3.
Because the memory-ordering model ensures that any two stores appear to execute in the same order to all
processors (other than those performing the stores), this set of return values is not allowed.

10.2.3.8 Locked Instructions Have a Total Order


The memory-ordering model ensures that all processors agree on a single execution order of all locked instruc-
tions, including those that are larger than 8 bytes or are not naturally aligned. This is illustrated by the following
example:
Example 10-8. Locked Instructions Have a Total Order
Processor 0 Processor 1 Processor 2 Processor 3
xchg [ _x], r1 xchg [ _y], r2
mov r3, [ _x] mov r5, [_y]
mov r4, [ _y] mov r6, [_x]
Initially r1 = r2 = 1, x = y = 0
r3 = 1, r4 = 0, r5 = 1, r6 = 0 is not allowed

Processor 2 and processor 3 must agree on the order of the two executions of XCHG. Without loss of generality,
suppose that processor 0’s XCHG occurs first.
• If r5 = 1, processor 1’s XCHG into y occurs before processor 3’s load from y.
• Because the Intel-64 memory-ordering model prevents loads from being reordered (see Section 10.2.3.2),
processor 3’s loads occur in order and, therefore, processor 1’s XCHG occurs before processor 3’s load from x.
• Since processor 0’s XCHG into x occurs before processor 1’s XCHG (by assumption), it occurs before
processor 3’s load from x. Thus, r6 = 1.
A similar argument (referring instead to processor 2’s loads) applies if processor 1’s XCHG occurs before
processor 0’s XCHG.

10.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute
earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a


load or a store. The reader should note that reordering is prevented also if the locked instruction is executed after
a load or a store.
The first example illustrates that loads may not be reordered with earlier locked instructions:
Example 10-9. Loads Are not Reordered with Locks
Processor 0 Processor 1
xchg [ _x], r1 xchg [ _y], r3
mov r2, [ _y] mov r4, [ _x]
Initially x = y = 0, r1 = r3 = 1
r2 = 0 and r4 = 0 is not allowed

As explained in Section 10.2.3.8, there is a total order of the executions of locked instructions. Without loss of
generality, suppose that processor 0’s XCHG occurs first.
Because the Intel-64 memory-ordering model prevents processor 1’s load from being reordered with its earlier
XCHG, processor 0’s XCHG occurs before processor 1’s load. This implies r4 = 1.
A similar argument (referring instead to processor 0’s accesses) applies if processor 1’s XCHG occurs before
processor 0’s XCHG.
The second example illustrates that a store may not be reordered with an earlier locked instruction:
Example 10-10. Stores Are not Reordered with Locks
Processor 0 Processor 1
xchg [ _x], r1 mov r2, [ _y]
mov [ _y], 1 mov r3, [ _x]
Initially x = y = 0, r1 = 1
r2 = 1 and r3 = 0 is not allowed

Assume r2 = 1.
• Because r2 = 1, processor 0’s store to y occurs before processor 1’s load from y.
• Because the memory-ordering model prevents a store from being reordered with an earlier locked instruction,
processor 0’s XCHG into x occurs before its store to y. Thus, processor 0’s XCHG into x occurs before
processor 1’s load from y.
• Because the memory-ordering model prevents loads from being reordered (see Section 10.2.3.2),
processor 1’s loads occur in order and, therefore, processor 1’s XCHG into x occurs before processor 1’s load
from x. Thus, r3 = 1.

10.2.4 Fast-String Operation and Out-of-Order Stores


Section 7.3.9.3 of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes an optimi-
zation of repeated string operations called fast-string operation.
As explained in that section, the stores produced by fast-string operation may appear to execute out of order. Soft-
ware dependent upon sequential store ordering should not use string operations for the entire data structure to be
stored. Data and semaphores should be separated. Order-dependent code should write to a discrete semaphore
variable after any string operations to allow correctly ordered data to be seen by all processors. Atomicity of load
and store operations is guaranteed only for native data elements of the string with native data size, and only if they
are included in a single cache line.
Section 10.2.4.1 and Section 10.2.4.2 provide further explanation and examples.

10.2.4.1 Memory-Ordering Model for String Operations on Write-Back (WB) Memory


This section deals with the memory-ordering model for string operations on write-back (WB) memory for the Intel
64 architecture.


The memory-ordering model respects the following principles:


1. Stores within a single string operation may be executed out of order.
2. Stores from separate string operations (for example, stores from consecutive string operations) do not execute
out of order. All the stores from an earlier string operation will complete before any store from a later string
operation.
3. String operations are not reordered with other store operations.
Fast string operations (e.g., string operations initiated with the MOVS/STOS instructions and the REP prefix) may
be interrupted by exceptions or interrupts. The interrupts are precise but may be delayed - for example, the inter-
ruptions may be taken at cache line boundaries, after every few iterations of the loop, or after operating on every
few bytes. Different implementations may choose different options, or may even choose not to delay interrupt
handling, so software should not rely on the delay. When the interrupt/trap handler is reached, the source/destina-
tion registers point to the next string element to be operated on, while the EIP stored in the stack points to the
string instruction, and the ECX register has the value it held following the last successful iteration. The return from
that trap/interrupt handler should cause the string instruction to be resumed from the point where it was inter-
rupted.
The string operation memory-ordering principles (items 2 and 3 above) should be interpreted by taking the inter-
ruptibility of fast string operations into account. For example, if a fast string operation gets interrupted after k iter-
ations, then stores performed by the interrupt handler will become visible after the fast string stores from iteration
0 to k, and before the fast string stores from the (k+1)th iteration onward.
Stores within a single string operation may execute out of order (item 1 above) only if fast string operation is
enabled. Fast string operations are enabled/disabled through the IA32_MISC_ENABLE model specific register.
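
As a sketch of the data/semaphore separation recommended in Section 10.2.4, the following hypothetical fragment
publishes a 512-byte buffer through a discrete flag; _buf and _done are illustrative labels, not part of the examples
below:

; Producer (processor 0): fill the buffer with a fast-string store, then publish.
MOV EDI, OFFSET _buf      ; ES:EDI = destination buffer (ES assumed set up)
MOV EAX, 1                ; doubleword pattern to store
MOV ECX, 128              ; 128 iterations x 4 bytes = 512 bytes
CLD                       ; store in ascending order
REP STOSD                 ; stores within this operation may complete out of order
MOV BYTE PTR [_done], 1   ; by principle 3, this store becomes visible only after
                          ; all stores of the REP STOSD above
; Consumer (processor 1): wait for the flag, then read the data.
Spin:
CMP BYTE PTR [_done], 1
JNE Spin                  ; flag not yet visible
MOV EAX, [_buf]           ; guaranteed to observe the string stores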

10.2.4.2 Examples Illustrating Memory-Ordering Principles for String Operations


The following examples use the same notation and conventions as described in Section 10.2.3.1.
In Example 10-11, processor 0 executes one round (128 iterations) of a doubleword string store operation via rep:stosd,
writing the value 1 (value in EAX) into a block of 512 bytes from location _x (kept in ES:EDI) in ascending order.
Since each operation stores a doubleword (4 bytes), the operation is repeated 128 times (value in ECX). The block
of memory initially contains 0. Processor 1 reads two memory locations that are part of the memory block
being updated by processor 0, i.e., locations in the range _x to (_x+511).

Example 10-11. Stores Within a String Operation May be Reordered


Processor 0          Processor 1
rep:stosd [ _x]      mov r1, [ _z]
                     mov r2, [ _y]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_x] to 511[_x]= 0, _x <= _y < _z < _x+512
r1 = 1 and r2 = 0 is allowed

It is possible for processor 1 to perceive that the repeated string stores in processor 0 are happening out of order.
Assume that fast string operations are enabled on processor 0.
In Example 10-12, processor 0 executes two separate rounds of a rep:stosd operation of 128 doubleword stores each,
writing the value 1 (value in EAX) into the first block of 512 bytes from location _x (kept in ES:EDI) in ascending
order. It then writes 1 into a second block of memory from (_x+512) to (_x+1023). All of the memory locations
initially contain 0. Processor 1 performs two load operations, one from each block of memory.


Example 10-12. Stores Across String Operations Are not Reordered


Processor 0          Processor 1
rep:stosd [ _x]      mov r1, [ _z]
mov ecx, $128        mov r2, [ _y]
rep:stosd 512[ _x]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_x] to 1023[_x]= 0, _x <= _y < _x+512 < _z < _x+1024
r1 = 1 and r2 = 0 is not allowed

It is not possible in the above example for processor 1 to perceive any of the stores from the later string operation
(to the second 512-byte block) in processor 0 before seeing the stores from the earlier string operation to the first
512-byte block.
The above example assumes that writes to the second block (_x+512 to _x+1023) are not executed while
processor 0’s string operation to the first block is interrupted. If the string operation to the first block by
processor 0 is interrupted, and a write to the second memory block is executed by the interrupt handler, then that
change in the second memory block will be visible before the string operation to the first memory block resumes.
In Example 10-13, processor 0 executes one round (128 iterations) of a doubleword string store operation via rep:stosd,
writing the value 1 (value in EAX) into a block of 512 bytes from location _x (kept in ES:EDI) in ascending order. It
then writes to a second memory location (_z) outside the memory block of the previous string operation. Processor 1
performs two read operations: the first read is from _z, an address outside the 512-byte block that processor 0
updates after the string operation; the second read is from inside the block of memory written by the string operation.

Example 10-13. String Operations Are not Reordered with later Stores
Processor 0 Processor 1
rep:stosd [ _x] mov r1, [ _z]
mov [_z], $1 mov r2, [ _y]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_y] = [_z] = 0, [_x] to 511[_x]= 0, _x <= _y < _x+512, _z is a separate memory location
r1 = 1 and r2 = 0 is not allowed

Processor 1 cannot perceive the later store by processor 0 until it sees all the stores from the string operation.
Example 10-13 assumes that processor 0’s store to [_z] is not executed while the string operation has been inter-
rupted. If the string operation is interrupted and the store to [_z] by processor 0 is executed by the interrupt
handler, then changes to [_z] will become visible before the string operation resumes.
Example 10-14 illustrates the visibility principle when a string operation is interrupted.

Example 10-14. Interrupted String Operation


Processor 0                                               Processor 1
rep:stosd [ _x] // interrupted before ES:EDI reaches _y   mov r1, [ _z]
mov [_z], $1 // interrupt handler                         mov r2, [ _y]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_y] = [_z] = 0, [_x] to 511[_x]= 0, _x <= _y < _x+512, _z is a separate memory location
r1 = 1 and r2 = 0 is allowed


In Example 10-14, processor 0 started a string operation to write to a memory block of 512 bytes starting at
address _x. Processor 0 got interrupted after k iterations of store operations. The address _y has not yet been
updated by processor 0 when processor 0 got interrupted. The interrupt handler that took control on processor 0
writes to the address _z. Processor 1 may see the store to _z from the interrupt handler, before seeing the
remaining stores to the 512-byte memory block that are executed when the string operation resumes.
Example 10-15 illustrates the ordering of string operations with earlier stores. No store from a string operation can
be visible before all prior stores are visible.

Example 10-15. String Operations Are not Reordered with Earlier Stores
Processor 0 Processor 1
mov [_z], $1 mov r1, [ _y]
rep:stosd [ _x] mov r2, [ _z]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_y] = [_z] = 0, [_x] to 511[_x]= 0, _x <= _y < _x+512, _z is a separate memory location
r1 = 1 and r2 = 0 is not allowed

10.2.5 Strengthening or Weakening the Memory-Ordering Model


The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory-
ordering model to handle special programming situations. These mechanisms include:
• The I/O instructions, locked instructions, the LOCK prefix, and serializing instructions force stronger ordering
on the processor.
• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and
MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization
capabilities for specific types of memory operations.
• The memory type range registers (MTRRs) can be used to strengthen or weaken memory ordering for specific
areas of physical memory (see Section 13.11, “Memory Type Range Registers (MTRRs)”). MTRRs are available
only in the Pentium 4, Intel Xeon, and P6 family processors.
• The page attribute table (PAT) can be used to strengthen memory ordering for a specific page or group of pages
(see Section 13.12, “Page Attribute Table (PAT)”). The PAT is available only in the Pentium 4, Intel Xeon, and
Pentium III processors.
These mechanisms can be used as follows:
Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O
buffers. The I/O instructions (the IN and OUT instructions) can be used to impose strong write ordering on such
accesses, as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the
program to complete and for all buffered writes to drain to memory. Only instruction fetches and page table walks
can pass I/O instructions. Execution of subsequent instructions does not begin until the processor determines that
the I/O instruction has been completed.
Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model.
Here, a program can use a locked instruction such as the XCHG instruction or the LOCK prefix to ensure that a
read-modify-write operation on memory is carried out atomically. Locked instructions typically operate like I/O
instructions in that they wait for all previous memory accesses to complete and for all buffered writes to drain to
memory (see Section 10.1.2, “Bus Locking”). Unlike I/O operations, locked instructions do not wait for all previous
instructions to complete execution.
Program synchronization can also be carried out with serializing instructions (see Section 10.3). These instructions
are typically used at critical procedure or task boundaries to force completion of all previous instructions before a
jump to a new section of code or a context switch occurs. Like the I/O instructions, the processor waits until all
previous instructions have been completed and all buffered writes have been drained to memory before executing
the serializing instruction.


The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store
memory ordering between routines that produce weakly-ordered results and routines that consume that data. The
functions of these instructions are as follows:
• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program
instruction stream, but does not affect load operations.
• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program
instruction stream, but does not affect store operations.1
• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the
program instruction stream.
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory
ordering than the CPUID instruction.
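
For example, a producer that writes data with weakly-ordered non-temporal stores can execute SFENCE before
publishing a flag; _data and _ready below are hypothetical locations used only for illustration:

MOVNTI [_data], EBX        ; non-temporal store; weakly ordered
SFENCE                     ; all earlier stores, including the MOVNTI, become
                           ; globally visible before any later store
MOV DWORD PTR [_ready], 1  ; consumers that observe _ready = 1 also observe _data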
The MTRRs were introduced in the P6 family processors to define the cache characteristics for specified areas of
physical memory. The following are two examples of how memory types set up with MTRRs can be used to strengthen
or weaken memory ordering for the Pentium 4, Intel Xeon, and P6 family processors:
• The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads
and writes to the UC memory region appear on the bus and out-of-order or speculative accesses are not
performed. This memory type can be applied to an address range dedicated to memory mapped I/O devices to
force strong memory ordering.
• For areas of memory where weak ordering is acceptable, the write back (WB) memory type can be chosen.
Here, reads can be performed speculatively and writes can be buffered and combined. For this type of memory,
cache locking is performed on atomic (locked) operations that do not split across cache lines, which helps to
reduce the performance penalty associated with the use of the typical synchronization instructions, such as
XCHG, that lock the bus during the entire read-modify-write operation. With the WB memory type, the XCHG
instruction locks the cache instead of the bus if the memory access is contained within a cache line.
The PAT was introduced in the Pentium III processor to enhance the caching characteristics that can be assigned to
pages or groups of pages. The PAT mechanism is typically used to strengthen caching characteristics at the page level
with respect to the caching characteristics established by the MTRRs. Table 13-7 shows the interaction of the PAT
with the MTRRs.
Intel recommends that software written to run on Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel
Xeon, and P6 family processors assume the processor-ordering model or a weaker memory-ordering model. The
Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a
strong memory-ordering model, except when using the UC memory type. Despite the fact that Pentium 4, Intel
Xeon, and P6 family processors support processor ordering, Intel does not guarantee that future processors will
support this model. To make software portable to future processors, it is recommended that operating systems
provide critical region and resource control constructs and APIs (application programming interfaces) based on I/O,
locking, and/or serializing instructions to synchronize access to shared areas of memory in multiple-
processor systems. Also, software should not depend on processor ordering in situations where the system hard-
ware does not support this memory-ordering model.

10.3 SERIALIZING INSTRUCTIONS


The Intel 64 and IA-32 architectures define several serializing instructions. These instructions force the
processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all
buffered writes to memory before the next instruction is fetched and executed. For example, when a MOV to control
register instruction is used to load a new value into control register CR0 to enable protected mode, the processor
must perform a serializing operation before it enters protected mode. This serializing operation ensures that all
operations that were started while the processor was in real-address mode are completed before the switch to
protected mode is made.

1. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution
until LFENCE completes. As a result, an instruction that loads from memory and that precedes an LFENCE receives data from memory
prior to completion of the LFENCE. An LFENCE that follows an instruction that stores to memory might complete before the data
being stored have become globally visible. Instructions following an LFENCE may be fetched from memory before the LFENCE, but
they will not execute until the LFENCE completes.
The concept of serializing instructions was introduced into the IA-32 architecture with the Pentium processor to
support parallel instruction execution. Serializing instructions have no meaning for the Intel486 and earlier proces-
sors that do not implement parallel instruction execution.
It is important to note that execution of serializing instructions on P6 and more recent processor families constrains
speculative execution because the results of speculatively executed instructions are discarded. The following
instructions are serializing instructions:
• Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to
control register, with the exception of MOV CR81), MOV (to debug register), WBINVD, and WRMSR2.
• Non-privileged serializing instructions — CPUID, IRET, RSM, and SERIALIZE.
When the processor serializes instruction execution, it ensures that all pending memory transactions are
completed (including writes stored in its store buffer) before it executes the next instruction. Nothing can pass a
serializing instruction and a serializing instruction cannot pass any other instruction (read, write, instruction fetch,
or I/O). For example, CPUID can be executed at any privilege level to serialize instruction execution with no effect
on program flow, except that the EAX, EBX, ECX, and EDX registers are modified.
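
One common use of this property is forcing completion of all prior instructions before taking a measurement. The
following minimal sketch is illustrative only; note that CPUID clobbers EAX, EBX, ECX, and EDX:

XOR EAX, EAX   ; select a cheap leaf; any CPUID execution serializes
CPUID          ; all prior instructions complete and buffered writes drain
RDTSC          ; the time-stamp read cannot begin before CPUID completes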
The following instructions are memory-ordering instructions, not serializing instructions. These drain the data
memory subsystem. They do not serialize the instruction execution stream:3
• Non-privileged memory-ordering instructions — SFENCE, LFENCE, and MFENCE.
The SFENCE, LFENCE, and MFENCE instructions provide more granularity in controlling the serialization of memory
loads and stores (see Section 10.2.5, “Strengthening or Weakening the Memory-Ordering Model”).
The following additional information is worth noting regarding serializing instructions:
• The processor does not write back the contents of modified data in its data cache to external memory when it
serializes instruction execution. Software can force modified data to be written back by executing the WBINVD
instruction, which is a serializing instruction. The amount of time or cycles for WBINVD to complete will vary
due to the size of different cache hierarchies and other factors. As a consequence, the use of the WBINVD
instruction can have an impact on interrupt/event response time.
• When an instruction is executed that enables or disables paging (that is, changes the PG flag in control register
CR0), the instruction should be followed by a jump instruction. The target instruction of the jump instruction is
fetched with the new setting of the PG flag (that is, paging is enabled or disabled), but the jump instruction
itself is fetched with the previous setting. The Pentium 4, Intel Xeon, and P6 family processors do not require
the jump operation following the move to register CR0 (because any use of the MOV instruction in a Pentium 4,
Intel Xeon, or P6 family processor to write to CR0 is completely serializing). However, to maintain backwards
and forward compatibility with code written to run on other IA-32 processors, it is recommended that the jump
operation be performed (a minimal sketch follows this list).
• Whenever an instruction is executed to change the contents of CR3 while paging is enabled, the next
instruction is fetched using the translation tables that correspond to the new value of CR3. Therefore the next
instruction and the sequentially following instructions should have a mapping based upon the new value of
CR3. (Global entries in the TLBs are not invalidated, see Section 5.10.4, “Invalidation of TLBs and Paging-
Structure Caches.”)
• The Pentium processor and more recent processor families use branch-prediction techniques to improve
performance by prefetching the destination of a branch instruction before the branch instruction is executed.
Consequently, instruction execution is not deterministically serialized when a branch instruction is executed.
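
A minimal sketch of the recommended paging-enable sequence from the second bullet above (Flush is a
hypothetical label):

MOV EAX, CR0
OR EAX, 80000000H   ; set the PG flag (bit 31)
MOV CR0, EAX        ; enable paging; fully serializing on P6 and later processors
JMP Flush           ; the JMP itself is fetched with the previous PG setting
Flush:              ; instructions from here on are fetched with paging enabled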

1. MOV CR8 is not defined architecturally as a serializing instruction.


2. An execution of WRMSR to any non-serializing MSR is not serializing. Non-serializing MSRs include the following: IA32_SPEC_CTRL
MSR (MSR index 48H), IA32_PRED_CMD MSR (MSR index 49H), IA32_TSX_CTRL MSR (MSR index 122H), IA32_TSC_DEADLINE MSR
(MSR index 6E0H), IA32_PKRS MSR (MSR index 6E1H), IA32_HWP_REQUEST MSR (MSR index 774H), or any of the x2APIC MSRs
(MSR indices 802H to 83FH).
3. LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally,
and no later instruction begins execution until LFENCE completes.


10.4 MULTIPLE-PROCESSOR (MP) INITIALIZATION


The IA-32 architecture (beginning with the P6 family processors) defines a multiple-processor (MP) initialization
protocol called the Multiprocessor Specification Version 1.4. This specification defines the boot protocol to be used
by IA-32 processors in multiple-processor systems. (Here, multiple processors is defined as two or more proces-
sors.) The MP initialization protocol has the following important features:
• It supports controlled booting of multiple processors without requiring dedicated system hardware.
• It allows hardware to initiate the booting of a system without the need for a dedicated signal or a predefined
boot processor.
• It allows all IA-32 processors to be booted in the same manner, including those supporting Intel Hyper-
Threading Technology.
• The MP initialization protocol also applies to MP systems using Intel 64 processors.
The mechanism for carrying out the MP initialization protocol differs depending on the Intel processor generations.
The following bullets summarize the evolution of the changes:
• For P6 family or older processors supporting MP operations— The selection of the BSP and APs (see
Section 10.4.1, “BSP and AP Processors”) is handled through arbitration on the APIC bus, using BIPI and FIPI
messages. These processor generations have CPUID signatures of (family=06H, extended_model=0,
model<=0DH), or family <06H. See Section 10.11.1, “Overview of the MP Initialization Process for P6 Family
Processors,” for a complete discussion of MP initialization for P6 family processors.
• Early generations of IA processors with family 0FH — The selection of the BSP and APs (see Section
10.4.1, “BSP and AP Processors”) is handled through arbitration on the system bus, using BIPI and FIPI
messages (see Section 10.4.3, “MP Initialization Protocol Algorithm for MP Systems”). These processor
generations have CPUID signatures of family=0FH, model=0H, stepping<=09H.
• Later generations of IA processors with family 0FH, and IA processors with system bus — The
selection of the BSP and APs is handled through a special system bus cycle, without using BIPI and FIPI
message arbitration (see Section 10.4.3, “MP Initialization Protocol Algorithm for MP Systems”). These
processor generations have CPUID signatures of family=0FH with (model=0H, stepping>=0AH) or (model >0,
all steppings); or family=06H, extended_model=0, model>=0EH.
• All other modern IA processor generations supporting MP operations— The selection of the BSP and
APs in the system is handled by platform-specific arrangement of the combination of hardware, BIOS, and/or
configuration input options. The basis of the selection mechanism is similar to those of the Later generations of
family 0FH and other Intel processor using system bus (see Section 10.4.3, “MP Initialization Protocol
Algorithm for MP Systems”). These processor generations have CPUID signatures of family=06H,
extended_model>0.
The family, model, and stepping ID for a processor is given in the EAX register when the CPUID instruction is
executed with a value of 1 in the EAX register.

10.4.1 BSP and AP Processors


The MP initialization protocol defines two classes of processors: the bootstrap processor (BSP) and the application
processors (APs). Following a power-up or RESET of an MP system, system hardware dynamically selects one of the
processors on the system bus as the BSP. The remaining processors are designated as APs.
As part of the BSP selection mechanism, the BSP flag is set in the IA32_APIC_BASE MSR (see Figure 12-5) of the
BSP, indicating that it is the BSP. This flag is cleared for all other processors.
The BSP executes the BIOS’s boot-strap code to configure the APIC environment, set up system-wide data struc-
tures, and start and initialize the APs. When the BSP and APs are initialized, the BSP then begins executing the
operating-system initialization code.
Following a power-up or reset, the APs complete a minimal self-configuration, then wait for a startup signal (a SIPI
message) from the BSP processor. Upon receiving a SIPI message, an AP executes the BIOS AP configuration code,
which ends with the AP being placed in halt state.
For Intel 64 and IA-32 processors supporting Intel Hyper-Threading Technology, the MP initialization protocol treats
each of the logical processors on the system bus or coherent link domain as a separate processor (with a unique
APIC ID). During boot-up, one of the logical processors is selected as the BSP and the remainder of the logical
processors are designated as APs.

10.4.2 MP Initialization Protocol Requirements and Restrictions


The MP initialization protocol imposes the following requirements and restrictions on the system:
• The MP protocol is executed only after a power-up or RESET. If the MP protocol has completed and a BSP is
chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be
repeated. Instead, each logical processor examines its BSP flag (in the IA32_APIC_BASE MSR) to determine
whether it should execute the BIOS boot-strap code (if it is the BSP) or enter a wait-for-SIPI state (if it is an
AP).
• All devices in the system that are capable of delivering interrupts to the processors must be inhibited from
doing so for the duration of the MP initialization protocol. The time during which interrupts must be inhibited
includes the window between when the BSP issues an INIT-SIPI-SIPI sequence to an AP and when the AP
responds to the last SIPI in the sequence.

10.4.3 MP Initialization Protocol Algorithm for MP Systems


Following a power-up or RESET of an MP system, the processors in the system execute the MP initialization protocol
algorithm to initialize each of the logical processors on the system bus or coherent link domain. In the course of
executing this algorithm, the following boot-up and initialization operations are carried out:
1. Each logical processor is assigned a unique APIC ID, based on system topology. The unique ID is a 32-bit value
if the processor supports CPUID leaf 0BH; otherwise the unique ID is an 8-bit value (see Section 10.4.5,
“Identifying Logical Processors in an MP System”).
2. Each logical processor is assigned a unique arbitration priority based on its APIC ID.
3. Each logical processor executes its internal BIST simultaneously with the other logical processors in the
system.
4. Upon completion of the BIST, the logical processors use a hardware-defined selection mechanism to select the
BSP and the APs from the available logical processors on the system bus. The BSP selection mechanism differs
depending on the family, model, and stepping IDs of the processors, as follows:
— Later generations of IA processors within family 0FH (see Section 10.4), IA processors with system bus
(family=06H, extended_model=0, model>=0EH), or all other modern Intel processors (family=06H,
extended_model>0):
• The logical processors begin monitoring the BNR# signal, which is toggling. When the BNR# pin stops
toggling, each processor attempts to issue a NOP special cycle on the system bus.
• The logical processor with the highest arbitration priority succeeds in issuing a NOP special cycle and is
nominated the BSP. This processor sets the BSP flag in its IA32_APIC_BASE MSR, then fetches and
begins executing BIOS boot-strap code, beginning at the reset vector (physical address FFFF FFF0H).
• The remaining logical processors (that failed in issuing a NOP special cycle) are designated as APs. They
leave their BSP flags in the clear state and enter a “wait-for-SIPI state.”
— Early generations of IA processors within family 0FH (family=0FH, model=0H, stepping<=09H), P6 family
or older processors supporting MP operations (family=06H, extended_model=0, model<=0DH; or family
<06H):
• Each processor broadcasts a BIPI to “all including self.” The first processor that broadcasts a BIPI (and
thus receives its own BIPI vector), selects itself as the BSP and sets the BSP flag in its IA32_APIC_BASE
MSR. (See Section 10.11.1, “Overview of the MP Initialization Process for P6 Family Processors,” for a
description of the BIPI, FIPI, and SIPI messages.)
• The remainder of the processors (which were not selected as the BSP) are designated as APs. They
leave their BSP flags in the clear state and enter a “wait-for-SIPI state.”
• The newly established BSP broadcasts an FIPI message to “all including self,” which the BSP and APs
treat as an end of MP initialization signal. Only the processor with its BSP flag set responds to the FIPI
message. It responds by fetching and executing the BIOS boot-strap code, beginning at the reset vector
(physical address FFFF FFF0H).
5. As part of the boot-strap code, the BSP creates an ACPI table and/or an MP table and adds its initial APIC ID to
these tables as appropriate.
6. At the end of the boot-strap procedure, the BSP sets a processor counter to 1, then broadcasts a SIPI message
to all the APs in the system. Here, the SIPI message contains a vector to the BIOS AP initialization code (at
000VV000H, where VV is the vector contained in the SIPI message).
7. The first action of the AP initialization code is to set up a race (among the APs) to a BIOS initialization
semaphore. The first AP to the semaphore begins executing the initialization code. (See Section 10.4.4, “MP
Initialization Example,” for semaphore implementation details.) As part of the AP initialization procedure, the
AP adds its APIC ID number to the ACPI and/or MP tables as appropriate and increments the processor counter
by 1. At the completion of the initialization procedure, the AP executes a CLI instruction and halts itself.
8. When each of the APs has gained access to the semaphore and executed the AP initialization code, the BSP
establishes a count for the number of processors connected to the system bus, completes executing the BIOS
boot-strap code, and then begins executing operating-system boot-strap and start-up code.
9. While the BSP is executing operating-system boot-strap and start-up code, the APs remain in the halted state.
In this state they will respond only to INITs, NMIs, and SMIs. They will also respond to snoops and to assertions
of the STPCLK# pin.
The following section gives an example (with code) of the MP initialization protocol for multiple processors oper-
ating in an MP configuration.
Chapter 2, “Model-Specific Registers (MSRs)‚” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 4, describes how to program the LINT[0:1] pins of the processor’s local APICs after an MP config-
uration has been completed.

10.4.4 MP Initialization Example


The following example illustrates the use of the MP initialization protocol to initialize processors in an MP
system after the BSP and APs have been established. The code runs on Intel 64 or IA-32 processors that use this
protocol, including P6 family processors, Pentium 4 processors, Intel Core Duo, Intel Core 2 Duo, and Intel Xeon
processors.
The following constants and data definitions are used in the accompanying
code examples. They are based on the addresses of the APIC registers defined in Table 12-1.

ICR_LOW EQU 0FEE00300H
SVR EQU 0FEE000F0H
APIC_ID EQU 0FEE00020H
LVT3 EQU 0FEE00370H
APIC_ENABLED EQU 0100H
BOOT_ID DD ?
COUNT EQU 00H
VACANT EQU 00H

10.4.4.1 Typical BSP Initialization Sequence


After the BSP and APs have been selected (by means of a hardware protocol, see Section 10.4.3, “MP Initialization
Protocol Algorithm for MP Systems”), the BSP begins executing BIOS boot-strap code (POST) at the normal IA-32
architecture starting address (FFFF FFF0H). The boot-strap code typically performs the following operations:
1. Initializes memory.
2. Loads the microcode update into the processor.
3. Initializes the MTRRs.
4. Enables the caches.


5. Executes the CPUID instruction with a value of 0H in the EAX register, then reads the EBX, ECX, and EDX
registers to determine if the BSP is “GenuineIntel.”
6. Executes the CPUID instruction with a value of 1H in the EAX register, then saves the values in the EAX, ECX,
and EDX registers in a system configuration space in RAM for use later.
7. Loads start-up code for the AP to execute into a 4-KByte page in the lower 1 MByte of memory.
8. Switches to protected mode and ensures that the APIC address space is mapped to the strong uncacheable
(UC) memory type.
9. Determines the BSP’s APIC ID from the local APIC ID register (default is 0). The code snippet below is an
example that applies to logical processors in a system whose local APIC units operate in xAPIC mode, where APIC
registers are accessed using the memory-mapped interface:

MOV ESI, APIC_ID; Address of local APIC ID register
MOV EAX, [ESI];
AND EAX, 0FF000000H; Zero out all other bits except APIC ID
MOV BOOT_ID, EAX; Save in memory
Saves the APIC ID in the ACPI and/or MP tables and optionally in the system configuration space in RAM.
10. Converts the base address of the 4-KByte page for the AP’s bootup code into an 8-bit vector. The 8-bit vector
defines the address of a 4-KByte page in the real-address mode address space (1-MByte space). For example,
a vector of 0BDH specifies a start-up memory address of 000BD000H.
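A minimal sketch of this conversion, assuming the 4-KByte-aligned physical base address of the start-up code is in
EAX:

MOV EAX, 000BD000H  ; example 4-KByte-aligned start-up address
SHR EAX, 12         ; discard the low 12 bits; AL now holds the 8-bit vector (0BDH)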
11. Enables the local APIC by setting bit 8 of the APIC spurious vector register (SVR).

MOV ESI, SVR; Address of SVR
MOV EAX, [ESI];
OR EAX, APIC_ENABLED; Set bit 8 to enable (0 on reset)
MOV [ESI], EAX;
12. Sets up the LVT error handling entry by establishing an 8-bit vector for the APIC error handler.

MOV ESI, LVT3;
MOV EAX, [ESI];
AND EAX, 0FFFFFF00H; Clear out previous vector.
OR EAX, 000000xxH; xx is the 8-bit vector for the APIC error handler.
MOV [ESI], EAX;
13. Initializes the Lock Semaphore variable VACANT to 00H. The APs use this semaphore to determine the order in
which they execute BIOS AP initialization code.
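A minimal sketch of how an AP might acquire and release this semaphore, assuming a hypothetical byte variable
SEMAPHORE initialized to VACANT:

MOV AL, 01H             ; value marking the semaphore as owned
SpinSem:
XCHG AL, SEMAPHORE      ; atomic swap; the XCHG is implicitly locked
CMP AL, VACANT          ; previous value 00H means this AP now owns the semaphore
JNE SpinSem             ; otherwise another AP owns it; retry
; ... the AP executes its BIOS initialization code here ...
MOV SEMAPHORE, VACANT   ; release the semaphore for the next AP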
14. Performs the following operation to set up the BSP to detect the presence of APs in the system and the number
of processors (within a finite duration, minimally 100 milliseconds):
— Sets the value of the COUNT variable to 1.
— In the AP BIOS initialization code, the AP will increment the COUNT variable to indicate its presence. The
finite duration while waiting for the COUNT to be updated can be accomplished with a timer. When the timer
expires, the BSP checks the value of the COUNT variable. If the timer expires and the COUNT variable has
not been incremented, no APs are present or some error has occurred.
15. Broadcasts an INIT-SIPI-SIPI IPI sequence to the APs to wake them up and initialize them. Alternatively,
following a power-up or RESET, since all APs are already in the “wait-for-SIPI state,” the BSP can broadcast just
a single SIPI IPI to the APs to wake them up and initialize them. If software knows how many logical processors
it expects to wake up, it may choose to poll the COUNT variable. If the expected processors show up before the
100 millisecond timer expires, the timer can be canceled and execution can proceed to step 16.
The left-hand-side of the procedure illustrated in Table 10-1 provides an algorithm when the expected
processor count is unknown. The right-hand-side of Table 10-1 can be used when the expected processor count
is known.


Table 10-1. Broadcast INIT-SIPI-SIPI Sequence and Choice of Timeouts

INIT-SIPI-SIPI when the expected processor count is unknown:

MOV ESI, ICR_LOW; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H; Load ICR encoding for broadcast INIT IPI to all APs into EAX.
MOV [ESI], EAX; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH; Load ICR encoding for broadcast SIPI IPI to all APs into EAX,
; where xx is the vector computed in step 10.
MOV [ESI], EAX; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop.
MOV [ESI], EAX; Broadcast second SIPI IPI to all APs
; Wait for the timer interrupt until the timer expires, then go to step 16.

INIT-SIPI-SIPI when the expected processor count is known:

MOV ESI, ICR_LOW; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H; Load ICR encoding for broadcast INIT IPI to all APs into EAX.
MOV [ESI], EAX; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH; Load ICR encoding for broadcast SIPI IPI to all APs into EAX,
; where xx is the vector computed in step 10.
MOV [ESI], EAX; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop with check to see if COUNT has reached the expected
; processor count. If COUNT reaches the expected processor count, cancel the timer
; and go to step 16.
MOV [ESI], EAX; Broadcast second SIPI IPI to all APs
; Wait for the timer interrupt while polling COUNT. If COUNT reaches the expected
; processor count, cancel the timer and go to step 16. If the timer expires, go to
; step 16.

16. Reads and evaluates the COUNT variable and establishes a processor count.
17. If necessary, reconfigures the APIC and continues with the remaining system diagnostics as appropriate.

10.4.4.2 Typical AP Initialization Sequence


When an AP receives the SIPI, it begins executing BIOS AP initialization code at the vector encoded in the SIPI. The
AP initialization code typically performs the following operations:
1. Waits on the BIOS initialization Lock Semaphore. When control of the semaphore is attained, initialization
continues.
2. Loads the microcode update into the processor.
3. Initializes the MTRRs (using the same mapping that was used for the BSP).
4. Enables the cache.
5. Executes the CPUID instruction with a value of 0H in the EAX register, then reads the EBX, ECX, and EDX
registers to determine if the AP is “GenuineIntel.”
6. Executes the CPUID instruction with a value of 1H in the EAX register, then saves the values in the EAX, ECX,
and EDX registers in a system configuration space in RAM for use later.
7. Switches to protected mode and ensures that the APIC address space is mapped to the strong uncacheable
(UC) memory type.
8. Determines the AP’s APIC ID from the local APIC ID register, and adds it to the MP and ACPI tables and
optionally to the system configuration space in RAM.
9. Initializes and configures the local APIC by setting bit 8 in the SVR register and setting up the LVT3 (error LVT)
for error handling (as described in steps 11 and 12 in Section 10.4.4.1, “Typical BSP Initialization Sequence”).
10. Configures the APs SMI execution environment. (Each AP and the BSP must have a different SMBASE address.)
11. Increments the COUNT variable by 1.
12. Releases the semaphore.
13. Executes one of the following:


— the CLI and HLT instructions (if MONITOR/MWAIT is not supported), or


— the CLI, MONITOR, and MWAIT sequence to enter a deep C-state (a minimal sketch follows this list).
14. Waits for an INIT IPI.
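
A minimal sketch of the MONITOR/MWAIT variant from step 13, assuming a hypothetical WaitFlag location that the
OS later writes to wake the AP; the MWAIT hint value selecting a C-state is implementation-specific:

CLI                       ; mask interrupts
MOV EAX, OFFSET WaitFlag  ; linear address of the location to monitor
XOR ECX, ECX              ; no MONITOR extensions
XOR EDX, EDX              ; no MONITOR hints
MONITOR                   ; arm address-range monitoring on [EAX]
MOV EAX, 50H              ; hypothetical hint requesting a deep C-state
MOV ECX, 1                ; bit 0: treat masked interrupts as break events
MWAIT                     ; wait until the monitored line is written or an event occurs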

10.4.5 Identifying Logical Processors in an MP System


After the BIOS has completed the MP initialization protocol, each logical processor can be uniquely identified by its
local APIC ID. Software can access these APIC IDs in either of the following ways:
• Read APIC ID for a local APIC — Code running on a logical processor can read the APIC ID in one of two ways,
depending on whether the local APIC unit is operating in x2APIC mode or in xAPIC mode (a sketch of both reads
follows this list):
— If the local APIC unit supports x2APIC and is operating in x2APIC mode, the 32-bit APIC ID can be read by
executing a RDMSR instruction to read the processor’s x2APIC ID register. This method is equivalent to
executing CPUID leaf 0BH described below.
— If the local APIC unit is operating in xAPIC mode, the 8-bit APIC ID can be read by executing a MOV instruction
to read the processor’s local APIC ID register (see Section 12.4.6, “Local APIC ID”). This is the ID to use for
directing physical destination mode interrupts to the processor.
• Read ACPI or MP table — As part of the MP initialization protocol, the BIOS creates an ACPI table and an MP
table. These tables are defined in the Multiprocessor Specification Version 1.4 and provide software with a list
of the processors in the system and their local APIC IDs. The format of the ACPI table is derived from the ACPI
specification, which is an industry standard power management and platform configuration specification for MP
systems.
• Read Initial APIC ID (If the processor does not support CPUID leaf 0BH) — An APIC ID is assigned to a logical
processor during power up. This is the initial APIC ID reported by CPUID.1:EBX[31:24] and may be different
from the current value read from the local APIC. The initial APIC ID can be used to determine the topological
relationship between logical processors for multi-processor systems that do not support CPUID leaf 0BH.
Bits in the 8-bit initial APIC ID can be interpreted using several bit masks. Each bit mask can be used to extract
an identifier to represent a hierarchical domain of the multi-threading resource topology in an MP system (See
Section 10.9.1, “Hierarchical Mapping of Shared Resources”). The initial APIC ID may consist of up to four bit-
fields. In a non-clustered MP system, the field consists of up to three bit fields.
• Read 32-bit APIC ID from CPUID leaf 0BH (If the processor supports CPUID leaf 0BH) — A unique APIC ID
is assigned to a logical processor during power up. This APIC ID is reported by CPUID.0BH:EDX[31:0] as a 32-
bit value. Use the 32-bit APIC ID and CPUID leaf 0BH to determine the topological relationship between logical
processors if the processor supports CPUID leaf 0BH.
Bits in the 32-bit x2APIC ID can be extracted into sub-fields using CPUID leaf 0BH parameters. (See Section
10.9.1, “Hierarchical Mapping of Shared Resources”).
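
A minimal sketch of both reads from the first bullet above; the IA32_X2APIC_APICID MSR index is 802H, and
APIC_ID is the memory-mapped register address defined in Section 10.4.4:

; x2APIC mode: read the 32-bit APIC ID with RDMSR.
MOV ECX, 802H     ; IA32_X2APIC_APICID MSR index
RDMSR             ; EAX = 32-bit x2APIC ID
; xAPIC mode: read the 8-bit APIC ID from the memory-mapped register.
MOV ESI, APIC_ID  ; 0FEE00020H
MOV EAX, [ESI]
SHR EAX, 24       ; the APIC ID is in bits 31:24 of the register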
Figure 10-2 shows two examples of APIC ID bit fields in earlier single-core processors. In single-core Intel Xeon
processors, the APIC ID assigned to a logical processor during power-up and initialization is 8 bits. Bits 2:1 form a
2-bit physical package identifier (which can also be thought of as a socket identifier). In systems that configure
physical processors in clusters, bits 4:3 form a 2-bit cluster ID. Bit 0 is used in the Intel Xeon processor MP to iden-
tify the two logical processors within the package (see Section 10.9.3, “Hierarchical ID of Logical Processors in an
MP System”). For Intel Xeon processors that do not support Intel Hyper-Threading Technology, bit 0 is always set
to 0; for Intel Xeon processors supporting Intel Hyper-Threading Technology, bit 0 performs the same function as
it does for Intel Xeon processor MP.
For more recent multi-core processors, see Section 10.9.1, “Hierarchical Mapping of Shared Resources,” for a
complete description of the topological relationships between logical processors and bit field locations within an
initial APIC ID across Intel 64 and IA-32 processor families.
Note that the number of bit fields and the width of bit fields are dependent on processor and platform hardware capa-
bilities. Software should determine these at runtime. When initial APIC IDs are assigned to logical processors, the
value of the APIC ID assigned to a logical processor will respect the bit-field boundaries corresponding to core, physical
package, etc. Additional examples of the bit fields in the initial APIC ID of multi-threading capable systems are
shown in Section 10.9.


[Figure 10-2. Interpretation of APIC ID in Early MP Systems. Two formats are shown: the APIC ID format for Intel
Xeon processors that do not support Intel Hyper-Threading Technology (bit 0 = 0, bits 2:1 = processor ID, bits 4:3 =
cluster ID, bits 7:5 reserved) and the APIC ID format for P6 family processors (bits 1:0 = processor ID, bits 3:2 =
cluster ID, bits 7:4 reserved).]

For P6 family processors, the APIC ID that is assigned to a processor during power-up and initialization is 4 bits
(see Figure 10-2). Here, bits 0 and 1 form a 2-bit processor (or socket) identifier and bits 2 and 3 form a 2-bit
cluster ID.

10.5 INTEL® HYPER-THREADING TECHNOLOGY AND INTEL® MULTI-CORE TECHNOLOGY
Intel Hyper-Threading Technology and Intel multi-core technology are extensions to Intel 64 and IA-32 architec-
tures that enable a single physical processor to execute two or more separate code streams (called threads)
concurrently. In Intel Hyper-Threading Technology, a single processor core provides two logical processors that
share execution resources (see Section 10.7, “Intel® Hyper-Threading Technology Architecture”). In Intel multi-
core technology, a physical processor package provides two or more processor cores. Both configurations require
chipsets and a BIOS that support the technologies.
Software should not rely on processor names to determine whether a processor supports Intel Hyper-Threading
Technology or Intel multi-core technology. Use the CPUID instruction to determine processor capability (see
Section 10.6.2, “Initializing Multi-Core Processors”).

10.6 DETECTING HARDWARE MULTI-THREADING SUPPORT AND TOPOLOGY


Use the CPUID instruction to detect the presence of hardware multi-threading support in a physical processor.
Hardware multi-threading can support several varieties of multi-core and/or Intel Hyper-Threading Technology.
The CPUID instruction provides several sets of parameters to aid software in enumerating topology informa-
tion. The relevant topology enumeration parameters provided by CPUID include:
• Hardware Multi-Threading feature flag (CPUID.1:EDX[28] = 1) — Indicates when set that the physical
package is capable of supporting Intel Hyper-Threading Technology and/or multiple cores.
• Processor topology enumeration parameters for 8-bit APIC ID:
— Addressable IDs for Logical processors in the same Package (CPUID.1:EBX[23:16]) — Indicates
the maximum number of addressable ID for logical processors in a physical package. Within a physical
package, there may be addressable IDs that are not occupied by any logical processors. This parameter
does not represent the hardware capability of the physical processor.1


• Addressable IDs for processor cores in the same Package1 (CPUID.(EAX=4, ECX=02):EAX[31:26] +
1 = Y) — Indicates the maximum number of addressable IDs attributable to processor cores (Y) in the physical
package.
• Extended Processor Topology Enumeration parameters for 32-bit APIC ID: Intel 64 processors
supporting CPUID leaf 0BH will assign unique APIC IDs to each logical processor in the system. CPUID leaf 0BH
reports the 32-bit APIC ID and provides topology enumeration parameters. See CPUID instruction reference
pages in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
The CPUID feature flag may indicate support for hardware multi-threading when only one logical processor is avail-
able in the package. In this case, the decimal value represented by bits 16 through 23 in the EBX register will have
a value of 1.
Software should note that the number of logical processors enabled by system software may be less than the value
of “Addressable IDs for Logical processors”. Similarly, the number of cores enabled by system software may be less
than the value of “Addressable IDs for processor cores”.
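
For example, the “Addressable IDs for Logical processors” value can be extracted with a sketch like the following
(register usage is illustrative):

MOV EAX, 1      ; CPUID leaf 1
CPUID
SHR EBX, 16
AND EBX, 0FFH   ; EBX = maximum number of addressable logical-processor IDs
                ; in the physical package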
Software can detect the availability of the CPUID extended topology enumeration leaf (0BH) by performing two
steps:
• Check maximum input value for basic CPUID information by executing CPUID with EAX= 0. If CPUID.0H:EAX is
greater than or equal to 11 (0BH), then proceed to the next step.
• Check CPUID.EAX=0BH, ECX=0H:EBX is non-zero.
If both of the above conditions are true, the extended topology enumeration leaf is available. Note that the presence of
CPUID leaf 0BH in a processor does not guarantee that the local APIC supports x2APIC. If
CPUID.(EAX=0BH, ECX=0H):EBX returns zero and maximum input value for basic CPUID information is greater
than 0BH, then CPUID.0BH leaf is not supported on that processor.
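
A minimal sketch of this two-step check (NoLeaf0B is a hypothetical label):

XOR EAX, EAX    ; leaf 0: EAX returns the maximum basic leaf
CPUID
CMP EAX, 0BH
JB NoLeaf0B     ; maximum basic leaf is below 0BH
MOV EAX, 0BH    ; query the extended topology enumeration leaf
XOR ECX, ECX    ; sub-leaf 0
CPUID
TEST EBX, EBX   ; EBX = 0 means leaf 0BH is not implemented
JZ NoLeaf0B
; the extended topology enumeration leaf is available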

10.6.1 Initializing Processors Supporting Intel® Hyper-Threading Technology


The initialization process for an MP system that contains processors supporting Intel Hyper-Threading Technology
is the same as for conventional MP systems (see Section 10.4, “Multiple-Processor (MP) Initialization”). One logical
processor in the system is selected as the BSP and other processors (or logical processors) are designated as APs.
The initialization process is identical to that described in Section 10.4.3, “MP Initialization Protocol Algorithm for MP
Systems,” and Section 10.4.4, “MP Initialization Example.”
During initialization, each logical processor is assigned an APIC ID that is stored in the local APIC ID register for
each logical processor. If two or more processors supporting Intel Hyper-Threading Technology are present, each
logical processor on the system bus is assigned a unique ID (see Section 10.9.3, “Hierarchical ID of Logical Proces-
sors in an MP System”). Once logical processors have APIC IDs, software communicates with them by sending APIC
IPI messages.

10.6.2 Initializing Multi-Core Processors


The initialization process for an MP system that contains multi-core Intel 64 or IA-32 processors is the same as for
conventional MP systems (see Section 10.4, “Multiple-Processor (MP) Initialization”). A logical processor in one
core is selected as the BSP; other logical processors are designated as APs.
During initialization, each logical processor is assigned an APIC ID. Once logical processors have APIC IDs, soft-
ware may communicate with them by sending APIC IPI messages.

1. Operating system and BIOS may implement features that reduce the number of logical processors available in a platform to applica-
tions at runtime to less than the number of physical packages times the number of hardware-capable logical processors per pack-
age.
1. Software must check CPUID for its support of leaf 4 when implementing support for multi-core. If CPUID leaf 4 is not available at
runtime, software should handle the situation as if there is only one core per package.
2. Maximum number of cores in the physical package must be queried by executing CPUID with EAX=4 and a valid ECX input value.
Valid ECX input values start from 0.


10.6.3 Executing Multiple Threads on an Intel® 64 or IA-32 Processor Supporting Hardware Multi-Threading
Upon completing the operating system boot-up procedure, the bootstrap processor (BSP) executes operating
system code. Other logical processors are placed in the halt state. To execute a code stream (thread) on a halted
logical processor, the operating system issues an interprocessor interrupt (IPI) addressed to the halted logical
processor. In response to the IPI, the processor wakes up and begins executing the code identified by the vector
received as part of the IPI.
To manage execution of multiple threads on logical processors, an operating system can use conventional
symmetric multiprocessing (SMP) techniques. For example, the operating-system can use a time-slice or load
balancing mechanism to periodically interrupt each of the active logical processors. Upon interrupting a logical
processor, the operating system checks its run queue for a thread waiting to be executed and dispatches the thread
to the interrupted logical processor.
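
A minimal sketch of waking a halted logical processor with a fixed IPI in xAPIC mode, assuming the target’s 8-bit
APIC ID is in BL and a hypothetical vector 40H selects the dispatch routine (the ICR low dword is at 0FEE00300H and
the high dword at 0FEE00310H):

MOV ESI, 0FEE00310H  ; ICR high dword
MOVZX EAX, BL
SHL EAX, 24          ; destination APIC ID goes in bits 31:24
MOV [ESI], EAX
MOV ESI, 0FEE00300H  ; ICR low dword
MOV EAX, 00004040H   ; fixed delivery mode, physical destination, level asserted, vector 40H
MOV [ESI], EAX       ; writing the low dword sends the IPI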

10.6.4 Handling Interrupts on an IA-32 Processor Supporting Hardware Multi-Threading


Interrupts are handled on processors supporting Intel Hyper-Threading Technology as they are on conventional MP
systems. External interrupts are received by the I/O APIC, which distributes them as interrupt messages to specific
logical processors (see Figure 10-3).
Logical processors can also send IPIs to other logical processors by writing to the ICR register of their local APIC (see
Section 12.6, “Issuing Interprocessor Interrupts”). This also applies to dual-core processors.

[Figure 10-3. Local APICs and I/O APIC in MP System Supporting Intel HT Technology. Two physical processors, each
with two logical processors and per-logical-processor local APICs sharing a processor core and bus interface, exchange
IPIs and receive interrupt messages from the I/O APIC in the system chipset; external interrupts reach the I/O APIC
through the PCI bus and bridge.]

10.7 INTEL® HYPER-THREADING TECHNOLOGY ARCHITECTURE


Figure 10-4 shows a generalized view of an Intel processor supporting Intel Hyper-Threading Technology, using the
original Intel Xeon processor MP as an example. This implementation of the Intel Hyper-Threading Technology
consists of two logical processors (each represented by a separate architectural state) which share the processor’s
execution engine and the bus interface. Each logical processor also has its own advanced programmable interrupt
controller (APIC).

Logical Logical
Processor 0 Processor 1
Architectural Architectural
State State

Execution Engine

Local APIC Local APIC

Bus Interface

System Bus

Figure 10-4. IA-32 Processor with Two Logical Processors Supporting Intel HT Technology

10.7.1 State of the Logical Processors


The following features are part of the architectural state of logical processors within Intel 64 or IA-32 processors
supporting Intel Hyper-Threading Technology. The features can be subdivided into three groups:
• Duplicated for each logical processor
• Shared by logical processors in a physical processor
• Shared or duplicated, depending on the implementation
The following features are duplicated for each logical processor:
• General purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP)
• Segment registers (CS, DS, SS, ES, FS, and GS)
• EFLAGS and EIP registers. Note that the CS and EIP/RIP registers for each logical processor point to the
instruction stream for the thread being executed by the logical processor.
• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction
pointer)
• MMX registers (MM0 through MM7)
• XMM registers (XMM0 through XMM7) and the MXCSR register
• Control registers and system table pointer registers (GDTR, LDTR, IDTR, task register)
• Debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and the debug control MSRs
• Machine check global status (IA32_MCG_STATUS) and machine check capability (IA32_MCG_CAP) MSRs
• Thermal clock modulation and ACPI Power management control MSRs
• Time stamp counter MSRs
• Most of the other MSR registers, including the page attribute table (PAT). See the exceptions below.
• Local APIC registers.
• Additional general purpose registers (R8-R15), XMM registers (XMM8-XMM15), control register, IA32_EFER on
Intel 64 processors.
The following features are shared by logical processors:


• Memory type range registers (MTRRs)


Whether the following features are shared or duplicated is implementation-specific:
• IA32_MISC_ENABLE MSR (MSR address 1A0H)
• Machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs)
• Performance monitoring control and counter MSRs

10.7.2 APIC Functionality


When a processor supporting Intel Hyper-Threading Technology is initialized, each logical processor is
assigned a local APIC ID (see Table 12-1). The local APIC ID serves as an ID for the logical processor and is stored
in the logical processor’s APIC ID register. If two or more processors supporting Intel Hyper-Threading Technology
are present in a dual processor (DP) or MP system, each logical processor on the system bus is assigned a unique
local APIC ID (see Section 10.9.3, “Hierarchical ID of Logical Processors in an MP System”).
Software communicates with logical processors using the APIC’s interprocessor interrupt (IPI) messaging facility.
Setup and programming for APICs is identical in processors that support and do not support Intel Hyper-Threading
Technology. See Chapter 12, “Advanced Programmable Interrupt Controller (APIC),” for a detailed discussion.

10.7.3 Memory Type Range Registers (MTRR)


MTRRs in a processor supporting Intel Hyper-Threading Technology are shared by logical processors. When one
logical processor updates the setting of the MTRRs, settings are automatically shared with the other logical proces-
sors in the same physical package.
The Intel 64 and IA-32 architectures require that all processors in an MP system, including logical processors, use
an identical MTRR memory map. This gives software a consistent view of memory, independent of
the processor on which it is running. See Section 13.11, “Memory Type Range Registers (MTRRs),” for information
on setting up MTRRs.

10.7.4 Page Attribute Table (PAT)


Each logical processor has its own PAT MSR (IA32_PAT). However, as described in Section 13.12, “Page Attribute
Table (PAT),” the PAT MSR settings must be the same for all processors in a system, including the logical proces-
sors.

10.7.5 Machine Check Architecture


In the Intel HT Technology context as implemented by processors based on Intel NetBurst® microarchitecture, all
of the machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs) are
duplicated for each logical processor. This permits logical processors to initialize, configure, query, and handle
machine-check exceptions simultaneously within the same physical processor. The design is compatible with
machine check exception handlers that follow the guidelines given in Chapter 17, “Machine-Check Architecture.”
The IA32_MCG_STATUS MSR is duplicated for each logical processor so that its machine check in progress bit field
(MCIP) can be used to detect recursion on the part of MCA handlers. In addition, the MSR allows each logical
processor to determine that a machine-check exception is in progress independent of the actions of another logical
processor in the same physical package.
Because the logical processors within a physical package are tightly coupled with respect to shared hardware
resources, both logical processors are notified of machine check errors that occur within a given physical processor.
If machine-check exceptions are enabled when a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If machine-check exceptions are disabled, the
logical processors enter the shutdown state and assert the IERR# signal.
When enabling machine-check exceptions, the MCE flag in control register CR4 should be set for each logical
processor.
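As an illustrative aid (not part of the manual’s example set), the following is a minimal ring-0 C sketch of setting
CR4.MCE, assuming a 64-bit environment and a GCC/Clang toolchain; an OS would arrange for every logical
processor to execute it, for example during per-CPU bring-up:

#include <stdint.h>

#define CR4_MCE (1ULL << 6)   /* CR4.MCE is bit 6 */

/* Ring-0 only: read-modify-write CR4 to enable machine-check exceptions on
   the logical processor executing this code. */
static inline void enable_machine_check(void)
{
    uint64_t cr4;
    __asm__ __volatile__("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= CR4_MCE;
    __asm__ __volatile__("mov %0, %%cr4" : : "r"(cr4) : "memory");
}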


On Intel Atom family processors that support Intel Hyper-Threading Technology, the MCA facilities are shared
between all logical processors on the same processor core.

10.7.6 Debug Registers and Extensions


Each logical processor has its own set of debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and its own debug
control MSR. These can be set to control and record debug information for each logical processor independently.
Each logical processor also has its own last branch records (LBR) stack.

10.7.7 Performance Monitoring Counters


Performance counters and their companion control MSRs are shared between the logical processors within a
processor core for processors based on Intel NetBurst microarchitecture. As a result, software must manage the
use of these resources. The performance counter interrupts, events, and precise event monitoring support can be
set up and allocated on a per thread (per logical processor) basis.
See Section 21.6.4, “Performance Monitoring and Intel® Hyper-Threading Technology in Processors Based on Intel
NetBurst® Microarchitecture,” for a discussion of performance monitoring in the Intel Xeon processor MP.
On Intel Atom family processors that support Intel Hyper-Threading Technology, the performance counters (general-
purpose and fixed-function counters) and their companion control MSRs are duplicated for each logical processor.

10.7.8 IA32_MISC_ENABLE MSR


The IA32_MISC_ENABLE MSR (MSR address 1A0H) is generally shared between the logical processors in a
processor core supporting Intel Hyper-Threading Technology. However, some bit fields within IA32_MISC_ENABLE
MSR may be duplicated per logical processor. The partition of shared or duplicated bit fields within IA32_MISC_EN-
ABLE is implementation dependent. Software should program duplicated fields carefully on all logical processors in
the system to ensure consistent behavior.
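For illustration, a hedged ring-0 C sketch (GCC/Clang inline assembly; not from the manual) of the RDMSR/WRMSR
access an OS would use to program IA32_MISC_ENABLE identically on each logical processor; the MSR address
1A0H is taken from the text above:

#include <stdint.h>

#define IA32_MISC_ENABLE 0x1A0   /* MSR address per the text above */

/* RDMSR/WRMSR take the MSR index in ECX and the 64-bit value in EDX:EAX. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ __volatile__("wrmsr" : : "c"(msr), "a"((uint32_t)val),
                         "d"((uint32_t)(val >> 32)));
}

/* To keep duplicated bit fields consistent, the OS would apply the same
   read-modify-write to IA32_MISC_ENABLE on every logical processor. */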

10.7.9 Memory Ordering


The logical processors in an Intel 64 or IA-32 processor supporting Intel Hyper-Threading Technology obey the
same rules for memory ordering as Intel 64 or IA-32 processors without Intel HT Technology (see Section 10.2,
“Memory Ordering”). Each logical processor uses a processor-ordered memory model that can be further defined
as “write-ordered with store buffer forwarding.” All mechanisms for strengthening or weakening the memory-
ordering model to handle special programming situations apply to each logical processor.

10.7.10 Serializing Instructions


As a general rule, when a logical processor in a processor supporting Intel Hyper-Threading Technology executes a
serializing instruction, only that logical processor is affected by the operation. An exception to this rule is the
execution of the WBINVD, INVD, and WRMSR instructions; and the MOV CR instruction when the state of the CD
flag in control register CR0 is modified. Here, both logical processors are serialized.

10.7.11 Microcode Update Resources


In an Intel processor supporting Intel Hyper-Threading Technology, the microcode update facilities are shared
between the logical processors; either logical processor can initiate an update. Each logical processor has its own
BIOS signature MSR (IA32_BIOS_SIGN_ID at MSR address 8BH). When a logical processor performs an update for
the physical processor, the IA32_BIOS_SIGN_ID MSRs for resident logical processors are updated with identical
information. If logical processors initiate an update simultaneously, the processor core provides the necessary
synchronization needed to ensure that only one update is performed at a time.


NOTE
Some processors (prior to the introduction of Intel 64 Architecture and based on Intel NetBurst
microarchitecture) do not support simultaneous loading of microcode update to the sibling logical
processors in the same core. All other processors support logical processors initiating an update
simultaneously. Intel recommends, as a common approach, that the microcode loader use the sequential
technique described in Section 11.11.6.3.

10.7.12 Self Modifying Code


Intel processors supporting Intel Hyper-Threading Technology support self-modifying code, where data writes
modify instructions cached or currently in flight. They also support cross-modifying code, where on an MP system
writes generated by one processor modify instructions cached or currently in flight on another. See Section 10.1.3,
“Handling Self- and Cross-Modifying Code,” for a description of the requirements for self- and cross-modifying code
in an IA-32 processor.

10.7.13 Implementation-Specific Intel® HT Technology Facilities


The following non-architectural facilities are implementation-specific in IA-32 processors supporting Intel Hyper-
Threading Technology:
• Caches.
• Translation lookaside buffers (TLBs).
• Thermal monitoring facilities.
The Intel Xeon processor MP implementation is described in the following sections.

10.7.13.1 Processor Caches


For processors supporting Intel Hyper-Threading Technology, the caches are shared. Any cache manipulation
instruction that is executed on one logical processor has a global effect on the cache hierarchy of the physical
processor. Note the following:
• WBINVD instruction — The entire cache hierarchy is invalidated after modified data is written back to
memory. All logical processors are stopped from executing until after the write-back and invalidate operation is
completed. A special bus cycle is sent to all caching agents. The amount of time or cycles for WBINVD to
complete will vary due to the size of different cache hierarchies and other factors. As a consequence, the use of
the WBINVD instruction can have an impact on interrupt/event response time.
• INVD instruction — The entire cache hierarchy is invalidated without writing back modified data to memory.
All logical processors are stopped from executing until after the invalidate operation is completed. A special bus
cycle is sent to all caching agents.
• CLFLUSH and CLFLUSHOPT instructions — The specified cache line is invalidated from the cache hierarchy
after any modified data is written back to memory, and a bus cycle is sent to all caching agents, regardless of
which logical processor caused the cache line to be filled (see the sketch following this list).
• CD flag in control register CR0 — Each logical processor has its own CR0 control register, and thus its own
CD flag in CR0. The CD flags for the two logical processors are ORed together, such that when any logical
processor sets its CD flag, the entire cache is nominally disabled.
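As an illustration of the CLFLUSH/CLFLUSHOPT behavior described in the list above, a minimal user-level C sketch
using the _mm_clflush intrinsic; the function name is illustrative:

#include <emmintrin.h>   /* _mm_clflush (SSE2) */

/* Flush the cache line containing 'p' from every level of the shared cache
   hierarchy; the eviction is visible to both logical processors regardless
   of which one originally filled the line. */
void flush_line(const void *p)
{
    _mm_clflush(p);
    /* Depending on the use case, a fence may be needed to order the flush
       with respect to surrounding stores. */
}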

10.7.13.2 Processor Translation Lookaside Buffers (TLBs)


In processors supporting Intel Hyper-Threading Technology, data cache TLBs are shared. The instruction cache TLB
may be duplicated or shared in each logical processor, depending on implementation specifics of different processor
families.
Entries in the TLBs are tagged with an ID that indicates the logical processor that initiated the translation. This tag
applies even for translations that are marked global using the page-global feature for memory paging. See Section
5.10, “Caching Translation Information,” for information about global translations.


When a logical processor performs a TLB invalidation operation, only the TLB entries that are tagged for that logical
processor are guaranteed to be flushed. This protocol applies to all TLB invalidation operations, including writes to
control registers CR3 and CR4 and uses of the INVLPG instruction.
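A minimal ring-0 sketch of one such invalidation, assuming GCC/Clang inline assembly (illustrative, not from the
manual); only entries tagged for the executing logical processor are guaranteed to be flushed:

/* Ring-0 only: invalidate any TLB entry for the page containing
   'linear_addr' on the current logical processor. */
static inline void invlpg(void *linear_addr)
{
    __asm__ __volatile__("invlpg (%0)" : : "r"(linear_addr) : "memory");
}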

10.7.13.3 Thermal Monitor


In a processor that supports Intel Hyper-Threading Technology, logical processors share the catastrophic shutdown
detector and the automatic thermal monitoring mechanism (see Section 16.8, “Thermal Monitoring and Protec-
tion”). Sharing results in the following behavior:
• If the processor’s core temperature rises above the preset catastrophic shutdown temperature, the processor
core halts execution, which causes both logical processors to stop execution.
• When the processor’s core temperature rises above the preset automatic thermal monitor trip temperature, the
frequency of the processor core is automatically modulated, which affects the execution speed of both logical
processors.
For software controlled clock modulation, each logical processor has its own IA32_CLOCK_MODULATION MSR,
allowing clock modulation to be enabled or disabled on a logical processor basis. Typically, if software controlled
clock modulation is going to be used, the feature must be enabled for all the logical processors within a physical
processor and the modulation duty cycle must be set to the same value for each logical processor. If the duty cycle
values differ between the logical processors, the processor clock will be modulated at the highest duty cycle
selected.
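For illustration, a hedged ring-0 C sketch of programming software-controlled clock modulation on a logical
processor. It assumes the rdmsr()/wrmsr() helpers sketched in Section 10.7.8 above, the architectural MSR
address 19AH for IA32_CLOCK_MODULATION, and the legacy encoding (duty-cycle code in bits 3:1, enable bit at
bit 4); consult the MSR listings for the exact encoding on a given processor:

#include <stdint.h>

#define IA32_CLOCK_MODULATION 0x19A   /* assumed architectural MSR address */

extern uint64_t rdmsr(uint32_t msr);           /* as sketched earlier */
extern void wrmsr(uint32_t msr, uint64_t val); /* as sketched earlier */

/* Enable on-demand clock modulation with the given duty-cycle code (1..7,
   legacy 3-bit encoding assumed). The same value must be written on every
   logical processor of the package; otherwise the highest duty cycle
   selected takes effect, per the text above. */
static void set_clock_modulation(uint8_t duty_code)
{
    uint64_t v = rdmsr(IA32_CLOCK_MODULATION);
    v &= ~0x1EULL;                                 /* clear bits 4:1 */
    v |= ((uint64_t)(duty_code & 7) << 1) | (1ULL << 4);
    wrmsr(IA32_CLOCK_MODULATION, v);
}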

10.7.13.4 External Signal Compatibility


This section describes the constraints on external signals received through the pins of a processor supporting Intel
Hyper-Threading Technology and how these signals are shared between its logical processors.
• STPCLK# — A single STPCLK# pin is provided on the physical package of the Intel Xeon processor MP. External
control logic uses this pin for power management within the system. When the STPCLK# signal is asserted, the
processor core transitions to the stop-grant state, where instruction execution is halted but the processor core
continues to respond to snoop transactions. Regardless of whether the logical processors are active or halted
when the STPCLK# signal is asserted, execution is stopped on both logical processors and neither will respond
to interrupts.

In MP systems, the STPCLK# pins on all physical processors are generally tied together. As a result, this signal
affects all the logical processors within the system simultaneously.
• LINT0 and LINT1 pins — A processor supporting Intel Hyper-Threading Technology has only one set of LINT0
and LINT1 pins, which are shared between the logical processors. When one of these pins is asserted, both
logical processors respond unless the pin has been masked in the APIC local vector tables for one or both of the
logical processors.

Typically in MP systems, the LINT0 and LINT1 pins are not used to deliver interrupts to the logical processors.
Instead, all interrupts are delivered to the logical processors through the I/O APIC.
• A20M# pin — On an IA-32 processor, the A20M# pin is typically provided for compatibility with the Intel 286
processor. Asserting this pin causes bit 20 of the physical address to be masked (forced to zero) for all external
bus memory accesses. Processors supporting Intel Hyper-Threading Technology provide one A20M# pin, which
affects the operation of both logical processors within the physical processor.
The functionality of A20M# is used primarily by older operating systems and not used by modern operating
systems. On newer Intel 64 processors, A20M# may be absent.

10.8 MULTI-CORE ARCHITECTURE


This section describes the architecture of Intel 64 and IA-32 processors supporting dual-core and quad-core tech-
nology. The discussion is applicable to the Intel Pentium processor Extreme Edition, Pentium D, Intel Core Duo,
Intel Core 2 Duo, Dual-core Intel Xeon processor, Intel Core 2 Quad processors, and quad-core Intel Xeon proces-
sors. Features vary across different microarchitectures and are detectable using CPUID.


In general, each processor core has dedicated microarchitectural resources identical to a single-processor imple-
mentation of the underlying microarchitecture without hardware multi-threading capability. Each logical processor
in a dual-core processor (whether supporting Intel Hyper-Threading Technology or not) has its own APIC function-
ality, PAT, machine check architecture, debug registers and extensions. Each logical processor handles serialization
instructions or self-modifying code on its own. Memory order is handled the same way as in Intel Hyper-Threading
Technology.
The topology of the cache hierarchy (with respect to whether a given cache level is shared by one or more
processor cores or by all logical processors in the physical package) depends on the processor implementation.
Software must use the deterministic cache parameter leaf of CPUID instruction to discover the cache-sharing
topology between the logical processors in a multi-threading environment.
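For illustration, a minimal C sketch (GCC/Clang <cpuid.h>; not from the manual) that walks the deterministic
cache parameter leaf (CPUID leaf 04H) and reports how many logical processor IDs may share each cache:

#include <stdio.h>
#include <cpuid.h>   /* __cpuid_count */

/* For each cache enumerated by leaf 04H: EAX[4:0] is the cache type
   (0 means no more caches), EAX[7:5] is the cache level, and EAX[25:14]
   is the maximum number of addressable logical processor IDs sharing the
   cache, minus 1. */
void print_cache_sharing(void)
{
    for (unsigned i = 0; ; i++) {
        unsigned eax, ebx, ecx, edx;
        __cpuid_count(4, i, eax, ebx, ecx, edx);
        if ((eax & 0x1F) == 0)       /* cache type 0: enumeration done */
            break;
        printf("L%u cache: shared by up to %u logical processor IDs\n",
               (eax >> 5) & 0x7, ((eax >> 14) & 0xFFF) + 1);
    }
}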

10.8.1 Logical Processor Support


The topological composition of processor cores and logical processors in a multi-core processor can be discovered
using CPUID. Within each processor core, one or more logical processors may be available.
System software must follow the required MP initialization sequence (see Section 10.4, “Multiple-Processor
(MP) Initialization”) to recognize and enable logical processors. At runtime, software can enumerate those logical
processors enabled by system software to identify the topological relationships between these logical processors.
(See Section 10.9.5, “Identifying Topological Relationships in an MP System”).

10.8.2 Memory Type Range Registers (MTRR)


MTRRs are shared between two logical processors sharing a processor core if the physical processor supports Intel
Hyper-Threading Technology. MTRRs are not shared between logical processors located in different cores or different
physical packages.
The Intel 64 and IA-32 architectures require that all logical processors in an MP system use an identical MTRR
memory map. This gives software a consistent view of memory, independent of the processor on which it is
running.
See Section 13.11, “Memory Type Range Registers (MTRRs).”

10.8.3 Performance Monitoring Counters


Performance counters and their companion control MSRs are shared between two logical processors sharing a
processor core if the processor core supports Intel Hyper-Threading Technology and is based on Intel NetBurst
microarchitecture. They are not shared between logical processors in different cores or different physical packages.
As a result, software must manage the use of these resources, based on the topology of performance monitoring
resources. Performance counter interrupts, events, and precise event monitoring support can be set up and allo-
cated on a per thread (per logical processor) basis.
See Section 21.6.4, “Performance Monitoring and Intel® Hyper-Threading Technology in Processors Based on Intel
NetBurst® Microarchitecture.”

10.8.4 IA32_MISC_ENABLE MSR


Some bit fields in IA32_MISC_ENABLE MSR (MSR address 1A0H) may be shared between two logical processors
sharing a processor core, or may be shared between different cores in a physical processor. See Chapter 2, “Model-
Specific Registers (MSRs)‚” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4.

10.8.5 Microcode Update Resources


Microcode update facilities are shared between two logical processors sharing a processor core if the physical
package supports Intel Hyper-Threading Technology. They are not shared between logical processors in different
cores or different physical packages. Either logical processor that has access to the microcode update facility can
initiate an update.
Each logical processor has its own BIOS signature MSR (IA32_BIOS_SIGN_ID at MSR address 8BH). When a logical
processor performs an update for the physical processor, the IA32_BIOS_SIGN_ID MSRs for resident logical
processors are updated with identical information.
All microcode update steps during processor initialization should use the same update data on all cores in all phys-
ical packages of the same stepping. Any subsequent microcode update must apply consistent update data to all
cores in all physical packages of the same stepping. If the processor detects an attempt to load an older microcode
update when a newer microcode update had previously been loaded, it may reject the older update to stay with the
newer update.

NOTE
Some processors (prior to the introduction of Intel 64 Architecture and based on Intel NetBurst
microarchitecture) do not support simultaneous loading of microcode update to the sibling logical
processors in the same core. All other processors support logical processors initiating an update
simultaneously. Intel recommends, as a common approach, that the microcode loader use the sequential
technique described in Section 11.11.6.3.
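As an illustrative aid, a hedged ring-0 C sketch of reading the currently loaded microcode update signature through
IA32_BIOS_SIGN_ID (MSR address 8BH, per the text above). It assumes the rdmsr()/wrmsr() helpers sketched in
Section 10.7.8 and the documented sequence of clearing the MSR, executing CPUID leaf 1, and then reading the
signature from bits 63:32:

#include <stdint.h>
#include <cpuid.h>   /* __cpuid */

#define IA32_BIOS_SIGN_ID 0x8B   /* MSR address per the text above */

extern uint64_t rdmsr(uint32_t msr);           /* as sketched earlier */
extern void wrmsr(uint32_t msr, uint64_t val); /* as sketched earlier */

/* Returns the microcode update signature visible to this logical processor. */
static uint32_t microcode_revision(void)
{
    unsigned a, b, c, d;
    wrmsr(IA32_BIOS_SIGN_ID, 0);
    __cpuid(1, a, b, c, d);   /* causes the signature to be deposited */
    return (uint32_t)(rdmsr(IA32_BIOS_SIGN_ID) >> 32);
}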

10.9 PROGRAMMING CONSIDERATIONS FOR HARDWARE MULTI-THREADING CAPABLE PROCESSORS

In a multi-threading environment, there may be certain hardware resources that are physically shared at some
level of the hardware topology. In multi-processor systems, the bus and memory sub-systems are typically physi-
cally shared between multiple sockets. Within a hardware multi-threading capable processor, certain resources
are provided for each processor core, while other resources may be provided for each logical processor (see
Section 10.7, “Intel® Hyper-Threading Technology Architecture,” and Section 10.8, “Multi-Core Architecture”).
From a software programming perspective, control transfer of processor operation is managed at the granularity of
a logical processor (operating systems dispatch a runnable task by allocating an available logical processor on the
platform). To manage the topology of shared resources in a multi-threading environment, it may be useful for soft-
ware to understand and manage resources that are shared by more than one logical processor.

10.9.1 Hierarchical Mapping of Shared Resources


The APIC_ID value associated with each logical processor in a multi-processor system is unique (see Section 10.6,
“Detecting Hardware Multi-Threading Support and Topology”). This 8-bit or 32-bit value can be decomposed into
sub-fields, where each sub-field corresponds to a hierarchical domain of the topological mapping of hardware
resources.
The decomposition of an APIC_ID may consist of several sub-fields representing the topology within a physical
processor package; the higher-order bits of an APIC ID may also be used by cluster vendors to represent the
topology of cluster nodes of each coherent multiprocessor system:
• Cluster — Some multi-threading environments consist of multiple clusters of multi-processor systems. The
CLUSTER_ID sub-field is usually supported by vendor firmware to distinguish different clusters. For non-
clustered systems, CLUSTER_ID is usually 0 and the system topology is reduced accordingly.
• Package — A physical processor package mates with a socket. A package may contain one or more software
visible die. The PACKAGE_ID sub-field distinguishes different physical packages within a cluster.
• Die — A software-visible chip inside a package. The DIE_ID sub-field distinguishes different die within a
package. If there are no software visible die, the width of this bit field is 0.
• DieGrp — A group of die that share certain resources.
• Tile — A set of cores that share certain resources. The TILE_ID sub-field distinguishes different tiles. If there
are no software visible tiles, the width of this bit field is 0.


• Module — A set of cores that share certain resources. The MODULE_ID sub-field distinguishes different
modules. If there are no software visible modules, the width of this bit field is 0.
• Core — Processor cores may be contained within modules, within tiles, on software-visible die, or appear
directly at the package domain. The CORE_ID sub-field distinguishes processor cores. For a single-core
processor, the width of this bit field is 0.
• Logical Processor — A processor core provides one or more logical processors sharing execution resources.
The LOGICAL_PROCESSOR_ID sub-field distinguishes logical processors in a core. The width of this bit field is
non-zero if a processor core provides more than one logical processor.
The LOGICAL_PROCESSOR_ID and CORE_ID sub-fields are bit-wise contiguous in the APIC_ID field (see
Figure 10-5).

(Figure: the APIC ID, bits X:0. From the least significant bits upward, the sub-fields are
LOGICAL_PROCESSOR_ID, CORE_ID, MODULE_ID, TILE_ID, DIE_ID, PACKAGE_ID, and CLUSTER_ID; bits above
the defined sub-fields are reserved. X = 31 if x2APIC is supported, otherwise X = 7.)

Figure 10-5. Generalized Seven-Domain Interpretation of the APIC ID

If the processor supports CPUID leaf 0BH and leaf 1FH, the 32-bit APIC ID can represent cluster plus several
domains of topology within the physical processor package. The exact number of hierarchical domains within a
physical processor package must be enumerated through CPUID leaf 0BH and leaf 1FH. Common processor fami-
lies may employ a topology similar to that represented by the 8-bit Initial APIC ID. In general, CPUID leaf 0BH and
leaf 1FH can support a topology enumeration algorithm that decomposes a 32-bit APIC ID into more than four sub-
fields (see Figure 10-6).

NOTE
CPUID leaf 0BH and leaf 1FH can have differences in the number of domain types reported (CPUID
leaf 1FH defines additional domain types). If the processor supports CPUID leaf 1FH, usage of this
leaf is preferred over leaf 0BH. CPUID leaf 0BH is available for legacy compatibility going forward.

The width of each sub-field depends on hardware and software configurations. Field widths can be determined at
runtime using the algorithm discussed below (Example 10-16 through Example 10-21).
Figure 10-7 depicts the relationships of three of the hierarchical sub-fields in a hypothetical MP system. The values
of valid APIC_IDs need not be contiguous across package or core boundaries.


(Figure: a physical processor topology of package, hypothetical intermediate domains Q and R, core, and logical
processor, and the corresponding 32-bit APIC ID composition. From the least significant bits upward, the sub-fields
are LOGICAL_PROCESSOR_ID, CORE_ID, R_ID, Q_ID, PACKAGE_ID, and CLUSTER_ID; the remaining high bits
are reserved.)

Figure 10-6. Conceptual Six-Domain Topology and 32-bit APIC ID Composition

10.9.2 Hierarchical Mapping of CPUID Extended Topology Leaf


CPUID leaf 0BH and leaf 1FH provide enumeration parameters for software to identify each hierarchy of the
processor topology in a deterministic manner. Each hierarchical domain of the topology starting from the Logical
Processor domain is represented numerically by a sub-leaf index within the CPUID 0BH leaf and 1FH leaf. Each
domain of the topology is mapped to a sub-field in the APIC ID, following the general relationship depicted in
Figure 10-6. This mechanism allows software to query the exact number of domains within a physical processor
package and the bit-width of each sub-field of x2APIC ID directly. For example,
• Start from sub-leaf index 0 and increment ECX until CPUID.(EAX=0BH or 1FH, ECX=N):ECX[15:8]
returns an invalid “domain type” encoding. The number of domains within the physical processor package is
“N” (excluding PACKAGE). Using Figure 10-6 as an example, CPUID.(EAX=0BH or 1FH, ECX=4):ECX[15:8] will
report 00H, indicating sub leaf 04H is invalid. This is also depicted by a pseudo code example:

Example 10-16. Number of Domains Below the Physical Processor Package

Word NumberOfDomainsBelowPackage = 0;
DWord Subleaf = 0;

EAX = 0BH or 1FH; // query each sub leaf of CPUID leaf 0BH or 1FH; CPUID leaf 1FH is preferred over leaf 0BH if available.
ECX = Subleaf;
CPUID;
while(EBX != 0) // Enumerate until EBX reports 0
{
if(EAX[4:0] != 0) // A Shift Value of 0 indicates this domain does not exist.
// (Such as no SMT_ID, which is required entry at sub-leaf 0.)
{
NumberOfDomainsBelowPackage++;
}
Subleaf++;
EAX = 0BH or 1FH;
ECX = Subleaf;
CPUID;
}
// NumberOfDomainsBelowPackage contains the absolute number of domains that exist below package.
N = Subleaf; // Sub-leaf supplies the number of entries CPUID will return.


• Sub-leaf index 0 (ECX= 0 as input) provides enumeration parameters to extract the LOGICAL_PROCESSOR_ID
sub-field of x2APIC ID. If EAX = 0BH or 1FH, and ECX =0 is specified as input when executing CPUID,
CPUID.(EAX=0BH or 1FH, ECX=0):EAX[4:0] reports a value (a right-shift count) that allows software to extract
part of x2APIC ID to distinguish the next higher topological entities above the LOGICAL_PROCESSOR_ID
domain. This value also corresponds to the bit-width of the sub-field of x2APIC ID corresponding to the hierar-
chical domain with sub-leaf index 0.
• For each subsequent higher sub-leaf index m, CPUID.(EAX=0BH or 1FH, ECX=m):EAX[4:0] reports the right-
shift count that will allow software to extract part of x2APIC ID to distinguish higher-domain topological
entities. This means the right-shift value of sub-leaf m corresponds to the least significant (m+1) sub-fields
of the 32-bit x2APIC ID.

Example 10-17. BitWidth Determination of x2APIC ID Sub-fields

For m = 0, m < N, m ++;


{ cumulative_width[m] = CPUID.(EAX=0BH or 1FH, ECX= m): EAX[4:0]; }
BitWidth[0] = cumulative_width[0];
For m = 1, m < N, m ++;
BitWidth[m] = cumulative_width[m] - cumulative_width[m-1];

NOTE
CPUID leaf 1FH is a preferred superset to leaf 0BH. Leaf 1FH defines additional domain types, and
it must be parsed by an algorithm that can handle the addition of future domain types.

Previously, only the following encodings of hierarchical domain types were defined: 0 (invalid), 1 (logical processor),
and 2 (core). With the additional hierarchical domain types available (see Section 10.9.1, “Hierarchical Mapping of
Shared Resources,” and Figure 10-5, “Generalized Seven-Domain Interpretation of the APIC ID” ) software must
not assume any “domain type” encoding value to be related to any sub-leaf index, except sub-leaf 0.
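To make Example 10-16 and Example 10-17 concrete, the following is a minimal C sketch (GCC/Clang <cpuid.h>;
illustrative, not part of the manual’s example set) that enumerates the sub-leaves, counts the domains that exist
below the package, and derives the per-domain bit widths; pass leaf = 1FH when it is enumerated, otherwise 0BH:

#include <cpuid.h>   /* __cpuid_count */

#define MAX_DOMAINS 16

/* Returns N, the number of valid sub-leaves; bit_width[m] receives the
   width of sub-field m of the x2APIC ID (Example 10-17), and
   *below_package receives the count of domains that actually exist below
   the package (Example 10-16). */
unsigned enumerate_domains(unsigned leaf, unsigned bit_width[MAX_DOMAINS],
                           unsigned *below_package)
{
    unsigned shift[MAX_DOMAINS];
    unsigned n = 0;

    *below_package = 0;
    while (n < MAX_DOMAINS) {
        unsigned eax, ebx, ecx, edx;
        __cpuid_count(leaf, n, eax, ebx, ecx, edx);
        if (ebx == 0)                /* EBX = 0: no more sub-leaves */
            break;
        shift[n] = eax & 0x1F;       /* cumulative right-shift count */
        if (shift[n] != 0)           /* a shift of 0: domain does not exist */
            (*below_package)++;
        n++;
    }
    for (unsigned m = 0; m < n; m++)
        bit_width[m] = (m == 0) ? shift[0] : shift[m] - shift[m - 1];
    return n;
}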

Example 10-18. Support Routines for Identifying Package, Die, Core, and Logical Processors from 32-bit x2APIC ID

a. Derive the extraction bitmask for logical processors in a processor core and associated mask offset for different
cores.
//
// This example shows how to enumerate CPU topology domain types (domain types may or may not be known/supported by the
software)
//
// Below is the list of sample domain types used in the example.
// Refer to the CPUID Leaf 1FH definition for the actual domain type numbers: “V2 Extended Topology Enumeration Leaf (Initial EAX
// Value = 1FH, ECX ≥ 0)”.
//
// LOGICAL PROCESSOR
// CORE
// MODULE
// TILE
// DIE
// PACKAGE
//
// The example shows how to identify and derive the extraction bitmasks for the domains with domain types
// LOGICAL_PROCESSOR_ID/CORE_ID/DIE_ID/PACKAGE_ID.
//

int DeriveLogical_Processor_Mask_Offsets (void)


{


IF (!HWMTSupported()) return -1;


execute cpuid with EAX = 0BH or 1FH, ECX = 0;
IF (returned domain type encoding in ECX[15:8] does not match LOGICAL_PROCESSOR_ID) return -1;
Mask_Logical_Processor_shift = EAX[4:0]; //# bits shift right of APIC ID to distinguish different cores, note this can be a shift
// of zero if there is only one logical processor per core.
Logical Processor Mask =~( (-1) << Mask_Logical_Processor_shift); //shift left to derive extraction bitmask for
// LOGICAL_PROCESSOR_ID
return 0;
}

b. Derive the extraction bitmask for processor cores in a physical processor package and associated mask offset for
different packages.

int DeriveCore_Mask_Offsets (void)


{
IF (!HWMTSupported()) return -1;
execute cpuid with EAX = 0BH or 1FH, ECX = 0;
WHILE( ECX[15:8] ) { //domain type encoding is valid
Mask_last_known_shift = EAX[4:0];
IF (returned domain type encoding in ECX[15:8] matches CORE) {
Mask_Core_shift = EAX[4:0];
}
ELSE IF (returned domain type encoding in ECX[15:8] matches DIE) {
Mask_Die_shift = EAX[4:0];
}
//
// Keep enumerating. Check if the next domain is the desired domain and if not, keep enumerating until you reach a known
// domain or the invalid domain (“0” domain type). If there are more domains between DIE and PACKAGE, the unknown
// domains will be ignored and treated as an extension of the last known domain (i.e., DIE in this case).
//

ECX++;
execute cpuid with EAX = 0BH or 1FH;
}

COREPlusLogical_Processor_MASK = ~( (-1) << Mask_Core_shift);


DIEPlusCORE_MASK = ~( (-1) << Mask_Die_shift);

//
// Treat domains between DIE and physical package as an extension of DIE for software choosing not to implement or recognize
// these unknown domains.
//

CORE_MASK = COREPlusLogical_Processor_MASK ^ Logical Processor Mask;


DIE_MASK = DIEPlusCORE_MASK ^ COREPlusLogical_Processor_MASK;
PACKAGE_MASK = (-1) << Mask_last_known_shift;

return 0;
}


10.9.3 Hierarchical ID of Logical Processors in an MP System


For Intel 64 and IA-32 processors, system hardware establishes an 8-bit initial APIC ID (or 32-bit APIC ID if the
processor supports CPUID leaf 0BH) that is unique for each logical processor following power-up or RESET (see
Section 10.6.1). Each logical processor on the system is allocated an initial APIC ID. BIOS may implement features
that tell the OS to support fewer than the total number of logical processors on the system bus. Those logical proces-
sors that are not available to applications at runtime are halted during the OS boot process. As a result, the number of
valid local APIC_IDs that can be queried by affinitizing-current-thread-context (See Example 10-23) is limited to
the number of logical processors enabled at runtime by the OS boot process.
Table 10-2 shows an example of the 8-bit APIC IDs that are initially reported for logical processors in a system with
four Intel Xeon MP processors that support Intel Hyper-Threading Technology (a total of 8 logical processors, each
physical package has two processor cores and supports Intel Hyper-Threading Technology). Of the two logical
processors within an Intel Xeon processor MP, logical processor 0 is designated the primary logical processor and
logical processor 1 as the secondary logical processor.

(Figure: two packages (Package 0 and Package 1), each containing two cores (Core 0 and Core 1), each core
containing two logical processors (T0 and T1); the PACKAGE_ID, CORE_ID, and LOGICAL_PROCESSOR_ID
sub-fields identify the corresponding levels of the hierarchy.)

Figure 10-7. Topological Relationships Between Hierarchical IDs in a Hypothetical MP Platform

Table 10-2. Initial APIC IDs for the Logical Processors in a System that has Four Intel Xeon MP Processors
Supporting Intel Hyper-Threading Technology1
Initial APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 1H 0H 0H
3H 1H 0H 1H
4H 2H 0H 0H
5H 2H 0H 1H
6H 3H 0H 0H
7H 3H 0H 1H
NOTE:
1. Because information on the number of processor cores in a physical package was not available in early single-core processors sup-
porting Intel Hyper-Threading Technology, the CORE_ID can be treated as 0.

Table 10-3 shows the initial APIC IDs for a hypothetical situation with a dual processor system in which each
physical package provides two processor cores, and each processor core also supports Intel Hyper-Threading
Technology.


Table 10-3. Initial APIC IDs for the Logical Processors in a System that has Two Physical Processors Supporting
Dual-Core and Intel Hyper-Threading Technology
Initial APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 0H 1H 0H
3H 0H 1H 1H
4H 1H 0H 0H
5H 1H 0H 1H
6H 1H 1H 0H
7H 1H 1H 1H

10.9.3.1 Hierarchical ID of Logical Processors with x2APIC ID


Table 10-4 shows an example of possible x2APIC ID assignments for a dual processor system that supports x2APIC,
in which each physical package provides four processor cores and each processor core also supports Intel Hyper-
Threading Technology. Note that the x2APIC IDs need not be contiguous in the system.

Table 10-4. Example of Possible x2APIC ID Assignment in a System that has Two Physical Processors Supporting
x2APIC and Intel Hyper-Threading Technology
x2APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 0H 1H 0H
3H 0H 1H 1H
4H 0H 2H 0H
5H 0H 2H 1H
6H 0H 3H 0H
7H 0H 3H 1H
10H 1H 0H 0H
11H 1H 0H 1H
12H 1H 1H 0H
13H 1H 1H 1H
14H 1H 2H 0H
15H 1H 2H 1H
16H 1H 3H 0H
17H 1H 3H 1H


10.9.4 Algorithm for Three-Domain Mappings of APIC_ID


Software can gather the initial APIC_IDs for each logical processor supported by the operating system at runtime (see note 1 below)
and extract identifiers corresponding to the three domains of sharing topology (package, core, and logical
processor). The three-domain algorithms below focus on a non-clustered MP system for simplicity. They do not
assume APIC IDs are contiguous or that all logical processors on the platform are enabled.
Intel supports multi-threading systems where all physical processors report identical values in CPUID leaf 0BH,
CPUID.1:EBX[23:16], CPUID.4:EAX[31:26], and CPUID.4:EAX[25:14] (see notes 2 and 3 below). The algorithms below assume the
target system has symmetry across physical package boundaries with respect to the number of logical processors
per package, number of cores per package, and cache topology within a package.
Software can choose to assume three-domain hierarchy if it was developed to understand only three domains.
However, software implementation needs to ensure it does not break if it runs on systems that have more domains
in the hierarchy even if it does not recognize them.
The extraction algorithm (for three-domain mappings from an APIC ID) uses the general procedure depicted in
Example 10-19, and is supplemented by more detailed descriptions on the derivation of topology enumeration
parameters for extraction bit masks:
1. Detect hardware multi-threading support in the processor.
2. Derive a set of bit masks that can extract the sub ID of each hierarchical domain of the topology. The algorithm
to derive extraction bit masks for LOGICAL_PROCESSOR_ID/CORE_ID/PACKAGE_ID differs based on whether the
APIC ID is 32-bit (see step 3 below) or 8-bit (see step 4 below).
3. If the processor supports CPUID leaf 0BH, each APIC ID contains a 32-bit value; the topology enumeration
parameters needed to derive three-domain extraction bit masks are:
a. Query the right-shift value for the LOGICAL_PROCESSOR_ID domain of the topology using CPUID leaf 0BH
with ECX =0H as input. The number of bits to shift-right on x2APIC ID (EAX[4:0]) can distinguish different
higher-domain entities above logical processor in the same physical package. This is also the width of the
bit mask to extract the LOGICAL_PROCESSOR_ID. The shift value may be 0, in which case there is no logical
processor bit mask to create. A platform whose cores have only one logical processor each is not required to
enumerate a separate bit layout for the logical processor, and the lowest bits may identify only the core (core
and logical processor are then synonymous).
b. Enumerate until the desired domain is found (i.e., processor cores). Determine if the next domain is the
expected domain. If the next domain is not known to the software, keep enumerating until the next known
or the last domain. Software should use the previous domain before this to represent the last previously
known domain (i.e., processor cores). If the software does not recognize or implement certain hierarchical
domains, it should assume these unknown domains as an extension of the last known domain.
c. Query CPUID leaf 0BH for the amount of bit shift to distinguish next higher-domain entities (e.g., physical
processor packages) in the system. This describes an explicit three-domain-topology situation for
commonly available processors. Consult Example 10-17 to adapt to situations beyond a three-domain
topology of a physical processor. The width of the extraction bit mask can be used to derive the cumulative
extraction bitmask to extract the sub IDs of logical processors (including different processor cores) in the
same physical package. The extraction bit mask to distinguish merely different processor cores can be
derived by xor’ing the logical processor extraction bit mask from the cumulative extraction bit mask.
d. Query the 32-bit x2APIC ID for the logical processor where the current thread is executing.
e. Derive the extraction bit masks corresponding to LOGICAL_PROCESSOR_ID, CORE_ID, and PACKAGE_ID,
starting from LOGICAL_PROCESSOR_ID.
f. Apply each extraction bit mask to the 32-bit x2APIC ID to extract sub-field IDs.

NOTES:
1. As noted in Section 10.6 and Section 10.9.3, the number of logical processors supported by the OS at runtime may be less than the
total number of logical processors available in the platform hardware.
2. The maximum number of addressable IDs for processor cores in a physical processor is obtained by executing CPUID with EAX=4
and a valid ECX index. The ECX index starts at 0.
3. The maximum number of addressable IDs for processor cores sharing the target cache level is obtained by executing CPUID with
EAX = 4 and the ECX index corresponding to the target cache level.


4. If the processor does not support CPUID leaf 0BH, each initial APIC ID contains an 8-bit value; the topology
enumeration parameters needed to derive extraction bit masks are:
a. Query the size of address space for sub IDs that can accommodate logical processors in a physical
processor package. This size parameter (CPUID.1:EBX[23:16]) can be used to derive the width of an
extraction bitmask to enumerate the sub IDs of different logical processors in the same physical package.
b. Query the size of address space for sub IDs that can accommodate processor cores in a physical processor
package. This size parameter can be used to derive the width of an extraction bitmask to enumerate the
sub IDs of processor cores in the same physical package.
c. Query the 8-bit initial APIC ID for the logical processor where the current thread is executing.
d. Derive the extraction bit masks using respective address sizes corresponding to LOGICAL_PROCESSOR_ID,
CORE_ID, and PACKAGE_ID, starting from LOGICAL_PROCESSOR_ID.
e. Apply each extraction bit mask to the 8-bit initial APIC ID to extract sub-field IDs.

Example 10-19. Support Routines for Detecting Hardware Multi-Threading and Identifying the Relationships Between
Package, Core, and Logical Processors
1. Detect support for Hardware Multi-Threading Support in a processor.

// Returns a non-zero value if CPUID reports the presence of hardware multi-threading


// support in the physical package where the current logical processor is located.
// This does not guarantee BIOS or OS will enable all logical processors in the physical
// package and make them available to applications.
// Returns zero if hardware multi-threading is not present.

#define HWMT_BIT 10000000H

unsigned int HWMTSupported(void)


{
// ensure cpuid instruction is supported
execute cpuid with eax = 0 to get vendor string
execute cpuid with eax = 1 to get feature flag and signature

// Check to see if this a Genuine Intel Processor

if (vendor string EQ GenuineIntel) {


return (feature_flag_edx & HWMT_BIT); // bit 28
}
return 0;
}

Example 10-20. Support Routines for Identifying Package, Core, and Logical Processors from 32-bit x2APIC ID
a. Derive the extraction bitmask for logical processors in a processor core and associated mask offset for different
cores.

int DeriveLogical_Processor_Mask_Offsets (void)


{
if (!HWMTSupported()) return -1;
execute cpuid with eax = 11, ECX = 0;
If (returned domain type encoding in ECX[15:8] does not match logical processor) return -1;
Mask_Logical_Processor_shift = EAX[4:0]; // # bits shift right of APIC ID to distinguish different cores, note this can be a shift
// of zero if there is only one logical processor per core.
Logical Processor Mask = ~( (-1) << Mask_Logical_Processor_shift); // shift left to derive extraction bitmask for
// LOGICAL_PROCESSOR_ID


return 0;
}

b. Derive the extraction bitmask for processor cores in a physical processor package and associated mask offset for
different packages.

int DeriveCore_Mask_Offsets (void)


{
if (!HWMTSupported()) return -1;
execute cpuid with eax = 11, ECX = 0;
while( ECX[15:8] ) { // domain type encoding is valid
Mask_Core_shift = EAX[4:0]; // needed to distinguish different physical packages
ECX ++;
execute cpuid with eax = 11;
}
COREPlusLogical_Processor_MASK = ~( (-1) << Mask_Core_shift);
// treat domains between core and physical package as a core for software choosing not to implement or recognize
// these unknown domains
CORE_MASK = COREPlusLogical_Processor_MASK ^ Logical Processor Mask;
PACKAGE_MASK = (-1) << Mask_Core_shift;
return 0;
}

c. Query the x2APIC ID of a logical processor.

// Returns the 32-bit x2APIC ID for the logical processor running the code.
// Software can use OS services to affinitize the current thread to each logical processor
// available under the OS to gather the x2APIC_IDs for each logical processor.

unsigned Getx2APIC_ID (void)


{
unsigned reg_edx = 0;
execute cpuid with eax = 11, ECX = 0
store returned value of edx
return (unsigned) (reg_edx) ;
}

Example 10-21. Support Routines for Identifying Package, Core, and Logical Processors from 8-bit Initial APIC ID
a. Find the size of address space for logical processors in a physical processor package.

#define NUM_LOGICAL_BITS 00FF0000H


// Use the mask above and CPUID.1.EBX[23:16] to obtain the max number of addressable IDs
// for logical processors in a physical package,

//Returns the size of address space of logical processors in a physical processor package;
// Software should not assume the value to be a power of 2.

unsigned char MaxLPIDsPerPackage(void)


{
if (!HWMTSupported()) return 1;
execute cpuid with eax = 1
store returned value of ebx
return (unsigned char) ((reg_ebx & NUM_LOGICAL_BITS) >> 16);
}


b. Find the size of address space for processor cores in a physical processor package.

// Returns the max number of addressable IDs for processor cores in a physical processor package;
// Software should not assume cpuid reports this value to be a power of 2.

unsigned MaxCoreIDsPerPackage(void)
{
if (!HWMTSupported()) return (unsigned char) 1;
if cpuid supports leaf number 4
{ // we can retrieve multi-core topology info using leaf 4
execute cpuid with eax = 4, ecx = 0
store returned value of eax
return (unsigned) ((reg_eax >> 26) +1);
}
else // must be a single-core processor
return 1;
}
c. Query the initial APIC ID of a logical processor.

#define INITIAL_APIC_ID_BITS FF000000H // CPUID.1.EBX[31:24] initial APIC ID

// Returns the 8-bit unique initial APIC ID for the processor running the code.
// Software can use OS services to affinitize the current thread to each logical processor
// available under the OS to gather the initial APIC_IDs for each logical processor.

unsigned GetInitAPIC_ID (void)


{
unsigned int reg_ebx = 0;
execute cpuid with eax = 1
store returned value of ebx
return (unsigned) ((reg_ebx & INITIAL_APIC_ID_BITS) >> 24);
}
d. Find the width of an extraction bitmask from the maximum count of the bit-field (address size).

// Returns the mask bit width of a bit field from the maximum count that bit field can represent.
// This algorithm does not assume ‘address size’ to have a value equal to power of 2.
// Address size for LOGICAL_PROCESSOR_ID can be calculated from MaxLPIDsPerPackage()/MaxCoreIDsPerPackage()
// Then use the routine below to derive the corresponding width of logical processor extraction bitmask
// Address size for CORE_ID is MaxCoreIDsPerPackage(),
// Derive the bitwidth for CORE extraction mask similarly

unsigned FindMaskWidth(unsigned Max_Count)


{unsigned int mask_width, cnt = Max_Count;
__asm {
mov eax, cnt
mov ecx, 0
mov mask_width, ecx
dec eax
bsr cx, ax
jz next
inc cx
mov mask_width, ecx
next:
mov eax, mask_width


}
return mask_width;
}
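For reference, the same computation expressed without inline assembly; a minimal C sketch (illustrative) returning
the smallest width whose address space covers Max_Count, equivalent to BSR(Max_Count - 1) + 1 for counts
greater than one:

/* Returns the mask bit width for a bit field that must represent values
   in [0, max_count); yields 0 when max_count <= 1. */
unsigned FindMaskWidthC(unsigned max_count)
{
    unsigned width = 0;
    while (max_count > (1u << width))   /* smallest width with 2^width >= max_count */
        width++;
    return width;
}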
e. Extract a sub ID from an 8-bit full ID, using address size of the sub ID and shift count.

// The routine below can extract LOGICAL_PROCESSOR_ID, CORE_ID, and PACKAGE_ID respectively from the init APIC_ID
// To extract LOGICAL_PROCESSOR_ID, MaxSubIDvalue is set to the address size of LOGICAL_PROCESSOR_ID, Shift_Count = 0
// To extract CORE_ID, MaxSubIDvalue is the address size of CORE_ID, Shift_Count is width of logical processor extraction bitmask.
// Returns the value of the sub ID, this is not a zero-based value

unsigned char GetSubID(unsigned char Full_ID, unsigned char MaxSubIDvalue, unsigned char Shift_Count)
{
MaskWidth = FindMaskWidth(MaxSubIDValue);
MaskBits = ((uchar) (FFH << Shift_Count)) ^ ((uchar) (FFH << Shift_Count + MaskWidth)) ;
SubID = Full_ID & MaskBits;
Return SubID;
}

Software must not assume local APIC_ID values in an MP system are consecutive. Non-consecutive local APIC_IDs
may be the result of hardware configurations or debug features implemented in the BIOS or OS.
An identifier for each hierarchical domain can be extracted from an 8-bit APIC_ID using the support routines illus-
trated in Example 10-21. The appropriate bit mask and shift value to construct the appropriate bit mask for each
domain must be determined dynamically at runtime.

10.9.5 Identifying Topological Relationships in an MP System


To detect the number of physical packages, processor cores, or other topological relationships in an MP system, the
following procedures are recommended:
• Extract the three-domain identifiers from the APIC ID of each logical processor enabled by system software.
The sequence is as follows (see the pseudo code shown in Example 10-22 and support routines shown in
Example 10-19):
• The extraction starts from the right-most bit field, corresponding to LOGICAL_PROCESSOR_ID, the
innermost hierarchy in a three-domain topology (See Figure 10-7). For the right-most bit field, the shift
value of the working mask is zero. The width of the bit field is determined dynamically using the
maximum number of logical processors per core, which can be derived from information provided from
CPUID.
• To extract the next bit-field, the shift value of the working mask is determined from the width of the bit
mask of the previous step. The width of the bit field is determined dynamically using the maximum
number of cores per package.
• To extract the remaining bit-field, the shift value of the working mask is determined from the maximum
number of logical processors per package. So the remaining bits in the APIC ID (excluding those bits
already extracted in the two previous steps) are extracted as the third identifier. This applies to a non-
clustered MP system, or if there is no need to distinguish between PACKAGE_ID and CLUSTER_ID.
If there is need to distinguish between PACKAGE_ID and CLUSTER_ID, PACKAGE_ID can be extracted
using an algorithm similar to the extraction of CORE_ID, assuming the number of physical packages in
each node of a clustered system is symmetric.
• Assemble the three-domain identifiers of LOGICAL_PROCESSOR_ID, CORE_ID, PACKAGE_IDs into arrays for
each enabled logical processor. This is shown in Example 10-23a.
• To detect the number of physical packages: use PACKAGE_ID to identify those logical processors that reside in
the same physical package. This is shown in Example 10-23b. This example also depicts a technique to
construct a mask to represent the logical processors that reside in the same package.


• To detect the number of processor cores: use CORE_ID to identify those logical processors that reside in the
same core. This is shown in Example 10-23. This example also depicts a technique to construct a mask to
represent the logical processors that reside in the same core.
In Example 10-22, the numerical ID value can be obtained from the value extracted with the mask by shifting it
right by shift count. Algorithms below do not shift the value. The assumption is that the SubID values can be
compared for equivalence without the need to shift.

Example 10-22. Pseudo Code Depicting Three-Domain Extraction Algorithm


For Each local_APIC_ID{
// Calculate Logical Processor Mask, the bit mask pattern to extract LOGICAL_PROCESSOR_ID,
// Logical Processor Mask is determined using topology enumeration parameters
// from CPUID leaf 0BH (Example 10-20);
// otherwise, Logical Processor Mask is determined using CPUID leaf 01H and leaf 04H (Example 10-21).
// This algorithm assumes there is symmetry across core boundary, i.e., each core within a
// package has the same number of logical processors
// LOGICAL_PROCESSOR_ID always starts from bit 0, corresponding to the right-most bit-field
LOGICAL_PROCESSOR_ID = APIC_ID & Logical Processor Mask;

// Extract CORE_ID:
// Core Mask is determined in Example 10-20 or Example 10-21
CORE_ID = (APIC_ID & Core Mask);

// Extract PACKAGE_ID:
// Assume single cluster.
// Shift out the mask width for maximum logical processors per package
// Package Mask is determined in Example 10-20 or Example 10-21
PACKAGE_ID = (APIC_ID & Package Mask) ;
}

Example 10-23. Compute the Number of Packages, Cores, and Processor Relationships in an MP System
a) Assemble lists of the PACKAGE_ID, CORE_ID, and LOGICAL_PROCESSOR_ID of each enabled logical processor

// The BIOS and/or OS may limit the number of logical processors available to applications after system boot.
// The below algorithm will compute topology for the processors visible to the thread that is computing it.

// Extract the 3-domains of IDs on every processor.


// SystemAffinity is a bitmask of all the processors started by the OS. Use OS specific APIs to obtain it.
// ThreadAffinityMask is used to affinitize the topology enumeration thread to each processor using OS specific APIs.
// Allocate per processor arrays to store the Package_ID, Core_ID, and LOGICAL_PROCESSOR_ID for every started processor.

ThreadAffinityMask = 1;
ProcessorNum = 0;
while (ThreadAffinityMask ≠ 0 && ThreadAffinityMask <= SystemAffinity) {
// Check to make sure we can utilize this processor first.
if (ThreadAffinityMask & SystemAffinity){
Set thread to run on the processor specified in ThreadAffinityMask
Wait if necessary and ensure thread is running on specified processor

APIC_ID = GetAPIC_ID(); // 32 bit ID in Example 10-20 or 8-bit ID in Example 10-21


Extract the Package_ID, Core_ID, and LOGICAL_PROCESSOR_ID as explained in three domain extraction
algorithm of Example 10-22
PackageID[ProcessorNum] = PACKAGE_ID;
CoreID[ProcessorNum] = CORE_ID;


LOGICAL_PROCESSOR_ID[ProcessorNum] = LOGICAL_PROCESSOR_ID;
ProcessorNum++;
}
ThreadAffinityMask <<= 1;
}
NumStartedLPs = ProcessorNum;

b) Using the list of PACKAGE_ID to count the number of physical packages in an MP system and construct, for each package, a multi-bit
mask corresponding to those logical processors residing in the same package.

// Compute the number of packages by counting the number of processors with unique PACKAGE_IDs in the PackageID array.
// Compute the mask of processors in each package.

// PackageIDBucket is an array of unique PACKAGE_ID values. Allocate an array of NumStartedLPs count of entries in this array.
// PackageProcessorMask is a corresponding array of the bit mask of processors belonging to the same package, these are
// processors with the same PACKAGE_ID.
// The algorithm below assumes there is symmetry across package boundary if more than one socket is populated in an MP
//system.
// Bucket Package IDs and compute processor mask for every package.

PackageNum = 1;
PackageIDBucket[0] = PackageID[0];
ProcessorMask = 1;
PackageProcessorMask[0] = ProcessorMask;
For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
ProcessorMask <<= 1;
For (i=0; i < PackageNum; i++) {
// we may be comparing bit-fields of logical processors residing in different
// packages, the code below assume package symmetry
If (PackageID[ProcessorNum] = PackageIDBucket[i]) {
PackageProcessorMask[i] |= ProcessorMask;
Break; // found in existing bucket, skip to next iteration
}
}
if (i = PackageNum) {
//PACKAGE_ID did not match any bucket, start new bucket
PackageIDBucket[i] = PackageID[ProcessorNum];
PackageProcessorMask[i] = ProcessorMask;
PackageNum++;
}
}
// PackageNum has the number of Packages started in OS
// PackageProcessorMask[] array has the processor set of each package

c) Using the list of CORE_ID to count the number of cores in an MP system and construct, for each core, a multi-bit mask corresponding
to those logical processors residing in the same core.

Processors in the same core can be determined by bucketing the processors with the same PACKAGE_ID and CORE_ID. Note that the code
below can bitwise-OR the values of PACKAGE_ID and CORE_ID because they have not been shifted right.
The algorithm below assumes there is symmetry across package boundary if more than one socket is populated in an MP system.

//Bucketing PACKAGE and CORE IDs and computing processor mask for every core
CoreNum = 1;
CoreIDBucket[0] = PackageID[0] | CoreID[0];
ProcessorMask = 1;


CoreProcessorMask[0] = ProcessorMask;
For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
ProcessorMask <<= 1;
For (i=0; i < CoreNum; i++) {
// we may be comparing bit-fields of logical processors residing in different
// packages, the code below assume package symmetry
If ((PackageID[ProcessorNum] | CoreID[ProcessorNum]) = CoreIDBucket[i]) {
CoreProcessorMask[i] |= ProcessorMask;
Break; // found in existing bucket, skip to next iteration
}
}
if (i = CoreNum) {
//Did not match any bucket, start new bucket
CoreIDBucket[i] = PackageID[ProcessorNum] | CoreID[ProcessorNum];
CoreProcessorMask[i] = ProcessorMask;
CoreNum++;
}
}
// CoreNum has the number of cores started in the OS
// CoreProcessorMask[] array has the processor set of each core

Other processor relationships such as processor mask of sibling cores can be computed from set operations of the
PackageProcessorMask[] and CoreProcessorMask[].
The algorithm shown above can be adapted to work with earlier generations of single-core IA-32 processors that
support Intel Hyper-Threading Technology and in situations where the deterministic cache parameter leaf is not
supported (provided CPUID supports initial APIC ID). A reference code example is available (see Intel® 64 Archi-
tecture Processor Topology Enumeration Technical Paper).

10.10 MANAGEMENT OF IDLE AND BLOCKED CONDITIONS


When a logical processor in an MP system (including multi-core processor or processors supporting Intel Hyper-
Threading Technology) is idle (no work to do) or blocked (on a lock or semaphore), additional management of the
core execution engine resource can be accomplished by using the HLT (halt), PAUSE, or the MONITOR/MWAIT
instructions.

10.10.1 HLT Instruction


The HLT instruction stops the execution of the logical processor on which it is executed and places it in a halted
state until further notice (see the description of the HLT instruction in Chapter 3 of the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 2A). When a logical processor is halted, active logical processors
continue to have full access to the shared resources within the physical package. Here, shared resources that were
being used by the halted logical processor become available to active logical processors, allowing them to execute
at greater efficiency. When the halted logical processor resumes execution, shared resources are again shared
among all active logical processors. (See Section 10.10.6.3, “Halt Idle Logical Processors,” for more information
about using the HLT instruction with processors supporting Intel Hyper-Threading Technology.)

10.10.2 PAUSE Instruction


The PAUSE instruction can improve the performance of processors supporting Intel Hyper-Threading Technology
when executing “spin-wait loops” and other routines where one thread is accessing a shared lock or semaphore in
a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty when
exiting the loop because it detects a possible memory order violation and flushes the core processor’s pipeline. The
PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses
this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-
pipelines the spin-wait loop to prevent it from consuming execution resources excessively and consuming power
needlessly. (See Section 10.10.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about
using the PAUSE instruction with IA-32 processors supporting Intel Hyper-Threading Technology.)

10.10.3 Detecting Support for the MONITOR/MWAIT Instructions


Streaming SIMD Extensions 3 introduced two instructions (MONITOR and MWAIT) to help multithreaded software
improve thread synchronization. In the initial implementation, MONITOR and MWAIT are available to software at
ring 0. The instructions are conditionally available at levels greater than 0. Use the following steps to detect the
availability of MONITOR and MWAIT:
• Use CPUID to query the MONITOR bit (CPUID.1.ECX[3] = 1).
• If CPUID indicates support, execute MONITOR inside a TRY/EXCEPT exception handler and trap for an
exception. If an exception occurs, MONITOR and MWAIT are not supported at a privilege level greater than 0.
See Example 10-24.

Example 10-24. Verifying MONITOR/MWAIT Support


boolean MONITOR_MWAIT_works = TRUE;
try {
    _asm {
        xor ecx, ecx        ; ECX = 0 (extensions)
        xor edx, edx        ; EDX = 0 (hints)
        mov eax, MemArea    ; EAX = linear address to monitor
        monitor
    }
    // Use MONITOR/MWAIT.
} except (UNWIND) {
    // If we get here, MONITOR/MWAIT is not supported at this privilege level.
    MONITOR_MWAIT_works = FALSE;
}
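The CPUID step of the check above can be written as follows; this is a sketch assuming a GCC/Clang toolchain that provides __get_cpuid in <cpuid.h>:

#include <cpuid.h> // GCC/Clang helper header

// Returns nonzero if CPUID.1:ECX[3] reports MONITOR/MWAIT support.
int monitor_mwait_supported(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;           // CPUID leaf 1 not available
    return (ecx >> 3) & 1;  // bit 3 is the MONITOR feature flag
}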

10.10.4 MONITOR/MWAIT Instruction


Operating systems usually implement idle loops to handle thread synchronization. In a typical idle-loop scenario,
there could be several “busy loops” that use a set of memory locations. An impacted processor waits in a
loop and polls a memory location to determine if there is available work to execute. The posting of work is typically
a write to memory (the work-queue of the waiting processor). The time for initiating a work request and getting it
scheduled is on the order of a few bus cycles.
From a resource sharing perspective (logical processors sharing execution resources), use of the HLT instruction in
an OS idle loop is desirable but has implications. Executing the HLT instruction on an idle logical processor puts the
targeted processor in a non-execution state. This requires another processor (when posting work for the halted
logical processor) to wake up the halted processor using an inter-processor interrupt. The posting and servicing of
such an interrupt introduces a delay in the servicing of new work requests.
In a shared memory configuration, exits from busy loops usually occur because of a state change applicable to a
specific memory location; such a change tends to be triggered by writes to the memory location by another agent
(typically a processor).
MONITOR/MWAIT complement the use of HLT and PAUSE to allow for efficient partitioning and un-partitioning of
shared resources among logical processors sharing physical resources. MONITOR sets up an effective address
range that is monitored for write-to-memory activities; MWAIT places the processor in an optimized state (this may
vary between different implementations) until a write to the monitored address range occurs.
In the initial implementation of MONITOR and MWAIT, they are available at CPL = 0 only.

Both instructions rely on the state of the processor’s monitor hardware. The monitor hardware can be either armed
(by executing the MONITOR instruction) or triggered (due to a variety of events, including a store to the monitored
memory region). If, upon execution of MWAIT, the monitor hardware is in a triggered state, MWAIT behaves as a NOP
and execution continues at the next instruction in the execution stream. The state of monitor hardware is not archi-
tecturally visible except through the behavior of MWAIT.
Multiple events other than a write to the triggering address range can cause a processor that executed MWAIT to
wake up. These include events that would lead to voluntary or involuntary context switches, such as:
• External interrupts, including NMI, SMI, INIT, BINIT, MCERR, A20M#
• Faults, Aborts (including Machine Check)
• Architectural TLB invalidations including writes to CR0, CR3, CR4, and certain MSR writes; execution of LMSW
(occurring prior to issuing MWAIT but after setting the monitor)
• Voluntary transitions due to fast system call and far calls (occurring prior to issuing MWAIT but after setting the
monitor)
Power management related events (such as Thermal Monitor 2 or chipset driven STPCLK# assertion) will not cause
the monitor event pending flag to be cleared. Faults will not cause the monitor event pending flag to be cleared.
Software should not allow for voluntary context switches in between MONITOR/MWAIT in the instruction flow. Note
that execution of MWAIT does not re-arm the monitor hardware. This means that MONITOR/MWAIT need to be
executed in a loop. Also note that exits from the MWAIT state could be due to a condition other than a write to the
triggering address; software should explicitly check the triggering data location to determine if the write occurred.
Software should also check the value of the triggering address following the execution of the monitor instruction
(and prior to the execution of the MWAIT instruction). This check is to identify any writes to the triggering address
that occurred during the course of MONITOR execution.
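The loop structure and re-check described above can be sketched with the SSE3 intrinsics _mm_monitor and _mm_mwait; the flag name and its padding are assumptions of this example, and the code must run at CPL 0 on the initial implementation:

#include <pmmintrin.h> // _mm_monitor, _mm_mwait (SSE3)

volatile int work_posted; // assumed trigger flag, padded to the monitor line size

void wait_for_work(void)
{
    while (!work_posted) {
        // Arm the monitor on the flag's address (extensions = 0, hints = 0).
        _mm_monitor((const void *)&work_posted, 0, 0);
        // Re-check after arming: a write may have occurred during MONITOR.
        if (!work_posted)
            _mm_mwait(0, 0); // may also wake for reasons other than the write
        // MWAIT does not re-arm the monitor, hence the enclosing loop.
    }
}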
The address range provided to the MONITOR instruction must be of write-back caching type. Only write-back
memory type stores to the monitored address range will trigger the monitor hardware. If the address range is not
in memory of write-back type, the address monitor hardware may not be set up properly or the monitor hardware
may not be armed. Software is also responsible for ensuring that
• Writes that are not intended to cause the exit of a busy loop do not write to a location within the address region
being monitored by the monitor hardware,
• Writes intended to cause the exit of a busy loop are written to locations within the monitored address region.
Not doing so will lead to more false wakeups (an exit from the MWAIT state not due to a write to the intended data
location). These have negative performance implications. It might be necessary for software to use padding to
prevent false wakeups. CPUID provides a mechanism for determining the size of the data location to monitor, as well
as a mechanism for determining the size of the pad.

10.10.5 Monitor/Mwait Address Range Determination


To use the MONITOR/MWAIT instructions, software should know the length of the region monitored by the
MONITOR/MWAIT instructions and the coherence line size for cache-snoop traffic in a multiprocessor
system. This information can be queried using the CPUID monitor leaf function (EAX = 05H). You will need the
smallest and largest monitor line sizes:
• To avoid missed wake-ups: make sure that the data structure used to monitor writes fits within the smallest
monitor line-size. Otherwise, the processor may not wake up after a write intended to trigger an exit from
MWAIT.
• To avoid false wake-ups: use the largest monitor line size to pad the data structure used to monitor writes.
Software must make sure that beyond the data structure, no unrelated data variable exists in the triggering
area for MWAIT. A pad may be needed to avoid this situation.
The above two values bear no relationship to the cache line size in the system, and software should not make any
assumptions to that effect. Within a single-cluster system, the two parameters default to the same value (the
size of the monitor triggering area is the same as the system coherence line size).
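A minimal sketch of that CPUID query follows, assuming a GCC/Clang toolchain with <cpuid.h>; the function name is illustrative:

#include <cpuid.h>
#include <stdint.h>

// Reads the smallest (EAX[15:0]) and largest (EBX[15:0]) monitor line
// sizes from CPUID leaf 05H. Returns 0 if the leaf is unavailable.
static int monitor_line_sizes(uint32_t *smallest, uint32_t *largest)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx))
        return 0;
    *smallest = eax & 0xFFFF;
    *largest  = ebx & 0xFFFF;
    return 1;
}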
Based on the monitor line sizes returned by the CPUID, the OS should dynamically allocate structures with appro-
priate padding. If static data structures must be used by an OS, attempt to adapt the data structure and use a
dynamically allocated data buffer for thread synchronization. When the latter technique is not possible, consider
not using MONITOR/MWAIT when using static data structures.
To set up the data structure correctly for MONITOR/MWAIT on multi-clustered systems: interaction between
processors, chipsets, and the BIOS is required (system coherence line size may depend on the chipset used in the
system; the size could be different from the processor’s monitor triggering area). The BIOS is responsible for setting
the correct value for the system coherence line size using the IA32_MONITOR_FILTER_LINE_SIZE MSR. Depending on the
relative magnitude of the size of the monitor triggering area versus the value written into the IA32_MONITOR_FIL-
TER_LINE_SIZE MSR, the smaller of the parameters will be reported as the Smallest Monitor Line Size. The larger
of the parameters will be reported as the Largest Monitor Line Size.

10.10.6 Required Operating System Support


This section describes changes that must be made to an operating system to run on processors supporting Intel
Hyper-Threading Technology. It also describes optimizations that can help an operating system make more efficient
use of the logical processors sharing execution resources. The required changes and suggested optimizations are
representative of the types of modifications that appear in Windows* XP and Linux* kernel 2.4.0 operating systems
for Intel processors supporting Intel Hyper-Threading Technology. Additional optimizations for processors
supporting Intel Hyper-Threading Technology are described in the Intel® 64 and IA-32 Architectures Optimization
Reference Manual.

10.10.6.1 Use the PAUSE Instruction in Spin-Wait Loops


Intel recommends that a PAUSE instruction be placed in all spin-wait loops that run on Intel processors supporting
Intel Hyper-Threading Technology and multi-core processors.
Software routines that use spin-wait loops include multiprocessor synchronization primitives (spin-locks, sema-
phores, and mutex variables) and idle loops. Such routines keep the processor core busy executing a load-compare-
branch loop while a thread waits for a resource to become available. Including a PAUSE instruction in such a loop
greatly improves efficiency (see Section 10.10.2, “PAUSE Instruction”). The following routine gives an example of a
spin-wait loop that uses a PAUSE instruction:

Spin_Lock:
CMP lockvar, 0 ;Check if lock is free
JE Get_Lock
PAUSE ;Short delay
JMP Spin_Lock
Get_Lock:
MOV EAX, 1
XCHG EAX, lockvar ;Try to get lock
CMP EAX, 0 ;Test if successful
JNE Spin_Lock
Critical_Section:
<critical section code>
MOV lockvar, 0 ;Release lock
...
Continue:
The spin-wait loop above uses a “test, test-and-set” technique for determining the availability of the synchroniza-
tion variable. This technique is recommended when writing spin-wait loops.
In IA-32 processor generations earlier than the Pentium 4 processor, the PAUSE instruction is treated as a NOP
instruction.
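For reference, a C11 rendering of the same “test, test-and-set” loop might look as follows; this is a sketch using the _mm_pause intrinsic, not code from this manual:

#include <immintrin.h>  // _mm_pause
#include <stdatomic.h>

// "Test, test-and-set" spin lock with PAUSE in the wait loop (sketch).
void spin_lock(atomic_int *lockvar)
{
    for (;;) {
        // Test first with a plain load to avoid bus-locked traffic while spinning.
        if (atomic_load_explicit(lockvar, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(lockvar, 1, memory_order_acquire) == 0)
            return;      // lock acquired
        _mm_pause();     // short delay; hints that this is a spin-wait loop
    }
}

void spin_unlock(atomic_int *lockvar)
{
    atomic_store_explicit(lockvar, 0, memory_order_release);
}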

10.10.6.2 Potential Usage of MONITOR/MWAIT in C0 Idle Loops


An operating system may implement different handlers for different idle states. A typical OS idle loop on an ACPI-
compatible OS is shown in Example 10-25:

Example 10-25. A Typical OS Idle Loop


// WorkQueue is a memory location indicating there is a thread
// ready to run. A non-zero value for WorkQueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The idle loop is entered with interrupts disabled.

WHILE (1) {
IF (WorkQueue) THEN {
// Schedule work at WorkQueue.
}
ELSE {
// No work to do - wait in appropriate C-state handler depending
// on Idle time accumulated
IF (IdleTime >= IdleTimeThreshold) THEN {
// Call appropriate C1, C2, C3 state handler, C1 handler
// shown below
}
}
}
// C1 handler uses a Halt instruction.
VOID C1Handler()
{
    STI
    HLT
}

The MONITOR and MWAIT instructions may be considered for use in the C0 idle state loops, if MONITOR and MWAIT are supported.

Example 10-26. An OS Idle Loop with MONITOR/MWAIT in the C0 Idle Loop


// WorkQueue is a memory location indicating there is a thread
// ready to run. A non-zero value for WorkQueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The following example assumes that the necessary padding has been
// added surrounding WorkQueue to eliminate false wakeups
// The idle loop is entered with interrupts disabled.

WHILE (1) {
IF (WorkQueue) THEN {
// Schedule work at WorkQueue.
}
ELSE {
// No work to do - wait in appropriate C-state handler depending
// on Idle time accumulated.
IF (IdleTime >= IdleTimeThreshold) THEN {
// Call appropriate C1, C2, C3 state handler, C1
// handler shown below
MONITOR WorkQueue // Set up EAX with WorkQueue linear address;
                  // ECX, EDX = 0
IF (WorkQueue = 0) THEN {
MWAIT
}
}
}
}
// C1 handler uses a Halt instruction.
VOID C1Handler()
{
    STI
    HLT
}

10.10.6.3 Halt Idle Logical Processors


If one of two logical processors is idle or in a spin-wait loop of long duration, explicitly halt that processor by means
of a HLT instruction.
In an MP system, operating systems can place idle processors into a loop that continuously checks the run queue
for runnable software tasks. Logical processors that execute idle loops consume a significant amount of the core’s
execution resources that might otherwise be used by the other logical processors in the physical package. For this
reason, halting idle logical processors optimizes performance.1 If all logical processors within a physical
package are halted, the processor will enter a power-saving state.

10.10.6.4 Potential Usage of MONITOR/MWAIT in C1 Idle Loops


An operating system may also consider replacing HLT with MONITOR/MWAIT in its C1 idle loop. An example is
shown in Example 10-27:

Example 10-27. An OS Idle Loop with MONITOR/MWAIT in the C1 Idle Loop


// WorkQueue is a memory location indicating there is a thread
// ready to run. A non-zero value for WorkQueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The following example assumes that the necessary padding has been
// added surrounding WorkQueue to eliminate false wakeups
// The idle loop is entered with interrupts disabled.

WHILE (1) {
IF (WorkQueue) THEN {
// Schedule work at WorkQueue
}
ELSE {
// No work to do - wait in appropriate C-state handler depending
// on Idle time accumulated
IF (IdleTime >= IdleTimeThreshold) THEN {
// Call appropriate C1, C2, C3 state handler, C1
// handler shown below
}
}
}

VOID C1Handler()
{
    MONITOR WorkQueue // Set up EAX with WorkQueue linear address;
                      // ECX, EDX = 0
    IF (WorkQueue = 0) THEN {
        STI
        MWAIT // EAX, ECX = 0
    }
}

1. Excessive transitions into and out of the HALT state could also incur performance penalties. Operating systems should evaluate the
performance trade-offs for their operating system.

10.10.6.5 Guidelines for Scheduling Threads on Logical Processors Sharing Execution Resources
Because the logical processors in a processor core share execution resources, the order in which threads are
dispatched to logical processors for execution can affect the overall efficiency of a system. The following guidelines
are recommended for scheduling threads for execution.
• Dispatch threads to one logical processor per processor core before dispatching threads to the other logical
processor sharing execution resources in the same processor core.
• In an MP system with two or more physical packages, distribute threads out over all the physical processors,
rather than concentrate them in one or two physical processors.
• Use processor affinity to assign a thread to a specific processor core or package, depending on the cache-
sharing topology. The practice increases the chance that the processor’s caches will contain some of the
thread’s code and data when it is dispatched for execution after being suspended.

10.10.6.6 Eliminate Execution-Based Timing Loops


Intel discourages the use of timing loops that depend on a processor’s execution speed to measure time. There are
several reasons:
• Timing loops cause problems when they are calibrated on an IA-32 processor running at one frequency and then
executed on a processor running at another frequency.
• Routines for calibrating execution-based timing loops produce unpredictable results when run on an IA-32
processor supporting Intel Hyper-Threading Technology. This is due to the sharing of execution resources
between the logical processors within a physical package.
To avoid the problems described, timing loop routines must use a timing mechanism for the loop that does not
depend on the execution speed of the logical processors in the system (see the sketch after this list). The following
sources are generally available:
• A high resolution system timer (for example, an Intel 8254).
• A high resolution timer within the processor (such as the local APIC timer or the time-stamp counter).
For additional information, see the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
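As an illustration, the following sketch times an interval with the time-stamp counter via the __rdtsc intrinsic; the TSC frequency is an assumed input that must be determined separately:

#include <stdint.h>
#include <x86intrin.h> // __rdtsc (GCC/Clang)

// Converts a TSC delta to nanoseconds, given the TSC frequency in Hz.
// Note: the multiplication can overflow for very long intervals.
uint64_t tsc_delta_ns(uint64_t start, uint64_t end, uint64_t tsc_hz)
{
    return (end - start) * 1000000000ull / tsc_hz;
}

// Usage sketch:
//   uint64_t t0 = __rdtsc();
//   /* ... interval being timed ... */
//   uint64_t ns = tsc_delta_ns(t0, __rdtsc(), tsc_hz);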

10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory


When software uses locks or semaphores to synchronize processes, threads, or other code sections, Intel recom-
mends that only one lock or semaphore be present within a cache line (or 128-byte sector, if 128-byte sectors are
supported). In processors based on Intel NetBurst microarchitecture (which support a 128-byte sector consisting of
two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-
byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to
service locks.
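A minimal C11 sketch of such a layout follows; the structure name is illustrative:

#include <stdalign.h>
#include <stdatomic.h>

// One lock per aligned 128-byte block: no other lock or hot data shares
// the sector (sketch).
struct padded_lock {
    alignas(128) atomic_int lock;       // begins on a 128-byte boundary
    char pad[128 - sizeof(atomic_int)]; // fills out the rest of the block
};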

10.11 MP INITIALIZATION FOR P6 FAMILY PROCESSORS


This section describes the MP initialization process for systems that use multiple P6 family processors. This process
uses the MP initialization protocol that was introduced with the Pentium Pro processor (see Section 10.4, “Multiple-
Processor (MP) Initialization”). For P6 family processors, this protocol is typically used to boot 2 or 4 processors
that reside on a single system bus; however, it can support from 2 to 15 processors in a multi-clustered system when
the APIC buses are tied together. Larger systems are not supported.

10.11.1 Overview of the MP Initialization Process for P6 Family Processors


During the execution of the MP initialization protocol, one processor is selected as the bootstrap processor (BSP)
and the remaining processors are designated as application processors (APs), see Section 10.4.1, “BSP and AP
Processors.” Thereafter, the BSP manages the initialization of itself and the APs. This initialization includes
executing BIOS initialization code and operating-system initialization code.
The MP protocol imposes the following requirements and restrictions on the system:
• An APIC clock (APICLK) must be provided.
• The MP protocol will be executed only after a power-up or RESET. If the MP protocol has been completed and a
BSP has been chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP
protocol to be repeated. Instead, each processor examines its BSP flag (in the APIC_BASE MSR) to determine
whether it should execute the BIOS boot-strap code (if it is the BSP) or enter a wait-for-SIPI state (if it is an
AP).
• All devices in the system that are capable of delivering interrupts to the processors must be inhibited from
doing so for the duration of the MP initialization protocol. The time during which interrupts must be inhibited
includes the window between when the BSP issues an INIT-SIPI-SIPI sequence to an AP and when the AP
responds to the last SIPI in the sequence.
The following special-purpose interprocessor interrupts (IPIs) are used during the boot phase of the MP initializa-
tion protocol. These IPIs are broadcast on the APIC bus.
• Boot IPI (BIPI)—Initiates the arbitration mechanism that selects a BSP from the group of processors on the
system bus and designates the remainder of the processors as APs. Each processor on the system bus
broadcasts a BIPI to all the processors following a power-up or RESET.
• Final Boot IPI (FIPI)—Initiates the BIOS initialization procedure for the BSP. This IPI is broadcast to all the
processors on the system bus, but only the BSP responds to it. The BSP responds by beginning execution of the
BIOS initialization code at the reset vector.
• Startup IPI (SIPI)—Initiates the initialization procedure for an AP. The SIPI message contains a vector to the AP
initialization code in the BIOS.
Table 10-5 describes the various fields of the boot phase IPIs.

Table 10-5. Boot Phase IPI Message Format

Type  Destination  Destination         Trigger  Level     Destination  Delivery       Vector
      Field        Shorthand           Mode               Mode         Mode           (Hex)
BIPI  Not used     All including self  Edge     Deassert  Don’t Care   Fixed (000)    40 to 4E*
FIPI  Not used     All including self  Edge     Deassert  Don’t Care   Fixed (000)    10
SIPI  Used         All excluding self  Edge     Assert    Physical     StartUp (110)  00 to FF

NOTE:
* For all P6 family processors.

For BIPI messages, the lower 4 bits of the vector field contain the APIC ID of the processor issuing the message and
the upper 4 bits contain the “generation ID” of the message. All P6 family processors have a generation ID of 4H.
BIPIs will therefore use vector values ranging from 40H to 4EH (4FH cannot be used because FH is not a valid APIC
ID).
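The encoding can be decoded as in this small sketch (the function name is illustrative):

// Decode a P6 boot-phase BIPI vector byte (sketch).
void decode_bipi_vector(unsigned char vector)
{
    unsigned apic_id = vector & 0x0F;        // issuing processor's APIC ID
    unsigned gen_id  = (vector >> 4) & 0x0F; // generation ID; 4H for P6 family
    (void)apic_id;
    (void)gen_id;
}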

10.11.2 MP Initialization Protocol Algorithm


Following a power-up or RESET of a system, the P6 family processors in the system execute the MP initialization
protocol algorithm to initialize each of the processors on the system bus. In the course of executing this algorithm,
the following boot-up and initialization operations are carried out:

1. Each processor on the system bus is assigned a unique APIC ID, based on system topology (see Section 10.4.5,
“Identifying Logical Processors in an MP System”). This ID is written into the local APIC ID register for each
processor.
2. Each processor executes its internal BIST simultaneously with the other processors on the system bus. Upon
completion of the BIST (at T0), each processor broadcasts a BIPI to “all including self” (see Figure 10-8).
3. APIC arbitration hardware causes all the APICs to respond to the BIPIs one at a time (at T1, T2, T3, and T4).
4. When the first BIPI is received (at time T1), each APIC compares the four least significant bits of the BIPI’s
vector field with its APIC ID. If the vector and APIC ID match, the processor selects itself as the BSP by setting
the BSP flag in its IA32_APIC_BASE MSR. If the vector and APIC ID do not match, the processor selects itself
as an AP by entering the “wait for SIPI” state. (Note that in Figure 10-8, the BIPI from processor 1 is the first
BIPI to be handled, so processor 1 becomes the BSP.)
5. The newly established BSP broadcasts an FIPI message to “all including self.” The FIPI is guaranteed to be
handled only after the completion of the BIPIs that were issued by the non-BSP processors.

[Figure 10-8. MP System With Multiple Pentium III Processors — four Pentium III processors (0–3) share the
system (CPU) bus and APIC bus; serial bus activity shows BIPI.1, BIPI.0, BIPI.3, and BIPI.2 handled at T1–T4
and the FIPI at T5; processor 1 becomes the BSP.]

6. After the BSP has been established, the outstanding BIPIs are received one at a time (at T2, T3, and T4) and
ignored by all processors.
7. When the FIPI is finally received (at T5), only the BSP responds to it. It responds by fetching and executing
BIOS boot-strap code, beginning at the reset vector (physical address FFFF FFF0H).
8. As part of the boot-strap code, the BSP creates an ACPI table and an MP table and adds its initial APIC ID to
these tables as appropriate.
9. At the end of the boot-strap procedure, the BSP broadcasts a SIPI message to all the APs in the system. Here,
the SIPI message contains a vector to the BIOS AP initialization code (at 000VV000H, where VV is the vector
contained in the SIPI message).
10. All APs respond to the SIPI message by racing to a BIOS initialization semaphore. The first one to the
semaphore begins executing the initialization code. (See the MP initialization code for semaphore implementation
details; a sketch of such a one-winner semaphore appears at the end of this section.)
As part of the AP initialization procedure, the AP adds its APIC ID number to the ACPI and MP tables as appro-
priate. At the completion of the initialization procedure, the AP executes a CLI instruction (to clear the IF flag in
the EFLAGS register) and halts itself.
11. When each of the APs has gained access to the semaphore and executed the AP initialization code, and all have
written their APIC IDs into the appropriate places in the ACPI and MP tables, the BSP establishes a count for the
number of processors connected to the system bus, completes executing the BIOS boot-strap code, and then
begins executing operating-system boot-strap and start-up code.
12. While the BSP is executing operating-system boot-strap and start-up code, the APs remain in the halted state.
In this state they will respond only to INITs, NMIs, and SMIs. They will also respond to snoops and to assertions
of the STPCLK# pin.
See Section 10.4.4, “MP Initialization Example,” for an annotated example of the use of the MP protocol to boot IA-32
processors in an MP system. This code should run on any IA-32 processor that uses the MP protocol.
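For illustration only, the one-winner semaphore mentioned in step 10 might be sketched in C11 as follows; the names are hypothetical and the actual BIOS implementation is not shown here:

#include <stdatomic.h>

atomic_int bios_init_sem; // 0 = free; assumed shared initialization semaphore

// Each AP claims the semaphore, runs its init code, then releases it.
void ap_init_section(void)
{
    while (atomic_exchange(&bios_init_sem, 1) != 0)
        ; // spin until the previous AP releases the semaphore
    /* ... add this AP's APIC ID to the ACPI and MP tables ... */
    atomic_store(&bios_init_sem, 0); // release for the next AP
    // The real flow then executes CLI and HLT (privileged; not shown).
}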

10.11.2.1 Error Detection and Handling During the MP Initialization Protocol


Errors may occur on the APIC bus during the MP initialization phase. These errors may be transient or permanent
and can be caused by a variety of failure mechanisms (for example, broken traces, soft errors during bus usage,
etc.). All serial bus related errors will result in an APIC checksum or acceptance error.
The MP initialization protocol makes the following assumptions regarding errors that occur during initialization:
• If errors are detected on the APIC bus during execution of the MP initialization protocol, the processors that
detect the errors are shut down.
• The MP initialization protocol will be executed by processors even if they fail their BIST sequences.

18. Updates to Chapter 18, Volume 3B
Change bars and violet text show changes to Chapter 18 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.

------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 18.10, “Incremental Decoding Information: Processor Family with CPUID
DisplayFamily_DisplayModel Signature 06_5FH, Machine Error Codes For Machine Check,” to correct the
register banks that error codes are reported in from IA32_MC6 and IA32_MC7 to IA32_MC7 and IA32_MC8.
Similar updates were made to Section 18.10.1, “Integrated Memory Controller Machine Check Errors.”



CHAPTER 18
INTERPRETING MACHINE CHECK ERROR CODES

Encoding of the model-specific and other information fields is different across processor families. The differences
are documented in the following sections.

18.1 INCREMENTAL DECODING INFORMATION: PROCESSOR FAMILY 06H,
MACHINE ERROR CODES FOR MACHINE CHECK

This section provides information for interpreting additional model-specific fields for external bus errors relating to
processor family 06H. The references to processor family 06H refer only to IA-32 processors with CPUID signa-
tures listed in Table 18-1.

Table 18-1. CPUID DisplayFamily_DisplayModel Signatures for Processor Family 06H


DisplayFamily_DisplayModel Processor Families/Processor Number Series
06_0EH Intel® Core™ Duo processor, Intel® Core™ Solo processor
06_0DH Intel Pentium M processor
06_09H Intel Pentium M processor
06_7H, 06_08H, 06_0AH, 06_0BH Intel Pentium III Xeon Processor, Intel Pentium III Processor
06_03H, 06_05H Intel Pentium II Xeon Processor, Intel Pentium II Processor
06_01H Intel Pentium Pro Processor

These errors are reported in the IA32_MCi_STATUS MSRs. They are reported architecturally as compound errors
with a general form of 0000 1PPT RRRR IILL in the MCA error code field. See Chapter 17 for information on the
interpretation of compound error codes. Incremental decoding information is listed in Table 18-2.

Table 18-2. Incremental Decoding Information: Processor Family 06H Machine Error Codes for Machine Check
Type                    Bit No.  Bit Function   Bit Description
MCA Error Codes1        15:0
Model Specific Errors   18:16    Reserved       Reserved
24:19 Bus Queue Request Type:
000000: BQ_DCU_READ_TYPE error.
000010: BQ_IFU_DEMAND_TYPE error.
000011: BQ_IFU_DEMAND_NC_TYPE error.
000100: BQ_DCU_RFO_TYPE error.
000101: BQ_DCU_RFO_LOCK_TYPE error.
000110: BQ_DCU_ITOM_TYPE error.
001000: BQ_DCU_WB_TYPE error.
001010: BQ_DCU_WCEVICT_TYPE error.
001011: BQ_DCU_WCLINE_TYPE error.
001100: BQ_DCU_BTM_TYPE error.
001101: BQ_DCU_INTACK_TYPE error.
001110: BQ_DCU_INVALL2_TYPE error.
001111: BQ_DCU_FLUSHL2_TYPE error.
010000: BQ_DCU_PART_RD_TYPE error.
010010: BQ_DCU_PART_WR_TYPE error.
010100: BQ_DCU_SPEC_CYC_TYPE error.
011000: BQ_DCU_IO_RD_TYPE error.
011001: BQ_DCU_IO_WR_TYPE error.
011100: BQ_DCU_LOCK_RD_TYPE error.
011110: BQ_DCU_SPLOCK_RD_TYPE error.
011101: BQ_DCU_LOCK_WR_TYPE error.
27:25 Bus Queue Error Type:
000: BQ_ERR_HARD_TYPE error.
001: BQ_ERR_DOUBLE_TYPE error.
010: BQ_ERR_AERR2_TYPE error.
100: BQ_ERR_SINGLE_TYPE error.
101: BQ_ERR_AERR1_TYPE error.
28 FRC Error 1 if FRC error active.
29 BERR 1 if BERR is driven.
30 Internal BINIT 1 if BINIT driven for this processor.
31 Reserved Reserved
Other Information   34:32   Reserved   Reserved
35 External BINIT 1 if BINIT is received from external bus.
36 Response Parity Error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 Bus BINIT This bit is asserted in IA32_MCi_STATUS if this component has received a hard
error response on a split transaction (one access that has needed to be split across
the 64-bit external bus interface into two accesses).
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 15-bit ROB time-out counter carries a 1 out of its
high order bit.2 The timer is cleared when a micro-instruction retires, an exception
is detected by the core processor, RESET is asserted, or when a ROB BINIT occurs.
The ROB time-out counter is prescaled by the 8-bit PIC timer which is a divide by
128 of the bus clock (the bus clock is 1:2, 1:3, 1:4 of the core clock3). When a carry
out of the 8-bit PIC timer occurs, the ROB counter counts up by one. While this bit is
asserted, it cannot be overwritten by another error.
41:39 Reserved Reserved
42 Hard Error This bit is asserted in IA32_MCi_STATUS if this component has initiated a bus
transaction that has received a hard error response. While this bit is asserted, it
cannot be overwritten.
43 IERR This bit is asserted in IA32_MCi_STATUS if this component has experienced a
failure that causes the IERR pin to be asserted. While this bit is asserted, it cannot
be overwritten.
44 AERR This bit is asserted in IA32_MCi_STATUS if this component has initiated 2 failing
bus transactions which have failed due to Address Parity Errors (AERR asserted).
While this bit is asserted, it cannot be overwritten.
45 UECC The Uncorrectable ECC error bit is asserted in IA32_MCi_STATUS for uncorrected
ECC errors. While this bit is asserted, the ECC syndrome field will not be
overwritten.
46 CECC The correctable ECC error bit is asserted in IA32_MCi_STATUS for corrected ECC
errors.
54:47 ECC Syndrome The ECC syndrome field in IA32_MCi_STATUS contains the 8-bit ECC syndrome only
if the error was a correctable/uncorrectable ECC error and there wasn't a previous
valid ECC error syndrome logged in IA32_MCi_STATUS.
A previous valid ECC error in IA32_MCi_STATUS is indicated by
IA32_MCi_STATUS.bit45 (uncorrectable error occurred) being asserted. After
processing an ECC error, machine check handling software should clear
IA32_MCi_STATUS.bit45 so that future ECC error syndromes can be logged.
56:55 Reserved Reserved
Status Register Validity Indicators1   63:57

NOTES:
1. These fields are architecturally defined. Refer to Chapter 17, “Machine-Check Architecture,” for more information.
2. For processors with a CPUID signature of 06_0EH, a ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out of its
high order bit.
3. For processors with a CPUID signature of 6_06_60H and later, the PIC timer will count crystal clock cycles.

18.2 INCREMENTAL DECODING INFORMATION: INTEL® CORE™ 2 PROCESSOR
FAMILY, MACHINE ERROR CODES FOR MACHINE CHECK
Table 18-4 provides information for interpreting additional model-specific fields for external bus errors relating to
processors based on Intel® Core™ microarchitecture, which implements the P4 bus specification. Table 18-3 lists
the CPUID signatures for Intel 64 processors that are covered by Table 18-4. These errors are reported in the
IA32_MCi_STATUS MSRs. They are reported architecturally as compound errors with a general form of
0000 1PPT RRRR IILL in the MCA error code field. See Chapter 17 for information on the interpretation of
compound error codes.

Table 18-3. CPUID DisplayFamily_DisplayModel Signatures for Processors Based on Intel® Core™ Microarchitecture
DisplayFamily_DisplayModel Processor Families/Processor Number Series
06_1DH Intel® Xeon® Processor 7400 series
06_17H Intel® Xeon® Processor 5200, 5400 series, Intel® Core™ 2 Quad processor Q9650
06_0FH Intel® Xeon® Processor 3000, 3200, 5100, 5300, 7300 series, Intel® Core™ 2 Quad, Intel® Core™ 2
Extreme, Intel® Core™ 2 Duo processors, Intel Pentium dual-core processors

Table 18-4. Incremental Bus Error Codes of Machine Check for Processors
Based on Intel® Core™ Microarchitecture
Type Bit No. Bit Function Bit Description
MCA Error Codes1        15:0
Model Specific Errors   18:16   Reserved   Reserved
24:19 Bus Queue Request Type:
000001: BQ_PREF_READ_TYPE error.
000000: BQ_DCU_READ_TYPE error.
000010: BQ_IFU_DEMAND_TYPE error.
000011: BQ_IFU_DEMAND_NC_TYPE error.
000100: BQ_DCU_RFO_TYPE error.
000101: BQ_DCU_RFO_LOCK_TYPE error.
000110: BQ_DCU_ITOM_TYPE error.
001000: BQ_DCU_WB_TYPE error.
001010: BQ_DCU_WCEVICT_TYPE error.
001011: BQ_DCU_WCLINE_TYPE error.
001100: BQ_DCU_BTM_TYPE error.
001101: BQ_DCU_INTACK_TYPE error.
001110: BQ_DCU_INVALL2_TYPE error.
001111: BQ_DCU_FLUSHL2_TYPE error.
010000: BQ_DCU_PART_RD_TYPE error.
010010: BQ_DCU_PART_WR_TYPE error.
010100: BQ_DCU_SPEC_CYC_TYPE error.
011000: BQ_DCU_IO_RD_TYPE error.
011001: BQ_DCU_IO_WR_TYPE error.
011100: BQ_DCU_LOCK_RD_TYPE error.
011110: BQ_DCU_SPLOCK_RD_TYPE error.
011101: BQ_DCU_LOCK_WR_TYPE error.
100100: BQ_L2_WI_RFO_TYPE error.
100110: BQ_L2_WI_ITOM_TYPE error.
27:25 Bus Queue Error Type:
001: Address Parity Error.
010: Response Hard Error.
011: Response Parity Error.
28 MCE Driven 1 if MCE is driven.
29 MCE Observed 1 if MCE is observed.
30 Internal BINIT 1 if BINIT driven for this processor.
31 BINIT Observed 1 if BINIT is observed for this processor.
Other Information   33:32   Reserved   Reserved
34 PIC and FSB Data Parity   Data parity detected on either PIC or FSB access.
35 Reserved Reserved
36 Response Parity Error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 FSB Address Parity Address parity error detected:
1: Address parity error detected.
0: No address parity error.
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out of its
high order bit.