RDNA2 Shader ISA November2020 PDF
Reference Guide
AMD
30-November-2020
"RDNA 2" Instruction Set Architecture
Specification Agreement
This Specification Agreement (this "Agreement") is a legal agreement between Advanced Micro Devices, Inc. ("AMD") and "You"
as the recipient of the attached AMD Specification (the "Specification"). If you are accessing the Specification as part of your
performance of work for another party, you acknowledge that you have authority to bind such party to the terms and
conditions of this Agreement. If you accessed the Specification by any means or otherwise use or provide Feedback (defined
below) on the Specification, You agree to the terms and conditions set forth in this Agreement. If You do not agree to the terms
and conditions set forth in this Agreement, you are not licensed to use the Specification; do not use, access or provide Feedback
about the Specification. In consideration of Your use or access of the Specification (in whole or in part), the receipt and
sufficiency of which are acknowledged, You agree as follows:
1. You may review the Specification only (a) as a reference to assist You in planning and designing Your product, service or
technology ("Product") to interface with an AMD product in compliance with the requirements as set forth in the
Specification and (b) to provide Feedback about the information disclosed in the Specification to AMD.
2. Except as expressly set forth in Paragraph 1, all rights in and to the Specification are retained by AMD. This Agreement
does not give You any rights under any AMD patents, copyrights, trademarks or other intellectual property rights. You
may not (i) duplicate any part of the Specification; (ii) remove this Agreement or any notices from the Specification, or (iii)
give any part of the Specification, or assign or otherwise provide Your rights under this Agreement, to anyone else.
3. The Specification may contain preliminary information, errors, or inaccuracies, or may not include certain necessary
information. Additionally, AMD reserves the right to discontinue or make changes to the Specification and its products at
any time without notice. The Specification is provided entirely "AS IS." AMD MAKES NO WARRANTY OF ANY KIND AND
DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, INCLUDING BUT NOT LIMITED TO IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, TITLE OR THOSE
WARRANTIES ARISING AS A COURSE OF DEALING OR CUSTOM OF TRADE. AMD SHALL NOT BE LIABLE FOR DIRECT,
INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, PUNITIVE OR EXEMPLARY DAMAGES OF ANY KIND (INCLUDING
LOSS OF BUSINESS, LOSS OF INFORMATION OR DATA, LOST PROFITS, LOSS OF CAPITAL, LOSS OF GOODWILL)
REGARDLESS OF THE FORM OF ACTION WHETHER IN CONTRACT, TORT (INCLUDING NEGLIGENCE) AND STRICT
PRODUCT LIABILITY OR OTHERWISE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
4. Furthermore, AMD’s products are not designed, intended, authorized or warranted for use as components in systems
intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other
application in which the failure of AMD’s product could create a situation where personal injury, death, or severe
property or environmental damage may occur.
5. You have no obligation to give AMD any suggestions, comments or feedback ("Feedback") relating to the Specification.
However, any Feedback You voluntarily provide may be used by AMD without restriction, fee or obligation of
confidentiality. Accordingly, if You do give AMD Feedback on any version of the Specification, You agree AMD may freely
use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any product, and has the right to
sublicense third parties to do the same. Further, You will not give AMD any Feedback that You may have reason to believe
is (i) subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to
license terms which seek to require any product or intellectual property incorporating or derived from Feedback or any
Product or other AMD intellectual property to be licensed to or otherwise provided to any third party.
6. You shall adhere to all applicable U.S., European, and other export laws, including but not limited to the U.S. Export
Administration Regulations ("EAR"), (15 C.F.R. Sections 730 through 774), and E.U. Council Regulation (EC) No 428/2009 of
5 May 2009. Further, pursuant to Section 740.6 of the EAR, You hereby certify that, except pursuant to a license granted
by the United States Department of Commerce Bureau of Industry and Security or as otherwise permitted pursuant to a
License Exception under the U.S. Export Administration Regulations ("EAR"), You will not (1) export, re-export or release
to a national of a country in Country Groups D:1, E:1 or E:2 any restricted technology, software, or source code You receive
hereunder, or (2) export to Country Groups D:1, E:1 or E:2 the direct product of such technology or software, if such
foreign produced direct product is subject to national security controls as identified on the Commerce Control List
(currently found in Supplement 1 to Part 774 of EAR). For the most current Country Group listings, or for additional
information about the EAR or Your obligations under those regulations, please refer to the U.S. Bureau of Industry and
Security’s website at http://www.bis.doc.gov/.
7. If You are a part of the U.S. Government, then the Specification is provided with "RESTRICTED RIGHTS" as set forth in
subparagraphs (c) (1) and (2) of the Commercial Computer Software-Restricted Rights clause at FAR 52.227-14 or
subparagraph (c) (1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013, as
applicable.
8. This Agreement is governed by the laws of the State of California without regard to its choice of law principles. Any
dispute involving it must be brought in a court having jurisdiction of such dispute in Santa Clara County, California, and
You waive any defenses and rights allowing the dispute to be litigated elsewhere. If any part of this agreement is
unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall
continue in effect. The failure of AMD to enforce any rights granted hereunder or to take action against You in the event
of any breach hereunder shall not be deemed a waiver by AMD as to subsequent enforcement of rights or subsequent
actions in the event of future breaches. This Agreement is the entire agreement between You and AMD concerning the
Specification; it may be changed only by a written document signed by both You and an authorized representative of
AMD.
DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. This
document may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to
update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties
with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind,
including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect
to the operation or use of AMD hardware, software or other products described herein. No license, including implied
or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations
applicable to the purchase or use of AMD’s products or technology are as set forth in a signed agreement between the
parties or in AMD’s Standard Terms and Conditions of Sale.
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a
trademark of Apple Inc. used by permission by Khronos Group, Inc. OpenGL® and the oval logo are trademarks or
registered trademarks of Hewlett Packard Enterprise in the United States and/or other countries worldwide. DirectX is
a registered trademark of Microsoft Corporation in the US and other jurisdictions. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective companies.
Contents
Preface
    Audience
    Organization
    Conventions
    Related Documents
    Instruction Changes
    Additional Information
1. Introduction
    1.1. Terminology
2. Program Organization
11.2. Operations
Preface
This document specifies the instructions (including the format of each type of instruction) and the relevant program state (including how the program state interacts with the instructions). Some instruction fields are mutually dependent; not all possible settings for all fields are legal. This document specifies the valid combinations, and its purposes are to:
1. Specify the language constructs and behavior, including the organization of each type of
instruction in both text syntax and binary format.
2. Provide a reference of instruction operation that compiler writers can use to maximize
performance of the processor.
2. Provide a reference of instruction operation that compiler writers can use to maximize
performance of the processor.
Audience
This document is intended for programmers writing application and system software, including
operating systems, compilers, loaders, linkers, device drivers, and system utilities. It assumes
that programmers are writing compute-intensive parallel applications (streaming applications)
and assumes an understanding of requisite programming practices.
Organization
This document begins with an overview of the AMD RDNA processors' hardware and
programming environment (Chapter 1).
Chapter 2 describes the organization of RDNA programs.
Chapter 3 describes the program state that is maintained.
Chapter 4 describes the program flow.
Chapter 5 describes the scalar ALU operations.
Chapter 6 describes the vector ALU operations.
Chapter 7 describes the scalar memory operations.
Chapter 8 describes the vector memory operations.
Chapter 9 provides information about the flat memory instructions.
Chapter 10 describes the data share operations.
Chapter 11 describes exporting the parameters of pixel color and vertex shaders.
Chapter 12 describes instruction details, first by the microcode format to which they belong, and then in alphabetic order.
Conventions
The following conventions are used in this document:
[1,2)    A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).
[1,2]    A range that includes both the left-most and right-most values.
7:4      A bit range, from bit 7 to bit 4, inclusive. The high-order bit is shown first.
italicized word or phrase    The first use of a term or concept basic to the understanding of stream computing.
Related Documents
• Intermediate Language (IL) Reference Manual. Published by AMD.
• AMD Accelerated Parallel Processing OpenCL™ Programming Guide. Published by AMD.
• The OpenCL™ Specification. Published by Khronos Group. Aaftab Munshi, editor.
• OpenGL® Programming Guide, at http://www.glprogramming.com/red/
• Microsoft DirectX® Reference Website, at https://msdn.microsoft.com/en-us/library/
windows/desktop/ee663274(v=vs.85).aspx
New and changed features in this generation include:
• Ray Tracing
• Dot-product ALU operations added to accelerate inferencing and deep learning:
◦ V_DOT2_F32_F16 / V_DOT2C_F32_F16
◦ V_DOT2_I32_I16 / V_DOT2_U32_U16
◦ V_DOT4_I32_I8 / V_DOT4C_I32_I8
◦ V_DOT4_U32_U8
◦ V_DOT8_I32_I4
◦ V_DOT8_U32_U4
• Image Load MSAA
• Global memory loads with "Add-TID"
• Atomic clamped subtract buffer and global instructions
• VGPR & LDS allocation-unit size doubled
• S_MEMTIME replaced by "s_getreg_b32 Sn, SHADER_CYCLES"
Instruction Changes
Removed:
Additional Information
For more information on AMD GPU architectures, please visit https://GPUOpen.com.
Chapter 1. Introduction
The AMD RDNA processor implements a parallel micro-architecture that provides an excellent
platform not only for computer graphics applications but also for general-purpose data parallel
applications. Data-intensive applications that require high bandwidth or are computationally
intensive may be run on an AMD RDNA processor.
The figure below shows a block diagram of the AMD RDNA Generation series processors.
The RDNA device includes a data-parallel processor (DPP) array, a command processor, a
memory controller, and other logic (not shown). The RDNA command processor reads
commands that the host has written to memory-mapped RDNA registers in the system-memory
address space. The command processor sends hardware-generated interrupts to the host when
the command is completed. The RDNA memory controller has direct access to all RDNA device
memory and the host-specified areas of system memory. To satisfy read and write requests, the
memory controller performs the functions of a direct-memory access (DMA) controller, including
computing memory-address offsets based on the format of the requested data in memory. In the
RDNA environment, a complete application includes two parts: a program running on the host processor, and programs (kernels) running on the RDNA processor.
The DPP array is the heart of the RDNA processor. The array is organized as a set of
workgroup processor pipelines, each independent from the others, that operate in parallel on
streams of floating-point or integer data. The workgroup processor pipelines can process data
or, through the memory controller, transfer data to, or from, memory. Computation in a
workgroup processor pipeline can be made conditional. Outputs written to memory can also be
made conditional.
When it receives a request, the workgroup processor pipeline loads instructions and data from
memory, begins execution, and continues until the end of the kernel. As kernels are running, the
RDNA hardware is designed to automatically fetch instructions from memory into on-chip
caches; RDNA software plays no role in this. RDNA kernels can load data from off-chip memory
into on-chip general-purpose registers (GPRs) and caches.
The AMD RDNA devices can detect floating point exceptions and can generate interrupts. In
particular, they detect IEEE-754 floating-point exceptions in hardware; these can be recorded
for post-execution analysis. The software interrupts shown in the previous figure from the
command processor to the host represent hardware-generated interrupts for signaling
command-completion and related management functions.
The RDNA processor hides memory latency by keeping track of potentially hundreds of work-
items in various stages of execution, and by overlapping compute operations with memory-
access operations.
1.1. Terminology
Table 1. Basic Terms
Term Description
RDNA Processor The RDNA shader processor is a scalar and vector ALU designed to run complex
programs on behalf of a wavefront.
Dispatch A dispatch launches a 1D, 2D, or 3D grid of work to the RDNA processor array.
Workgroup A workgroup is a collection of wavefronts that have the ability to synchronize with each
other quickly; they also can share data through the Local Data Share.
Work-item A single element of work: one element from the dispatch grid, or in graphics a pixel or
vertex.
Literal Constant A 32-bit integer or float constant that is placed in the instruction stream.
Scalar ALU (SALU) The scalar ALU operates on one value per wavefront and manages all control flow.
Vector ALU (VALU) The vector ALU maintains vector GPRs that are unique for each work-item and executes
arithmetic operations uniquely on each work-item.
Workgroup Processor (WGP) The basic unit of shader computation hardware, including scalar & vector ALUs and
memory, as well as LDS and scalar caches.
Compute Unit (CU) One half of a WGP. Contains 2 SIMD32’s which share one path to memory.
Microcode format The microcode format describes the bit patterns used to encode instructions. Each
instruction is either 32 or more bits, in units of 32-bits.
Instruction An instruction is the basic unit of the kernel. Instructions include: vector ALU, scalar
ALU, memory transfer, and control flow operations.
Quad A quad is a 2x2 group of screen-aligned pixels. This is relevant for sampling texture
maps.
Texture Sampler (S#) A texture sampler is a 128-bit entity that describes how the vector memory system
reads and samples (filters) a texture map.
Texture Resource (T#) A texture resource descriptor describes an image in memory: address, data format,
width, height, depth, etc.
Buffer Resource (V#) A buffer resource descriptor describes a buffer in memory: address, data format, stride,
etc.
UTC Universal (Address) Translation Cache: used for virtual memory, translating logical to
physical addresses.
Each workgroup processor includes:
• A scalar ALU, which operates on one value per wavefront (common to all work items).
• A vector ALU, which operates on unique values per work-item.
• Local data storage, which allows work-items within a workgroup to communicate and share
data.
• Scalar memory, which can transfer data between SGPRs and memory through a cache.
• Vector memory, which can transfer data between VGPRs and memory, including sampling
texture maps.
All kernel control flow is handled using scalar ALU instructions. This includes if/else, branches
and looping. Scalar ALU (SALU) and memory instructions work on an entire wavefront and
operate on up to two SGPRs, as well as literal constants.
Vector memory and ALU instructions operate on all work-items in the wavefront at one time. To
support branching and conditional execution, every wavefront has an EXECute mask that
determines which work-items are active at that moment and which are dormant. Active work-
items execute the vector instruction, and dormant ones treat the instruction as a NOP. The
EXEC mask can be changed at any time by scalar ALU instructions.
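The EXEC-mask gating described above can be sketched in Python (a behavioral model only; the function names are illustrative, not part of the ISA):

```python
# Behavioral sketch of EXEC-mask gating for a wave32 vector instruction.
# Active lanes execute the operation; dormant lanes keep their old values (NOP).

def exec_masked_vop(exec_mask, vdst, vsrc, op, lanes=32):
    """Apply a per-lane unary op only where the corresponding EXEC bit is set."""
    return [op(vsrc[i]) if (exec_mask >> i) & 1 else vdst[i]
            for i in range(lanes)]

def execz(exec_mask):
    """EXECZ helper bit: 1 when the EXEC mask is all zeros."""
    return 1 if exec_mask == 0 else 0
```

For example, with exec_mask = 0b0101 only lanes 0 and 2 receive new results; all other lanes are unchanged.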
Vector ALU instructions can take up to three arguments, which can come from VGPRs, SGPRs,
or literal constants that are part of the instruction stream. They operate on all work-items
enabled by the EXEC mask. Vector compare and add-with-carry-out instructions return a bit-per-work-item
mask back to the SGPRs to indicate, per work-item, which had a "true" result from the compare
or generated a carry-out.
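This bit-per-work-item result can be sketched as follows (illustrative Python, with the VCC-style mask modeled as an integer; not an encoding- or timing-accurate model):

```python
# Sketch: a vector compare and an add-with-carry-out, each producing a
# per-work-item bit mask (like VCC) for the lanes enabled by EXEC.

def v_cmp_gt(a, b, exec_mask, lanes=32):
    """Set a result bit for each active lane where a[i] > b[i]."""
    vcc = 0
    for i in range(lanes):
        if (exec_mask >> i) & 1 and a[i] > b[i]:
            vcc |= 1 << i
    return vcc

def v_add_co(a, b, exec_mask, lanes=32):
    """32-bit add per lane; carry-out bits collected into a VCC-style mask."""
    result, carry = [0] * lanes, 0
    for i in range(lanes):
        if (exec_mask >> i) & 1:
            s = a[i] + b[i]
            result[i] = s & 0xFFFFFFFF
            if s > 0xFFFFFFFF:
                carry |= 1 << i
    return result, carry
```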
Vector memory instructions transfer data between VGPRs and memory. Each work-item
supplies its own memory address and supplies or receives unique data. These instructions are
also subject to the EXEC mask.
Wave64 instructions issue in two passes of 32 work-items each; a pass may be skipped when no work-items are active in
that half (i.e. EXEC == 0 for that half). Wave64 VALU instructions which return a scalar (SGPR
or VCC) value do not skip either pass. Wave64 vector memory instructions can skip either pass,
but do not skip both passes.
The upper half of EXEC and VCC are ignored for wave32 waves.
When a workgroup is dispatched or a graphics draw is launched, the waves can allocate LDS
space in one of two modes: CU or WGP mode. The shader can simultaneously execute some
waves in LDS mode and other waves in CU mode.
• CU mode: in this mode, the LDS is effectively split into a separate upper and lower LDS.
• enabling a PE to recover the pre-op value from an atomic operation by performing a cache-
less load from its return address after receipt of the write confirmation acknowledgment,
and
• enabling the system to maintain a relaxed consistency model.
Each scatter write from a given PE to a given memory channel maintains order. The
acknowledgment enables one processing element to implement a fence to maintain serial
consistency by ensuring all writes have been posted to memory prior to completing a
subsequent write. In this manner, the system can maintain a relaxed consistency model
between all parallel work-items operating on the system.
The amount of shader padding required is related to how far the shader hardware may prefetch
ahead. The shader can be set to prefetch 1, 2 or 3 cachelines (64 bytes) ahead of the current
program counter. This is controlled via a wave-launch state register, or by the shader program
itself with S_INST_PREFETCH.
LDS (Local Data Share, 64kB): A scratch RAM with built-in arithmetic capabilities that allows data to be shared between threads in a workgroup.
EXEC (Execute Mask, 64 bits): A bit mask with one bit per thread, which is applied to vector instructions and controls which threads execute and which ignore the instruction.
EXECZ (EXEC is zero, 1 bit): A single bit flag indicating that the EXEC mask is all zeros.
VCC (Vector Condition Code, 64 bits): A bit mask with one bit per thread; it holds the result of a vector compare operation.
VCCZ (VCC is zero, 1 bit): A single bit flag indicating that the VCC mask is all zeros.
SCC (Scalar Condition Code, 1 bit): Result from a scalar ALU comparison instruction.
TBA (Trap Base Address, 64 bits): Holds the pointer to the current trap handler program.
TMA (Trap Memory Address, 64 bits): Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.
TTMP0-TTMP15 (Trap Temporary SGPRs, 32 bits each): 16 SGPRs available only to the trap handler for temporary storage.
VMCNT (Vector memory instruction count, 6 bits): Counts the number of VMEM load instructions issued but not yet completed.
VSCNT (Vector memory instruction count, 6 bits): Counts the number of VMEM store instructions issued but not yet completed.
EXPCNT (Export Count, 3 bits): Counts the number of export and GDS instructions issued but not yet completed. Also counts VMEM writes that have not yet sent their write-data to the last level cache.
LGKMCNT (LDS, GDS, Constant and Message count, 4 bits): Counts the number of LDS, GDS, constant-fetch (scalar memory read), and message instructions issued but not yet completed.
The PC interacts with three instructions: S_GET_PC, S_SET_PC, S_SWAP_PC. These transfer
the PC to, and from, an even-aligned SGPR pair.
EXEC can be read from, and written to, through scalar instructions; it also can be written as a
result of a vector-ALU compare (V_CMPX). This mask affects vector-ALU, vector-memory, LDS,
GDS, and export instructions. It does not affect scalar (ALU or memory) execution or branches.
A helper bit (EXECZ) can be used as a condition for branches to skip code when EXEC is zero.
Wave32: the upper 32 bits of EXEC are ignored, and EXECZ represents the status of only the
lower 32 bits of EXEC.
• VALU instructions can be skipped, unless they write SGPRs (these are not skipped)
• Wave64 memory instructions: can skip one half but not the entire instruction
• Wave32 memory instructions: not skipped
Use S_CBRANCH to rapidly skip over code when it is likely that the EXEC mask is zero.
SCC 1 Scalar condition code. Used as a carry-out bit. For a comparison instruction,
this bit indicates failure or success. For logical operations, this is 1 if the
result was non-zero.
SPI_PRIO 2:1 Wavefront priority set by the shader processor interpolator (SPI) when the
wavefront is created. See the S_SETPRIO instruction (page 12-49) for
details. 0 is lowest, 3 is highest priority.
USER_PRIO 4:3 User settable wave-priority set by the shader program. See the
S_SETPRIO instruction (page 12-49) for details.
PRIV 5 Privileged mode. Can only be active when in the trap handler. Gives write
access to the TTMP, TMA, and TBA registers.
TRAP_EN 6 Indicates that a trap handler is present. When set to zero, traps are not
taken.
TTRACE_EN 7 Indicates whether thread trace is enabled for this wavefront. If zero, also
ignore any shader-generated (instruction) thread-trace data.
EXPORT_RDY 8 This status bit indicates if export buffer space has been allocated. The
shader stalls any export instruction until this bit becomes 1. It is set to 1
when export buffer space has been allocated. Before a Pixel or Vertex
shader can export, the hardware checks the state of this bit. If the bit is 1,
export can be issued. If the bit is zero, the wavefront sleeps until space
becomes available in the export buffer. Then, this bit is set to 1, and the
wavefront resumes.
HALT 13 Wavefront is halted or scheduled to halt. HALT can be set by the host
through wavefront-control messages, or by the shader. This bit is ignored
while in the trap handler (PRIV = 1); it also is ignored if a host-initiated trap
is received (request to enter the trap handler).
TTRACE_SIMD_EN 15 Enables/disables thread trace for this SIMD. This bit allows more than one
SIMD to be outputting USERDATA (shader initiated writes to the thread-
trace buffer). Note that wavefront data is only traced from one SIMD per
shader engine. Wavefront user data (instruction based) can still be output if
this bit is zero.
VALID 16 Wavefront is active (has been created and not yet ended).
SKIP_EXPORT 18 For Vertex Shaders only. 1 = this shader is not allocated export buffer
space; all export instructions are ignored (treated as NOPs). Formerly
called VS_NO_ALLOC. Used for stream-out of multiple streams (multiple
passes over the same VS), and for DS running in the VS stage for
wavefronts that produced no primitives.
FP_ROUND 3:0 [1:0] Single precision round mode. [3:2] Double/Half-precision round mode.
Round Modes: 0=nearest even, 1= +infinity, 2= -infinity, 3= toward zero.
FP_DENORM 7:4 [1:0] Single denormal mode. [3:2] Double/Half-precision denormal mode.
Denorm modes:
0 = flush input and output denorms.
1 = allow input denorms, flush output denorms.
2 = flush input denorms, allow output denorms.
3 = allow input and output denorms.
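Decoding these two MODE subfields can be sketched as follows (illustrative Python; the field positions follow the FP_ROUND and FP_DENORM rows above, and the function name is hypothetical):

```python
# Sketch: decode the MODE register's FP_ROUND (bits 3:0) and FP_DENORM
# (bits 7:4) subfields as described in the table above.

ROUND = {0: "nearest even", 1: "+infinity", 2: "-infinity", 3: "toward zero"}

def decode_mode(mode):
    fp_round = mode & 0xF          # bits 3:0
    fp_denorm = (mode >> 4) & 0xF  # bits 7:4
    return {
        "round_single": ROUND[fp_round & 0x3],              # FP_ROUND[1:0]
        "round_double_half": ROUND[(fp_round >> 2) & 0x3],  # FP_ROUND[3:2]
        "denorm_single": fp_denorm & 0x3,                   # FP_DENORM[1:0]
        "denorm_double_half": (fp_denorm >> 2) & 0x3,       # FP_DENORM[3:2]
    }
```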
DX10_CLAMP 8 Used by the vector ALU to force DX10-style treatment of NaNs: when set,
clamp NaN to zero; otherwise, pass NaN through.
IEEE 9 Floating point opcodes that support exception flag gathering quiet and
propagate signaling NaN inputs per IEEE 754-2008. Min_dx10 and max_dx10
become IEEE 754-2008 compliant due to signaling NaN propagation and
quieting.
LOD_CLAMPED 10 Sticky bit indicating that one or more texture accesses had their LOD
clamped.
DEBUG 11 Forces the wavefront to jump to the exception handler after each instruction is
executed (but not after ENDPGM). Only works if TRAP_EN = 1.
EXCP_EN 20:12 Enable mask for exceptions. Enabled means if the exception occurs and
TRAP_EN==1, a trap is taken.
[12] : invalid.
[13] : inputDenormal.
[14] : float_div0.
[15] : overflow.
[16] : underflow.
[17] : inexact.
[18] : int_div0.
[19] : address watch
Out-of-range can occur through GPR-indexing or bad programming. It is illegal to index from
one register type into another (for example: SGPRs into trap registers or inline constants). It is
also illegal to index within inline constants.
The following describe the out-of-range behavior for various storage types.
• SGPRs
◦ SGPRs cannot be "out of range".
However, it is illegal to index from one range to another, or for a 64-bit operand to
straddle two ranges.
The ranges are: [ SGPRs 0-105 and VCCH, VCCL], [ Trap Temps 0-15 ], [ all other
values ]
• VGPRs
◦ It is illegal to index from SGPRs into VGPRs, or vice versa.
◦ Out-of-range = (vgpr < 0 || (vgpr >= vgpr_size))
◦ If a source VGPR is out of range, VGPR0 is used.
◦ If a destination VGPR is out-of-range, the instruction is ignored and nothing is written
(treated as an NOP).
• LDS
◦ If the LDS-ADDRESS is out-of-range (addr < 0 or addr >= MIN(lds_size, M0)):
▪ Writes out-of-range are discarded; it is undefined if SIZE is not a multiple of write-data-size.
▪ Reads return the value zero.
◦ If any source-VGPR is out-of-range, the VGPR0 value is used.
◦ If the dest-VGPR is out of range, nullify the instruction (issue with exec=0)
• Memory, LDS, and GDS: Reads and atomics with returns.
◦ If any source VGPR or SGPR is out-of-range, the data value is undefined.
◦ If any destination VGPR is out-of-range, the operation is nullified by issuing the
instruction as if the EXEC mask were cleared to 0.
▪ This out-of-range check must check all VGPRs that can be returned (for example:
VDST to VDST+3 for a BUFFER_LOAD_DWORDx4).
▪ This check must also include the extra PRT (partially resident texture) VGPR and
nullify the fetch if this VGPR is out-of-range, no matter whether the texture system
actually returns this value or not.
▪ Atomic operations with out-of-range destination VGPRs are nullified: issued, but
with exec mask of zero.
Instructions with multiple destinations (for example: V_ADDC): if any destination is out-of-range,
no results are written.
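The VGPR out-of-range rules above can be summarized in a small sketch (illustrative Python; the helper names are hypothetical):

```python
# Sketch of the VGPR out-of-range rules: an out-of-range source reads VGPR0;
# an out-of-range destination turns the instruction into a NOP (nothing written).

def vgpr_out_of_range(vgpr, vgpr_size):
    return vgpr < 0 or vgpr >= vgpr_size

def read_src_vgpr(vgprs, idx):
    """Out-of-range source VGPRs read as VGPR0."""
    return vgprs[0] if vgpr_out_of_range(idx, len(vgprs)) else vgprs[idx]

def write_dst_vgpr(vgprs, idx, value):
    """Out-of-range destination: instruction ignored, nothing written."""
    if not vgpr_out_of_range(idx, len(vgprs)):
        vgprs[idx] = value
```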
Even-aligned SGPR pairs are required in the following cases:
• When 64-bit data is used. This is required for moves to/from 64-bit registers, including the
PC.
• When the address-base of a scalar memory read comes from an SGPR pair.
Quad-alignment is required for the data-GPR when a scalar memory read returns four or more
Dwords. When a 64-bit quantity is stored in SGPRs, the LSBs are in SGPR[n], and the MSBs
are in SGPR[n+1].
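The LSB/MSB placement can be sketched as follows (illustrative Python; the even-alignment check mirrors the rule above):

```python
# Sketch: a 64-bit value stored across an even-aligned SGPR pair,
# with the LSBs in SGPR[n] and the MSBs in SGPR[n+1].

def store_u64(sgprs, n, value):
    assert n % 2 == 0, "64-bit operands require an even-aligned SGPR pair"
    sgprs[n] = value & 0xFFFFFFFF              # low 32 bits -> SGPR[n]
    sgprs[n + 1] = (value >> 32) & 0xFFFFFFFF  # high 32 bits -> SGPR[n+1]

def load_u64(sgprs, n):
    return sgprs[n] | (sgprs[n + 1] << 32)
```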
Shared VGPRs logically occupy the VGPR addresses immediately following the private VGPRs.
E.g., if a wave has 8 private VGPRs, they are V0-V7 and shared VGPRs start at V8. If there are
16 shared VGPRs, they are accessed as V8-V23.
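This address mapping can be sketched as follows (illustrative Python; the function name is hypothetical):

```python
# Sketch: classify a VGPR address given the private/shared split described
# above. With 8 private VGPRs (V0-V7), shared VGPRs start at V8.

def classify_vgpr(v, num_private, num_shared):
    if v < num_private:
        return ("private", v)
    if v < num_private + num_shared:
        return ("shared", v - num_private)  # index within the shared block
    return ("out-of-range", None)
```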
The SCC can be used as the carry-in for extended-precision integer arithmetic, as well as the
selector for conditional moves and branches.
There is also a VCC summary bit (vccz) that is set to 1 when the VCC result is zero. This is
useful for early-exit branch tests. VCC is also set for selected integer ALU operations (carry-
out).
Vector compares have the option of writing the result to VCC (32-bit instruction encoding) or to
any SGPR (64-bit instruction encoding). VCCZ is updated every time VCC is updated: vector
compares and scalar writes to VCC.
The EXEC mask determines which threads execute an instruction. The VCC indicates which
executing threads passed the conditional test, or which threads generated a carry-out from an
integer add or subtract.
When an instruction sources VCC, that counts against the limit on the total number of SGPRs that
can be sourced for a given instruction. VCC physically resides in the highest
two user SGPRs.
When used by a wave32, the upper 32 bits of VCC are unused and only the lower 32 bits of
VCC contribute to the value of VCCZ.
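The wave32 behavior of VCCZ can be sketched as follows (illustrative Python; VCC modeled as a 64-bit integer):

```python
# Sketch: VCCZ computed from VCC. For wave32, only the low 32 bits of VCC
# contribute; the upper 32 bits are ignored.

def vccz(vcc, wave64):
    mask = (1 << 64) - 1 if wave64 else (1 << 32) - 1
    return 1 if (vcc & mask) == 0 else 0
```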
All Trap temporary SGPRs (TTMP*) are privileged for writes - they can be written only when in
the trap handler (status.priv = 1). When not privileged, writes to these are ignored. TMA and
TBA are read-only; they can be accessed through S_GETREG_B32.
When a trap is taken (either user-initiated, exception, or host-initiated), the shader hardware is
designed to generate an S_TRAP instruction. This loads trap information into a pair of SGPRs:
HT is set to one for host initiated traps, and zero for user traps (s_trap) or exceptions. TRAP_ID
is zero for exceptions, or the user/host trapID for those traps. When the trap handler is entered,
the PC of the faulting instruction is: (PC - PC_rewind*4).
STATUS . TRAP_EN - This bit indicates to the shader whether a trap handler is present. When
one is not present, traps are not taken, whether they are floating-point, user-, or host-initiated
traps. When the trap handler is present, the wavefront uses an extra 16 SGPRs for trap
processing. If TRAP_EN == 0, all traps and exceptions are ignored, and S_TRAP is converted
by hardware to NOP.
MODE . EXCP_EN[8:0] - Floating point exception enables. Defines which exceptions and
events cause a trap.
Bit Exception
0 Invalid
1 Input Denormal
2 Divide by zero
3 Overflow
4 Underflow
5 Inexact
EXCP 8:0 Status bits indicating which exceptions have occurred. These bits are sticky and
accumulate results until the shader program clears them. These bits are
accumulated regardless of the setting of EXCP_EN. They can be read or
written without shader privilege.
Bit Exception
0 Invalid
1 Input Denormal
2 Divide by zero
3 Overflow
4 Underflow
5 Inexact
6 Integer divide by zero
7 Address watch 0
8 Memory violation
SAVECTX 10 A bit set by the host command indicating that this wave must jump to its trap
handler and save its context. This bit must be cleared by the trap handler using
S_SETREG. Note - a shader can set this bit to 1 to cause a save-context trap,
and due to hardware latency the shader may execute up to 2 additional
instructions before taking the trap.
ADDR_WATCH1-3 14:12 Indicates that address watch 1, 2, or 3 has been hit. Bit 12 is address watch 1;
bit 13 is 2; bit 14 is 3.
DP_RATE 31:29 Determines how the shader interprets the TRAP_STS.cycle. Different Vector
Shader Processors (VSP) process instructions at different rates.
Memory Buffer to LDS does NOT return a memory violation if the LDS address is out of range,
but masks off EXEC bits of threads that would go out of range.
When a memory access is in violation, the appropriate memory (LDS or cache) returns
MEM_VIOL to the wave. This is stored in the wave’s TRAPSTS.mem_viol bit. This bit is sticky,
so once set to 1, it remains at 1 until the user clears it.
Memory violations are fatal: if a trap handler is present and the wave is not already in the trap
handler, the wave jumps to the trap handler; otherwise it signals an interrupt and halt.
Memory violations are not precise. The violation is reported when the LDS or cache processes
the address; by that time, the wave may have processed many more instructions. When a
mem_viol is reported, the Program Counter saved is that of the next instruction to execute; it
has no relationship to the faulting instruction.
State initialization is controlled by state registers which are defined in other documentation.
Execute mask (EXEC): the work-item valid mask, indicating which work-items are valid for this
wavefront. Wave32 uses only bits 31-0. The combined ES+GS and HS+LS stages load a
dummy non-zero value into EXEC, and the shader must calculate the real value from
initialized SGPRs.
S_ENDPGM Terminates the wavefront. It can appear anywhere in the kernel and can appear multiple
times.
S_ENDPGM_SAVED Terminates the wavefront due to context save. It can appear anywhere in the kernel and can
appear multiple times.
S_VERSION Does nothing (treated as S_NOP), but can be used as a code comment to indicate the
hardware version the shader is compiled for (using the SIMM16 field).
S_CODE_END Treated as an illegal instruction. Used to pad past the end of shaders.
Clauses are defined by the S_CLAUSE instruction, which specifies the number of instructions
that make up the clause. The clause type is implicitly defined by the type of the instruction
immediately following the S_CLAUSE instruction. Clause types are:
• VALU
• SMEM
• LDS
• FLAT
• Texture, buffer, global and scratch
4.2. Branching
Branching is done using one of the following scalar ALU instructions.
S_CBRANCH_<test> Conditional branch. Branch only if <test> is true. Tests are VCCZ, VCCNZ,
EXECZ, EXECNZ, SCCZ, and SCCNZ.
S_SUBVECTOR_LOOP_BEGIN Starts a subvector execution loop. The SIMM16 field is the branch offset to the
instruction after S_SUBVECTOR_LOOP_END, and the SGPR is used for
temporary EXEC storage.
S_SUBVECTOR_LOOP_END Marks the end of the subvector execution loop. The SIMM16 field points back to
the instruction after S_SUBVECTOR_LOOP_BEGIN, and the SGPR is used for
temporary EXEC storage.
For conditional branches, the branch condition can be determined by either scalar or vector
operations. A scalar compare operation sets the Scalar Condition Code (SCC), which then can
be used as a conditional branch condition. Vector compare operations set the VCC mask, and
VCCZ or VCCNZ then can be used to determine branching.
The normal method is to issue each half of a wave64 as two wave32 instructions, then move on
to the next instruction. The alternative method is to issue a group of instructions for the first 32
work-items, then come back and execute the same instructions for the second 32 work-items.
This has two potential advantages:
• Memory operations are for smaller units of work and may cache better
◦ example: reading multiple entries from a strided buffer
• Wave-temporary VGPRs are available:
◦ In Wave64 each wave may declare N normal VGPRs (the wave gets 64 * N dwords,
with N per work-item), and M temp VGPRs which may only be used in this mode. The
temp VGPRs are physically adjacent to the normal ones, but logically are from just after
the private VGPRs. These can be used on each pass of the subvector execution.
Subvector execution is simply a loop construct where half of the EXEC mask is zero for each
pass over the body of the code. All wave64 rules still apply. The loop executes zero, one or two
times, depending on the initial state of the EXEC mask. During each pass of the loop, one half
of EXEC is forced to zero (after being saved in an SGPR). The EXEC mask is restored at the
end of the loop.
If EXEC_HI = 0, the body is executed only once: EXEC_HI is stored in S0 and restored at the
end, but it was zero anyway. If EXEC_LO was zero at the start, the same thing happens. If both
halves of EXEC are non-zero, the low pass runs first (storing EXEC_HI in S0), then EXEC_HI is
restored and EXEC_LO is saved off for the second pass. EXEC_LO is restored at the end of the
second pass. The "pass #" is encoded by observing which half of EXEC is zero.
Subvector looping imposes a rule that the “body code” cannot let the working half of the exec
mask go to zero. If it might go to zero, it must be saved at the start of the loop and be restored
before the end since the S_SUBVECTOR_LOOP_* instructions determine which pass they’re in
by looking at which half of EXEC is zero.
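The pass structure described above can be modeled as follows (illustrative Python; `subvector_passes` is an invented name, and the returned list shows the effective EXEC value used on each pass):

```python
def subvector_passes(exec_mask):
    # Model of S_SUBVECTOR_LOOP_BEGIN/END: the loop body runs zero, one,
    # or two times; on each pass one half of EXEC is forced to zero
    # (after being saved in an SGPR) and restored afterwards.
    exec_lo = exec_mask & 0xFFFFFFFF
    exec_hi = exec_mask >> 32
    passes = []
    if exec_lo != 0:
        passes.append(exec_lo)        # low pass: EXEC_HI forced to 0
    if exec_hi != 0:
        passes.append(exec_hi << 32)  # high pass: EXEC_LO forced to 0
    return passes                     # full EXEC is restored at loop end
```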
4.3. Workgroups
Work-groups are collections of wavefronts running on the same workgroup processor which can
synchronize and share data. Up to 1024 work-items (16 wave64’s or 32 wave32’s) can be
combined into a work-group. When multiple wavefronts are in a workgroup, the S_BARRIER
instruction can be used to force each wavefront to wait until all other wavefronts reach the same
instruction; then, all wavefronts continue. Any wavefront may terminate early using S_ENDPGM,
and the barrier is considered satisfied when the remaining live waves reach their barrier
instruction.
The shader has four counters that track the progress of issued instructions. S_WAITCNT waits
for the values of these counters to be at, or below, specified values before continuing.
These allow the shader writer to schedule long-latency instructions, execute unrelated work,
and specify when results of long-latency operations are needed.
Instructions of a given type return in order, but instructions of different types can complete out-
of-order. For example, both GDS and LDS instructions use LGKM_cnt, but they can return out-
of-order. VMEM loads update VM_CNT in the order the instructions were issued, so waiting on
VM_CNT to be less-than a particular value ensures all previous loads have completed. It is
possible for data to be written to VGPRs out-of-order. Stores from a wave are not kept in order
with stores from that wave.
VM_CNT
Vector memory count (reads, atomic with return). Determines when memory reads have
finished.
VS_CNT
Vector memory store count (writes, atomic without return). Determines when memory writes
have completed.
LGKM_CNT
(LDS, GDS, (K)constant, (M)essage) Determines when one of these low-latency instructions
have completed.
EXP_CNT
VGPR-export count. Determines when data has been read out of the VGPR and sent to
GDS, at which time it is safe to overwrite the contents of that VGPR.
Figure 5. Scalar ALU format with one immediate value source operand
Figure 6. Scalar ALU format for compares, with two sources but no destination
Field Description
The lists of similar instructions sometimes use a condensed form using curly braces { } to
express a list of possible names. For example, S_AND_{B32, B64} defines two legal
instructions: S_AND_B32 and S_AND_B64.
In the table below, 0-127 can be used as scalar sources or destinations; 128-255 can only be
used as sources.
128 0 zero
237 PRIVATE_BASE
238 PRIVATE_LIMIT
241 -0.5
242 1.0
243 -1.0
244 2.0
245 -2.0
246 4.0
247 -4.0
The SALU cannot use VGPRs or LDS. SALU instructions can use a 32-bit literal constant. This
constant is part of the instruction stream and is available to all SALU microcode formats except
SOPP and SOPK. Literal constants are used by setting the source instruction field to "literal"
(255), and then the following instruction dword is used as the source value.
If the destination SGPR is out-of-range, no SGPR is written with the result. However, SCC and
possibly EXEC (if saveexec) is still written.
If an instruction uses 64-bit data in SGPRs, the SGPR pair must be aligned to an even
boundary. For example, it is legal to use SGPRs 2 and 3 or 8 and 9 (but not 11 and 12) to
represent 64-bit data.
S_MOV_{B32,B64} SOP1 n D = S0
{S_NAND,S_NOR,S_XNOR}_{B32,B64} SOP2 y D = ~(S0 & S1), ~(S0 OR S1), ~(S0 XOR S1)
S_BFM_{B32,B64} SOP2 n Bit field mask. D = ((1 << S0[4:0]) - 1) << S1[4:0].
S_BFE_U32, S_BFE_U64 SOP2 y Bit field extract (signed/unsigned). S0 = data,
S_BFE_I32, S_BFE_I64 S1[5:0] = offset, S1[22:16] = width. The result is
sign-extended for the I32/I64 instructions.
S_FLBIT_I32_{B32,B64} SOP1 n Find last bit. D = the number of zeros before the
first one starting from the MSB. Returns -1 if none.
S_FLBIT_I32 SOP1 n Count how many bits in a row (from MSB to LSB)
S_FLBIT_I32_I64 are the same as the sign bit. Returns -1 if the input
is zero or all 1's (-1). 32-bit pseudo-code:
if (S0 == 0 || S0 == -1)
    D = -1
else
    D = 0
    for (I = 31 .. 0)
        if (S0[I] == S0[31])
            D++
        else
            break
This opcode behaves the same as V_FFBH_I32.
S_SETREG_B32 SOPK* n Write the LSBs of D into a hardware register. (Note that D is a
source SGPR.) Must add an S_NOP between two consecutive
S_SETREG to the same register.
S_SETREG_IMM32_B32 SOPK* n S_SETREG where 32-bit data comes from a literal constant (so
this is a 64-bit instruction format).
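Several of the bit-manipulation entries above can be modeled directly. The following Python sketch (illustrative only, not an official reference model) implements the S_BFM_B32, S_BFE_U32/I32, and S_FLBIT_I32 semantics as described in the table:

```python
def s_bfm_b32(s0, s1):
    # Bit field mask: D = ((1 << S0[4:0]) - 1) << S1[4:0]
    return (((1 << (s0 & 0x1F)) - 1) << (s1 & 0x1F)) & 0xFFFFFFFF

def s_bfe_u32(s0, s1):
    # Bit field extract: S0 = data, S1[5:0] = offset, S1[22:16] = width
    offset = s1 & 0x3F
    width = (s1 >> 16) & 0x7F
    return (s0 >> offset) & ((1 << width) - 1)

def s_bfe_i32(s0, s1):
    # Signed variant: the extracted field is sign-extended
    val = s_bfe_u32(s0, s1)
    width = (s1 >> 16) & 0x7F
    if width and (val >> (width - 1)) & 1:
        val -= 1 << width
    return val

def s_flbit_i32(s0):
    # Count how many MSBs in a row match the sign bit; -1 for 0 or -1
    s0 &= 0xFFFFFFFF
    if s0 in (0, 0xFFFFFFFF):
        return -1
    sign = s0 >> 31
    count = 0
    for i in range(31, -1, -1):
        if (s0 >> i) & 1 == sign:
            count += 1
        else:
            break
    return count
```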
The hardware register is specified in the DEST field of the instruction, using the values in the
table above. Some bits of the DEST specify which register to read/write, but additional bits
specify which bits in the register to read/write:
0 reserved
1 MODE R/W.
3 TRAPSTS R/W.
8 - 14 reserved.
25 POPS_PACKER Bit [0] = POPS enabled for this wave; bits [2:1] = Pops Packer ID
29 SHADER_CYCLES Return the value of a 20-bit clock cycle counter. Used for measuring time-delta within a
wave, not between waves.
VM_CNT 23:22, 3:0 Number of VMEM load instructions issued but not yet returned.
EXP_CNT 6:4 Number of Exports issued but have not yet read their data from VGPRs.
LGKM_CNT 11:8 LDS, GDS, Constant-memory and Message instructions issued-but-not-completed count.
VS_CNT 31:26 Number of VMEM store instructions issued but not yet returned.
VGPR_BASE 5:0 Physical address of first VGPR assigned to this wavefront, as [7:2]
VGPR_SIZE 13:8 Number of VGPRs assigned to this wavefront, as [7:2]+4. 0=4 VGPRs, 1=8 VGPRs, etc.
LDS_BASE 7:0 Physical address of first LDS location assigned to this wavefront, in
units of 64 Dwords.
VGPR_SHARED_SIZE 27:24 Number of shared VGPRs allocated to this wave, in units of 8 VGPRs.
(0=0vgprs, 1=8vgprs, …)
Parameter interpolation is a mixed VALU and LDS instruction, and is described in the Data
Share chapter.
When an instruction is available in two microcode formats, it is up to the user to decide which to
use. It is recommended to use the 32-bit encoding whenever possible.
VOP2 is for instructions with two inputs and a single vector destination. Instructions that have a
carry-out implicitly write the carry-out to the VCC register.
VOP1 is for instructions with no inputs or a single input and one destination.
VOP3 is for instructions with up to three inputs, input modifiers (negate and absolute value), and
output modifiers. There are two forms of VOP3: one uses a scalar destination field (used only
for div_scale and integer add/subtract) and is designated VOP3b; all other instructions use the
common form, designated VOP3a.
Any of the 32-bit microcode formats may use a 32-bit literal constant, as may VOP3. Note,
however, that VOP3 plus a literal makes a 96-bit instruction, and excessive use of this
combination may reduce performance.
VOP3P is for instructions that use "packed math": These instructions perform an operation on a
pair of input values that are packed into the high and low 16-bits of each operand; the two 16-bit
results are written to a single VGPR as two packed values.
6.2. Operands
All VALU instructions take at least one input operand (except V_NOP and V_CLREXCP). The
data-size of the operands is explicitly defined in the name of the instruction. For example,
V_FMA_F32 operates on 32-bit floating point data.
124 M0 M0 register
128 0 Zero
234 DPP8FI DPP - 8-lane transfer with fetch from invalid lanes (only valid as source-0).
236 SHARED_LIMIT
237 PRIVATE_BASE
238 PRIVATE_LIMIT
241 -0.5
242 1.0
243 -1.0
244 2.0
245 -2.0
246 4.0
247 -4.0
248 1/(2*PI) 1/(2*PI) is 0.15915494. The exact value used is:
half: 0x3118; single: 0x3e22f983; double: 0x3fc45f306dc9c882
254 LDS direct Use LDS direct read to supply a 32-bit value. Vector ALU instructions only.
• VGPRs
• SGPRs
• Inline constants - constant selected by a specific VSRC value
• Literal constant - 32-bit value in the instruction stream.
• LDS direct data read
• M0
• EXEC mask
Limitations
• At most two scalar values can be read per instruction, but the values can be used for more
than one operand.
◦ Scalar values include: SGPRs, VCC, EXEC (used as data), and literal constants
◦ Some instructions implicitly read an SGPR (which includes VCC), and this implicit read
counts against the total supported limit.
▪ These are: Add/sub with carry-in, FMAS and CNDMASK
◦ 64-bit shift instructions can use only a single scalar value, not two
• At most one literal constant can be used
• Inline constants are free, and do not count against these limits
• Only SRC0 can use LDS_DIRECT (see Chapter 10, "Data Share Operations")
Instructions using the VOP3 form and also using floating-point inputs have the option of
applying absolute value (ABS field) or negate (NEG field) to any of the input operands.
Literal constants are 32-bits, but they can be used as sources which normally require 64-bit
data. They are expanded to 64 bits following these rules:
All V_CMPX instructions write the result of their comparison (one bit per thread) to the EXEC
mask.
Instructions producing a carry-out (integer add and subtract) write their result to VCC when used
in the VOP2 form, and to an arbitrary SGPR-pair when used in the VOP3 form.
When the VOP3 form is used, instructions with a floating-point result can apply an output
modifier (OMOD field) that multiplies the result by: 0.5, 1.0, 2.0 or 4.0. Optionally, the result can
be clamped (CLAMP field) to the range [0.0, +1.0].
Output modifiers apply only to floating point results and are ignored for integer or bit results.
Output modifiers are not compatible with output denormals: if output denormals are enabled,
then output modifiers are ignored. If output denormals are disabled, then the output modifier is
applied and denormals are flushed to zero. Output modifiers are not IEEE compatible: -0 is
flushed to +0. Output modifiers are ignored if the IEEE mode bit is set to 1.
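The output-modifier and clamp behavior can be sketched as follows (illustrative Python; denormal flushing and IEEE-mode interactions are omitted for brevity):

```python
def apply_output_mods(result, omod=1.0, clamp=False):
    # OMOD multiplies a floating-point result by 0.5, 1.0, 2.0 or 4.0;
    # CLAMP then restricts it to the range [0.0, +1.0]
    result *= omod
    if clamp:
        result = min(max(result, 0.0), 1.0)
    return result
```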
In the table below, all codes can be used when the vector source is nine bits; codes 0 to 255
can be the scalar source if it is eight bits; codes 0 to 127 can be the scalar source if it is seven
bits; and codes 256 to 511 can be the vector source or destination.
V_MIN3_{F16,I16,U16} V_ASHRREV_I16
V_MAX3_{F16,I16,U16} V_MAX_U16
V_MED3_{F16,I16,U16} V_MAX_I16
V_PACK_F16 V_MIN_U16
V_MIN_I16
When the destination GPR is out-of-range, the instruction executes but does not write the
results.
6.3. Instructions
The table below lists the complete VALU instruction set by microcode encoding, except for
VOP3P instructions which are listed in a later section.
V_PERM_B32
V_QSAD_PK_U16_U8
V_XAD_U32
V_XOR3_B32
V_SAT_PK_U8_I16
V_SWAP_B32
VOP3P:
V_DOT2_F32_F16
V_DOT2C_F32_F16
V_DOT2_I32_I16
V_DOT2_U32_U16
V_DOT4_I32_I8
V_DOT4C_I32_I8
V_DOT4_U32_U8
V_DOT8_I32_I4
V_DOT8_U32_U4
V_CMP        I16, I32, I64, U16,   F, LT, EQ, LE, GT, LG, GE, T              Write VCC.
V_CMPX       U32, U64                                                        Write EXEC.
V_CMP        F16, F32, F64         F, LT, EQ, LE, GT, LG, GE, T,             Write VCC.
V_CMPX                             O, U, NGE, NLG, NGT, NLE, NEQ, NLT        Write EXEC.
                                   (O = total order, U = unordered,
                                   N = NaN or normal compare)
V_CMP_CLASS  F16, F32, F64         Test for one of: signaling NaN, quiet     Write VCC.
V_CMPX_CLASS                       NaN, positive or negative: infinity,      Write EXEC.
                                   normal, subnormal, zero.
Round and denormal modes can also be set using S_ROUND_MODE and
S_DENORM_MODE.
The table below describes the instructions which enable, disable and control VGPR indexing.
V_MOVRELSD_2_B32 VOP1 Move with relative source and destination, each different:
VGPR[D+M0[25:16]] = VGPR[S0+M0[7:0]].
V_SWAPREL_B32 VOP1 Swap two VGPRs, each relative to a separate index: swap
VGPR[D+M0[25:16]] with VGPR[S0+M0[7:0]].
Packed math uses the instructions below and the microcode format "VOP3P". This format adds
op_sel and neg fields for both the low and high operands, and removes ABS and OMOD.
V_FMA_MIX_* are not packed math, but perform a single MAD operation on
a mixture of 16- and 32-bit inputs. They are listed here because they use the
VOP3P encoding.
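The packed data layout can be illustrated with an unsigned 16-bit packed add (illustrative Python modeling the general VOP3P pattern; per-lane wrap-around on overflow is assumed for this sketch):

```python
def v_pk_add_u16(a, b):
    # Each 32-bit operand packs two 16-bit values: one in bits [15:0],
    # one in bits [31:16]; the two lanes are added independently.
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo
```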
A scan operation is one that computes a value per thread based on the values of the
previous threads and possibly itself. For example, a running sum is the sum of the values from
previous threads in the vector. A reduction operation is essentially a scan that returns a single
value from the highest-numbered active thread. These operations take the SP multiple
instruction cycles (at least 8 times what an ADD_F32 takes). Rather than make these a single
macro in SQ, the shader program has unique instructions for each pass of the scan. This
prevents instruction-scheduling issues (other waves may execute in between these individual
stage instructions) and allows more general flexibility.
Use of DPP is indicated by setting the SRC0 operand to a literal constant: DPP8 or DPP16.
Because SRC0 is set to the literal value, the actual VGPR address for source 0 comes from the
literal constant (DPP). The scan operation requires the EXEC mask to be set to all 1's for
proper operation. Prior to the scan, unused threads (lanes) should be set to a value that does
not change the result. READLANE, READFIRSTLANE, and WRITELANE cannot be used with
DPP.
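The effect of a running-sum scan (computed by hardware in multiple DPP passes, serialized here for clarity) can be sketched as follows (illustrative Python):

```python
def inclusive_scan_add(lane_values):
    # Each lane's result is the sum of all previous lanes plus itself;
    # a reduction is the final (highest-lane) value of the scan.
    out, running = [], 0
    for v in lane_values:
        running += v
        out.append(running)
    return out
```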
The scalar unit reads consecutive Dwords from memory to the SGPRs. This is intended
primarily for loading ALU constants and for indirect T#/S# lookup. No data formatting is
supported, nor is byte or short data.
OP 8 Opcode.
SBASE 6 SGPR-pair (SBASE has an implied LSB of zero) which provides a base address, or for BUFFER
instructions, a set of 4 SGPRs (4-sgpr aligned) which hold the resource constant. For BUFFER
instructions, the only resource fields used are: base, stride, num_records.
OFFSET 21 An immediate signed byte offset. Must be positive with s_buffer operations.
SOFFSET 7 The address of an SGPR which supplies an unsigned byte address offset. Set this to NULL to
disable.
7.2. Operations
7.2.1. S_LOAD_DWORD
These instructions load 1-16 Dwords from memory. The data in SGPRs is specified in SDATA,
and the address is composed of the SBASE, OFFSET, and SOFFSET fields.
S_LOAD :
All components of the address (base, offset, inst_offset, M0) are in bytes, but the two LSBs are
ignored and treated as if they were zero.
Scalar access to private (scratch) space must either use a buffer constant or manually convert
the address.
Buffer constant fields used: base_address, stride, num_records. Other fields are ignored.
Scalar memory read does not support "swizzled" buffers. Stride is used only for memory
address bounds checking, not for computing the address to access.
The SMEM supplies only an SBASE address (byte) and an offset (byte or Dword). Any "index *
stride" must be calculated manually in shader code and added to the offset prior to the SMEM.
The two LSBs of V#.base and of the final address are ignored to force Dword alignment.
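The byte-address composition with forced Dword alignment can be sketched as follows (illustrative Python; the parameter names are invented for this sketch):

```python
def smem_address(sbase_value, inst_offset, soffset_value):
    # All address components are byte quantities; the two LSBs of the
    # final address are ignored (treated as zero) to force Dword alignment.
    addr = sbase_value + inst_offset + soffset_value
    return addr & ~0x3
```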
7.2.2. S_DCACHE_INV
This instruction invalidates the entire scalar cache. It does not return anything to SDST.
7.2.3. S_MEMREALTIME
This instruction reads a 64-bit "real time-counter" and returns the value into a pair of SGPRs:
SDST and SDST+1. The time value comes from a clock whose frequency is constant (not
affected by power modes or core clock frequency changes).
Because the instructions can return out-of-order, the only sensible way to use this counter is to
implement S_WAITCNT 0; this imposes a wait for all data to return from previous SMEMs
before continuing.
A group is a set of the same type of instruction that happen to occur in the code but are not
necessarily executed as a clause. A group ends when a non-SMEM instruction is encountered.
Scalar memory instructions are issued in groups. The hardware does not enforce that a single
wave will execute an entire group before issuing instructions from another wave.
Group restrictions:
Instruction ordering
The data cache is free to re-order instructions. The only assurance of ordering comes when the
shader executes an S_WAITCNT LGKMcnt==0. Cache invalidate instructions are not assured to
have completed until the shader waits for LGKM_CNT to reach zero.
SBASE
The value of SBASE must be even for S_BUFFER_LOAD (specifying the address of an
SGPR which is a multiple of four). If SBASE is out-of-range, the value from SGPR0 is used.
OFFSET
The value of OFFSET has no alignment restrictions.
Memory Address : If the memory address is out-of-range (clamped), the operation is not
performed for any Dwords that are out-of-range.
Software initiates a load, store or atomic operation through the texture cache using one of
three types of VMEM instructions:
The instruction defines which VGPR(s) supply the addresses for the operation, which VGPRs
supply or receive data from the operation, and a series of SGPRs that contain the memory
buffer descriptor (V# or T#). Also, MIMG operations supply a texture sampler (S#) from a series
of four SGPRs; this sampler defines texel filtering operations to be performed on data read from
the image.
Buffer reads have the option of returning data to VGPRs or directly into LDS.
Examples of buffer objects are vertex buffers, raw buffers, stream-out buffers, and structured
buffers.
Buffer objects support both homogeneous and heterogeneous data, but no filtering of read-data
(no samplers). Buffer instructions are divided into two groups:
◦ The only operations are Load and Store, both with data format conversion.
Atomic operations take data from VGPRs and combine them arithmetically with data already in
memory. Optionally, the value that was in memory before the operation took place can be
returned to the shader.
All VM operations use a buffer resource constant (V#) which is a 128-bit value in SGPRs. This
constant is sent to the texture cache when the instruction is executed. This constant defines the
address and characteristics of the buffer in memory. Typically, these constants are fetched from
memory using scalar memory reads prior to executing VM instructions, but these constants also
can be generated within the shader.
The D16 instruction variants convert the results to packed 16-bit values. For example,
BUFFER_LOAD_FORMAT_D16_XYZW writes two VGPRs.
MTBUF Instructions
TBUFFER_LOAD_FORMAT_{x,xy,xyz,xyzw} Read from, or write to, a typed buffer object. Also used for a
TBUFFER_STORE_FORMAT_{x,xy,xyz,xyzw} vertex fetch.
TBUFFER_LOAD_FORMAT_D16_{x,xy,xyz,xyzw}
TBUFFER_STORE_FORMAT_D16_{x,xy,xyz,xyzw}
MUBUF Instructions
Instruction Description
VADDR 8 Address of VGPR to supply first component of address (offset or index). When both index and
offset are used, index is in the first VGPR, offset in the second.
VDATA 8 Address of VGPR to supply first component of write data or receive first component of read-
data.
SOFFSET 8 SGPR to supply unsigned byte offset. SGPR, M0, or inline constant.
SRSRC 5 Specifies which SGPR supplies V# (resource constant) in four consecutive SGPRs. This field
omits the two LSBs of the SGPR address, since this address is aligned to a multiple of
four SGPRs.
FORMAT 7 Data Format of data in memory buffer. See: Buffer Image format Table
GLC 1 Globally Coherent. Controls how reads and writes are handled by the L0 texture cache.
READ
GLC = 0 Reads can hit on the L0 and persist across wavefronts
GLC = 1 Reads miss the L0 and force fetch to L2. No L0 persistence across waves.
WRITE
GLC = 0 Writes miss the L0, write through to L2, and persist in L0 across wavefronts.
GLC = 1 Writes miss the L0, write through to L2. No persistence across wavefronts.
ATOMIC
GLC = 0 Previous data value is not returned. No L0 persistence across wavefronts.
GLC = 1 Previous data value is returned. No L0 persistence across wavefronts.
Note: GLC means "return pre-op value" for atomics.
DLC 1 Device Level Coherent. When set, accesses are forced to miss in level 1
SLC 1 System Level Coherent. Used in conjunction with DLC to determine L2 cache policies.
TFE 1 Texel Fault Enable for PRT (partially resident textures). When set to 1 and fetch returns a
NACK, status is written to the VGPR at DST+1 (first VGPR after all fetch-dest VGPRs).
Address
Zero, one or two VGPRs are used, depending on the offset-enable (OFFEN) and index-
enable (IDXEN) bits in the instruction word, as shown in the table below:
IDXEN OFFEN VGPR contents
0     0     nothing
0     1     uint offset
1     0     uint index
1     1     uint index, uint offset (index in the first VGPR, offset in the second)
Write Data : N consecutive VGPRs, starting at VDATA. The data format specified in the
instruction word (FORMAT for MTBUF, or encoded in the opcode field for MUBUF) and D16
setting determines how many Dwords to write.
Read Data Format : Read data is 32 or 16 bits, based on the data format in the instruction or
resource and D16. Float or normalized data is returned as floats; integer formats are returned
as integers (signed or unsigned, same type as the memory storage format). Memory reads of
data in memory that is 32 or 64 bits do not undergo any format conversion unless they return as
16-bit due to D16 being set.
Atomics with Return : Data is read out of the VGPR(s) starting at VDATA to supply to the
atomic operation. If the atomic returns a value to VGPRs, that data is returned to those same
VGPRs starting at VDATA.
The data format can come from the resource, instruction fields, or the opcode itself. Dst_sel
comes from the resource, but is ignored for many operations.
Instruction : The instruction’s format field is used instead of the resource’s fields.
Data format derived : The data format is derived from the opcode and ignores the resource
definition. For example, BUFFER_LOAD_UBYTE sets the data format to 8-bit uint.
The resource’s data format must not be INVALID; that format has specific
meaning (unbound resource), and for that case the data format is not
replaced by the instruction’s implied data format.
DST_SEL identity : Depending on the number of components in the data-format, this is: X000,
XY00, XYZ0, or XYZW.
The MTBUF derives the data format from the instruction. The MUBUF
BUFFER_LOAD_FORMAT and BUFFER_STORE_FORMAT instructions use format from the
resource; other MUBUF instructions derive data-format from the instruction itself.
D16 Instructions : Load-format and store-format instructions also come in a "d16" variant. For
stores, each 32-bit VGPR holds two 16-bit data elements that are passed to the texture unit.
This texture unit converts them to the texture format before writing to memory. For loads, data
returned from the texture unit is converted to 16 bits, and a pair of data are stored in each 32-bit
VGPR (LSBs first, then MSBs). Control over int vs. float is controlled by FORMAT.
inst_idxen 1 Boolean: get index from VGPR when true, or no index when false.
inst_offen 1 Boolean: get offset from VGPR when true, or no offset when false. Note that inst_offset is
present, regardless of this bit.
The "element size" for a buffer instruction is the amount of data the instruction transfers, or the
number of contiguous bytes of a record for a given index, and is fixed at 4 bytes.
Range Checking
Range checking determines if a given buffer memory address is in-range (valid) or out of range.
When an address is out of range, writes are ignored (dropped) and reads return zero. Range
checking is controlled by a 2-bit field in the buffer resource: OOB_SELECT (Out of Bounds
select).
Notes:
1. Reads that go out-of-range return zero (except for components with V#.dst_sel = SEL_1
that return 1).
2. Writes that are out-of-range do not write anything.
3. Load/store-format-* instructions and atomics are range-checked "all or nothing": either
entirely in range or entirely out.
4. Load/store-Dword-x{2,3,4} instructions range-check per component.
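Per-component range checking for a multi-Dword load can be sketched as follows (illustrative Python; this simplified model assumes a raw buffer whose extent is expressed in bytes, and the function name is invented):

```python
def buffer_load_dwords(memory, offset, extent_bytes, dword_count):
    # Out-of-range components read as zero; in-range components read memory.
    out = []
    for i in range(dword_count):
        addr = offset + i * 4
        if addr + 4 <= extent_bytes:
            out.append(memory.get(addr, 0))
        else:
            out.append(0)
    return out
```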
Swizzled addressing rearranges the data in the buffer which may improve performance for
arrays of structures. Swizzled addressing also requires Dword-aligned accesses. The buffer’s
STRIDE must be a multiple of element_size.
Remember that the "sgpr_offset" is not a part of the "offset" term in the above equations.
Here are a few proposed uses of swizzled addressing in common graphics buffers.
inst_vgpr_offset_en   T F T T T T
inst_vgpr_index_en    F T T F F F
const_add_tid_enable  F F F T T F
const_buffer_swizzle  F T T T F F
• D16 loads data into or stores data from the lower 16 bits of a VGPR.
• D16_HI loads data into or stores data from the upper 16 bits of a VGPR.
For example, BUFFER_LOAD_UBYTE_D16 reads a byte per work-item from memory, converts
it to a 16-bit integer, then loads it into the lower 16 bits of the data VGPR.
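The merge behavior of the D16 and D16_HI variants can be modeled as follows (illustrative Python, not an official reference model):

```python
def load_ubyte_d16(vgpr, byte, high_half=False):
    # The loaded byte is zero-extended to 16 bits, then written into one
    # half of the destination VGPR; the other half is preserved.
    val = byte & 0xFF
    if high_half:  # D16_HI variant
        return (vgpr & 0x0000FFFF) | (val << 16)
    return (vgpr & 0xFFFF0000) | val  # D16 variant
```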
8.1.7. Alignment
Formatted ops such as BUFFER_LOAD_FORMAT_* must always be aligned to element_size.
The table below details the fields that make up the buffer resource descriptor.
62 1 Cache swizzle Optionally swizzle texture cache (TC) L0 cache banks for buffer access.
104:102 3 Dst_sel_z
107:105 3 Dst_sel_w
118:117 2 Index stride 0:8, 1:16, 2:32, or 3:64. Used for swizzled buffer addressing.
119 1 Add tid enable Add thread ID to the index to calculate the address.
127:126 2 Type Value == 0 for buffer. Overlaps upper two bits of four-bit TYPE field in
128-bit V# resource.
A resource set to all zeros acts as an unbound texture or buffer (return 0,0,0,0).
The figure below shows the components of the LDS and memory address calculation:
TIDinWave is only added if the resource (V#) has the ADD_TID_ENABLE field set to 1, whereas
the LDS address calculation always adds it. The M# is in the VDATA field; it specifies M0.
Clamping Rules
Memory address clamping follows the same rules as any other buffer fetch. LDS address
clamping: the return data cannot be written outside the LDS space allocated to this wave.
• Set the active-mask to limit buffer reads to those threads that return data to a legal LDS
location.
• The LDSbase (alloc) is in units of 32 Dwords, as is LDSsize.
• M0[15:0] is in bytes.
GLC
The GLC bit means different things for loads, stores, and atomic ops.
• For GLC==0
◦ The load can read data from the GPU L0.
◦ Typically, all loads (except load-acquire) use GLC==0.
• For GLC==1
◦ The load intentionally misses the GPU L0 and reads from L2. If there was a line in the
GPU L0 that matched, it is invalidated; L2 is reread.
◦ NOTE: L2 is not re-read for every work-item in the same wavefront for a single load
instruction. For example: b = uav[N + tid] (assume this is a byte read with glc==1 and N
aligned to 64B). In this op, the first work-item of the wavefront brings in the line from L2
or beyond, and the other 63 work-items read from the same cache line in the L0.
For both GLC==0 and GLC==1, write data is combined across the work-items of a wavefront
store clause, which can contain multiple store ops; dirtied lines are automatically written to
the L2 cache and invalidated.
Atomics
The Device Level Coherent (DLC) and System Level Coherent (SLC) bits control the
behavior of the second and third level caches.
DLC SLC L2 cache policy
0   0   LRU
0   1   Bypass
1   1   Hit No Allocate
Image objects are accessed using one- to four-dimensional addresses; they are composed
of homogeneous data of one to four elements. These image objects are read from, or written to,
using IMAGE_* or SAMPLE_* instructions, all of which use the MIMG instruction format.
IMAGE_LOAD instructions read an element from the image buffer directly into VGPRS, and
SAMPLE instructions use sampler constants (S#) and apply filtering to the data after it is read.
IMAGE_ATOMIC instructions combine data from VGPRs with data already in memory, and
optionally return the value that was in memory before the operation.
All VM operations use an image resource constant (T#) that is a 128-bit or 256-bit value in
SGPRs. This constant is sent to the texture cache when the instruction is executed. This
constant defines the address, data format, and characteristics of the surface in memory. Some
image instructions also use a sampler constant that is a 128-bit constant in SGPRs. Typically,
these constants are fetched from memory using scalar memory reads prior to executing VM
instructions, but these constants can also be generated within the shader.
Texture fetch instructions have a data mask (DMASK) field. DMASK specifies how many data
components the shader receives. If DMASK is less than the number of components in the
texture, the texture unit only sends the DMASK components, starting with R, then G, B, and A.
If DMASK specifies more components than the texture format contains, the shader receives
data based on T#.dst_sel for the missing components.
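A minimal model of the read-side selection, assuming an illustrative helper name and representing components as indices (0=R, 1=G, 2=B, 3=A):

```c
#include <stdint.h>

// Sketch of DMASK component selection for texture reads: enabled
// components are returned to consecutive VGPRs in R, G, B, A order.
// out[] receives the selected component indices; the return value is the
// number of VGPRs written. Names are illustrative, not from the spec.
int dmask_components(uint32_t dmask, int out[4])
{
    int n = 0;
    for (int c = 0; c < 4; c++)      // R first, then G, B, A
        if (dmask & (1u << c))
            out[n++] = c;
    return n;
}
```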
GATHER_*          Read up to four texels where each texel contains a single component of an image object data format. It takes 4 instructions to read RGBA.
IMAGE_LOAD_<op>   Read data from an image object using one of the following: image_load, image_load_mip, image_load_{pck, pck_sgn, mip_pck, mip_pck_sgn}.
IMAGE_ATOMIC_<op> Image atomic operation, which is one of the following: swap, cmpswap, add, sub, umin, smin, umax, smax, and, or, xor, inc, dec, fcmpswap, fmin, fmax.
ADDR1 - ADDR12 8 Additional VGPR address fields, used by the MIMG-NSA format (VADDR acts as ADDR0).
VDATA 8 Address of VGPR to supply first component of write data or receive first component of
read-data.
SSAMP 5 SGPR to supply S# (sampler constant) in four consecutive SGPRs. Missing two LSBs of
SGPR-address since it is aligned to a multiple of four SGPRs.
SRSRC 5 SGPR to supply T# (resource constant) in four or eight consecutive SGPRs. Missing two
LSBs of SGPR-address since it is aligned to a multiple of four SGPRs.
UNRM 1 Force address to be un-normalized regardless of T#. Set to 1 for image loads, stores
and atomics.
DMASK 4 Data VGPR enable mask: one to four consecutive VGPRs. Reads: defines which
components are returned.
DMASK[0] = red, DMASK[1] = green, DMASK[2] = blue, DMASK[3] = alpha.
For example: dst_sel=unity, DMASK=0110 writes green to VGPRn and blue to
VGPRn+1. D16 packs two components into one VGPR, so the example above would
return one VGPR with green in VGPRn[15:0] and blue in VGPRn[31:16].
Writes: defines which components are written with data from VGPRs (missing
components get 0). Enabled components come from consecutive VGPRs.
For example: DMASK=1001: red is in VGPRn and alpha in VGPRn+1. For D16 writes,
two components' data are packed into one VGPR, so for this example red data comes
from VGPRn[15:0] and alpha data from VGPRn[31:16].
GLC 1 Globally Coherent. Controls how reads and writes are handled by the L0 texture cache.
READ:
GLC = 0 Reads can hit on the L0 and persist across waves.
GLC = 1 Reads miss the L0 and force fetch to L2. No L0 persistence across waves.
WRITE:
GLC = 0 Writes miss the L0, write through to L2, and persist in L0 across wavefronts.
GLC = 1 Writes miss the L0, write through to L2. No persistence across wavefronts.
ATOMIC:
GLC = 0 Previous data value is not returned. No L0 persistence across wavefronts.
GLC = 1 Previous data value is returned. No L0 persistence across wavefronts.
DLC 1 Device Level Coherent. When set, accesses are forced to miss in level 1 texture cache.
SLC 1 System Level Coherent. Used in conjunction with DLC to determine L2 cache policies.
TFE 1 Texel Fault Enable for PRT (partially resident textures). When set to 1 and fetch returns
a NACK, status is written to the VGPR at DST+1 (first VGPR after all fetch-dest
VGPRs).
LWE 1 LOD Warning Enable. When set to 1, a texture fetch may return "LOD_CLAMPED = 1".
A16 1 When set, all address components are 16-bit UINT for image ops without a sampler,
or 16-bit float for image ops with a sampler. Address components are packed two per
VGPR, except for the texel offset, where one VGPR contains three 6-bit UINT offsets.
The PCF reference (for _C instructions) ignores this field and is 32-bit.
D16 1 VGPR-Data-16bit. On loads, convert data in memory to 16-bit format before storing it in
VGPRs (two 16-bit values per VGPR). For stores, convert 16-bit data in VGPRs to the
memory data format before writing to memory. Whether the data is treated as float or int
is decided by the format. Allowed only with these opcodes:
IMAGE_SAMPLE*
IMAGE_GATHER4*
IMAGE_LOAD
IMAGE_LOAD_MIP
IMAGE_STORE
IMAGE_STORE_MIP
When using 16-bit addresses, each VGPR holds a pair of addresses; the two addresses of a
pair cannot be located in different VGPRs.
The table below shows the contents of address VGPRs for the various image opcodes.
ACNT Image type Address components
1    1D Array   x slice
1    2D         x y
2    2D MSAA    x y fragid
2    2D Array   x y slice
2    3D         x y z
2    Cube       x y face_id
2    2D         x y mipid
3    3D         x y z mipid
Certain sample and gather opcodes require additional values from VGPRs beyond what is
shown. These values are: offset, bias, z-compare, and gradients.
Op     ACNT Image type Address components
sample 0    1D         x
1 1D Array x slice
1 2D x y
2 2D interlaced x y field
2 2D Array x y slice
2 3D x y z
2 Cube x y face_id
sample_l 1 1D x lod
2 2D x y lod
3 3D x y z lod
sample_cl 1 1D x clamp
2 2D x y clamp
3 3D x y z clamp
gather4 1 2D x y
2 2D interlaced x y field
2 2D Array x y slice
2 Cube x y face_id
gather4_l 2 2D x y lod
gather4_cl 2 2D x y clamp
The table below lists and briefly describes the legal suffixes for image instructions:
_B LOD BIAS 1: lod bias Add this BIAS to the LOD computed.
_CL LOD CLAMP - Clamp the computed LOD to be no larger than this value.
_D Derivative 2,4 or 6: slopes Send dx/dv, dx/dy, etc. slopes to be used in LOD computation.
_O Offset 1: offsets Send X, Y, Z integer offsets (packed into 1 Dword) to offset XYZ address.
• Body: One to four Dwords, as defined by the table: [Image Opcodes with Sampler] Address
components are X,Y,Z,W with X in VGPR_M, Y in VGPR_M+1, etc. The number of
components in "body" is the value of the ACNT field in the table, plus one.
• Data: Written from, or returned to, one to four consecutive VGPRs. The amount of data read
or written is determined by the DMASK field of the instruction.
• Reads: DMASK specifies which elements of the resource are returned to consecutive
VGPRs. The texture system reads data from memory and based on the data format
expands it to a canonical RGBA form, filling in zero or one for missing components. Then,
DMASK is applied, and only those components selected are returned to the shader.
• Writes: When writing an image object, it is only possible to write an entire element (all
components), not just individual components. The components come from consecutive
VGPRs, and the texture system fills in the value zero for any missing components of the
image’s data format; it ignores any values that are not part of the stored data format. For
example, if the DMASK=1001, the shader sends Red from VGPR_N, and Alpha from
VGPR_N+1, to the texture unit. If the image object is RGB, the texel is overwritten with Red
from the VGPR_N, Green and Blue set to zero, and Alpha from the shader ignored.
• Atomics: Image atomic operations are supported only on 32- and 64-bit-per-pixel surfaces.
The surface data format is specified in the resource constant. Atomic operations treat the
element as a single component of 32 or 64 bits. For atomic operations, DMASK is set to
the number of VGPRs (Dwords) to send to the texture unit. The legal DMASK values for
atomic image operations are listed below; no other values of DMASK are legal.
0x1 = 32-bit atomics except cmpswap.
0x3 = 32-bit atomic cmpswap.
0x3 = 64-bit atomics except cmpswap.
0xf = 64-bit atomic cmpswap.
• Atomics with Return: Data is read out of the VGPR(s), starting at VDATA, to supply to the
atomic operation. If the atomic returns a value to VGPRs, that data is returned to those
same VGPRs starting at VDATA.
D16 Instructions
Load-format and store-format instructions also come in a "d16" variant. For stores, each 32-bit
VGPR holds two 16-bit data elements that are passed to the texture unit. The texture unit
converts them to the texture format before writing to memory. For loads, data returned from the
texture unit is converted to 16 bits, and a pair of data elements is stored in each 32-bit VGPR
(LSBs first, then MSBs). Each DMASK bit represents an individual 16-bit element; so, when
DMASK=0011 for an image-load, two 16-bit components are loaded into a single 32-bit VGPR.
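The LSBs-first pairing can be sketched as follows; the helper names are illustrative:

```c
#include <stdint.h>

// Sketch of D16 packing: two 16-bit elements share one 32-bit VGPR, with
// the first element in the low half (LSBs first, then MSBs).
uint32_t d16_pack(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

uint16_t d16_lo(uint32_t vgpr) { return (uint16_t)(vgpr & 0xFFFFu); }
uint16_t d16_hi(uint32_t vgpr) { return (uint16_t)(vgpr >> 16); }
```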
A16 Instructions
The A16 instruction bit indicates that the address components are 16 bits instead of the usual
32 bits. Components are packed such that the first address component goes into the low 16 bits
([15:0]), and the next into the high 16 bits ([31:16]).
39:0 40 base address 256-byte aligned (represents bits 47:8). Also used for fmask-ptr.
51:40 12 min lod 4.8 (four uint bits, eight fraction bits) format.
98:96 3 dst_sel_x 0 = 0, 1 = 1, 4 = R, 5 = G, 6 = B, 7 = A.
101:99 3 dst_sel_y
104:102 3 dst_sel_z
107:105 3 dst_sel_w
111:108 4 base level largest mip level in the resource view. For MSAA, this should be set
to 0
115:112 4 last level smallest mip level in resource view. For MSAA, holds log2(number of
samples).
123:121 3 BC Swizzle Specifies channel ordering for border color data independent of the
T# dst_sel_*s. Internal xyzw channels get the following border color
channels as stored in memory. 0=xyzw, 1=xwyz, 2=wzyx, 3=wxyz,
4=zyxw, 5=yxwz
140:128 13 depth Depth-1 of Mip0 for a 3D map; last array slice for a 2D-array or 1D-
array or cube-map.
163:160 4 array pitch For Arrays, array pitch for quilts, encoded as trunc(log2(array
pitch))+1. values 8..15 reserved
For 3D, bit 0 indicates SRV or UAV:
0: SRV (base_array ignored, depth w.r.t. base map)
1: UAV (base_array and depth are first and last layer in view, and
w.r.t. mip level specified)
167:164 4 max mip Resource MipLevels-1. Describes the resource, as opposed to base-
level and last-level, which describe the resource view. For MSAA,
holds the number of samples.
182:180 3 perf mod scales sampler’s perf Z, perf mip, aniso-bias, lod-bias-sec
183 1 corner samples Describes how texels were generated in the resource. 0=center
mod sampled, 1 = corner sampled.
202 1 Iterate 256 Indicates that compressed tiles in this surface have been flushed out
to every 256B of the tile. Only applies to MSAA depth surfaces.
208:207 2 Max Uncompressed block size Maximum uncompressed block size used for compressed shader writes.
210:209 2 Max Compressed block size   Maximum compressed block size used for compressed shader writes.
211 1 Meta Pipe Aligned Maintains pipe alignment in metadata addressing (DCC and tiling)
214 1 Alpha is on MSB Set to 1 if the surface’s component swap is not reversed (DCC)
255:216 40 Meta Data Address Upper bits of meta-data address (DCC) [47:8]
18:16 3 aniso threshold threshold under which floor(aniso ratio) determines number of
samples and step size
19 1 mc coord trunc enables bilinear blend fraction truncation to 1 bit for motion
compensation
28 1 disable cube wrap disables seamless DX10 cubemaps, allows cubemaps to clamp
according to clamp_x and clamp_y fields
43:32 12 min lod minimum LOD in resource-view space (0.0 = T#.base_level). u4.8.
59:56 4 perf_mip defines range of lod fractions that snap to nearest mip only when
mip_filter=Linear
77:76 2 lod bias high This is bits [13:12] of the LOD bias.
83:78 6 lod bias sec bias added to computed LOD, scaled by T#.perf_modulation. s2.4.
89:88 2 z filter Volume Filter: 0=none (use XY min/mag filter), 1=point, 2=linear
91:90 2 mip filter Mip level filter: 0=none (disable mipmapping, use base-level),
1=point, 2=linear
127:126 2 border color type Opaque-black, transparent-black, white, use border color ptr.
27 16_16_UINT
72 32_32_32_UINT
173 BC3_UNORM
174 BC3_SRGB
175 BC4_UNORM
176 BC4_SNORM
177 BC5_UNORM
178 BC5_SNORM
179 BC6_UFLOAT
180 BC6_SFLOAT
181 BC7_UNORM
182 BC7_SRGB
It is the shader developer’s responsibility to avoid data hazards associated with VMEM
instructions; this includes waiting for VMEM read instructions to complete before reading the
data fetched from the TC (VMCNT and VSCNT).
• IMAGE_BVH_INTERSECT_RAY
• IMAGE_BVH64_INTERSECT_RAY
These instructions receive ray data from the VGPRs and fetch BVH (Bounding Volume
Hierarchy) from memory.
• Box BVH nodes perform four ray/box intersection tests, sort the four children based on
intersection distance, and return the child pointers and hit status.
• Triangle nodes perform one ray/triangle intersection test and return the intersection point
and triangle ID.
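The distance-sorted child ordering a box node returns can be sketched as follows; the miss encoding (a sentinel distance) and the names are assumptions for illustration, not the hardware format:

```c
#include <stdint.h>

// Sketch of the sort performed for a box BVH node: after the four
// ray/box tests, child pointers are returned ordered by intersection
// distance, with missed children (given a huge sentinel distance here)
// falling to the end.
void sort_box_children(uint32_t child[4], float dist[4])
{
    for (int i = 0; i < 3; i++)          // tiny fixed-size sort
        for (int j = i + 1; j < 4; j++)
            if (dist[j] < dist[i]) {
                float    td = dist[i];  dist[i]  = dist[j];  dist[j]  = td;
                uint32_t tc = child[i]; child[i] = child[j]; child[j] = tc;
            }
}
```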
The two instructions are identical, except that the “64” version supports a 64-bit address while
the normal version supports only a 32-bit address. Both instructions can use the “A16” instruction
field to reduce some (but not all) of the address components to 16 bits (from 32). Those
addresses are: ray_dir and ray_inv_dir.
VGPR BVH (A16=0)        BVH (A16=1)                       BVH64 (A16=0)              BVH64 (A16=1)
0    node_pointer (u32) node_pointer (u32)                node_pointer [31:0] (u32)  node_pointer [31:0] (u32)
1    ray_extent (f32)   ray_extent (f32)                  node_pointer [63:32] (u32) node_pointer [63:32] (u32)
6    ray_dir.y (f32)    [15:0] = ray_dir.z (f16),         ray_dir.x (f32)            [15:0] = ray_dir.x (f16),
                        [31:16] = ray_inv_dir.x (f16)                                [31:16] = ray_dir.y (f16)
7    ray_dir.z (f32)    [15:0] = ray_inv_dir.y (f16),     ray_dir.y (f32)            [15:0] = ray_dir.z (f16),
                        [31:16] = ray_inv_dir.z (f16)                                [31:16] = ray_inv_dir.x (f16)
Vgpr_d[4] are the destination VGPRs of the results of intersection testing. The values returned
here are different depending on the type of BVH node that was fetched. For box nodes the
results contain the 4 pointers of the children boxes in intersection time sorted order. For triangle
BVH nodes the results contain the intersection time and triangle ID of the triangle tested.
Sgpr_r[4] is the texture descriptor for the operation. The instruction is encoded with
use_128bit_resource=1.
The return-order settings of the BVH ops are ignored; instead, they use the in-order read-return
queue.
The T# used with these instructions is different from other image instructions.
Base Address       39:0   40 Base address of the BVH texture, 256-byte aligned.
Box growing amount 62:55  8  Number of ULPs to add during the ray-box test, encoded as an
unsigned integer.
Box sorting enable 63     1  Whether the ray-box test results need to be sorted.
Size               105:64 42 Number of nodes minus 1 in the BVH texture; used to enforce
bounds checking.
Barycentrics
The ray-tracing hardware is designed to support computation of barycentric coordinates directly
in hardware. This uses the “triangle_return_mode” in the table in the previous section (T#
descriptor).
ADDR 8 VGPR which holds the address. For 64-bit addresses, ADDR has the LSBs, and ADDR+1 has
the MSBs.
As an offset, a single VGPR supplies a 32-bit unsigned offset.
For FLAT_*: specifies an address.
For GLOBAL_* and SCRATCH_* when SADDR is NULL: specifies an address.
For GLOBAL_* and SCRATCH_* when SADDR is not NULL: specifies an offset.
DATA 8 VGPR which holds the first Dword of data. Instructions can use 0-4 Dwords.
VDST 8 VGPR destination for data returned to the kernel, either from LOADs or Atomics with GLC=1
(return pre-op value).
SLC 1 System Level Coherent. Used in conjunction with DLC to determine L2 cache policies.
GLC 1 Global Level Coherent. For Atomics, GLC: 1 means return pre-op value, 0 means do not return
pre-op value.
LDS 1 When set, data is moved from memory to LDS instead of to VGPRs. Available only for loads.
For Global and Scratch only; must be zero for Flat.
SADDR 7 Scalar SGPR that provides an offset address. To disable use, set this field to NULL or 0x7f
(exec_hi).
The meaning of this field differs by instruction type:
Flat: Unused.
Scratch: Use an SGPR (instead of VGPR) for the address.
Global: Use the SGPR to provide a base address; the VGPR provides a 32-bit offset per lane.
M0 16 Implied use of M0 for SCRATCH and GLOBAL only when LDS=1. Provides the LDS address-
offset.
The atomic instructions above are also available in "_X2" versions (64-bit).
9.2. Instructions
The FLAT instruction set is nearly identical to the Buffer instruction set, but without the FORMAT
reads and writes. Unlike Buffer instructions, FLAT instructions cannot return data directly to
LDS, but only to VGPRs.
FLAT instructions do not use a resource constant (V#) or sampler (S#); however, they do require
an additional register (FLAT_SCRATCH) to hold scratch-space memory address information in
case any thread's address resolves to scratch space. See the scratch section for details.
Internally, FLAT instructions are executed as both an LDS and a Buffer instruction; so, they
increment both VM_CNT/VS_CNT and LGKM_CNT and are not considered done until both
have been decremented. There is no way beforehand to determine whether a FLAT instruction
uses only LDS or texture memory space.
9.2.1. Ordering
Flat instructions can complete out of order with each other. If one flat instruction finds all of its
data in Texture cache, and the next finds all of its data in LDS, the second instruction might
complete first. If the two fetches return data to the same VGPR, the results are unknown.
9.3. Addressing
FLAT instructions support both 64- and 32-bit addressing. The address size is set using a mode
register (PTR32), and a local copy of the value is stored per wave.
The addresses for the aperture check differ in 32- and 64-bit mode; however, this is not covered
here.
64-bit addresses are stored with the LSBs in the VGPR at ADDR, and the MSBs in the VGPR at
ADDR+1.
The texture unit takes the address from the VGPRs (and SGPRs, if used) and forms it as follows.
• FLAT
a. VGPR (32 or 64 bit) supplies the complete address. SADDR must be NULL.
• Global
a. VGPR (32 or 64 bit) supplies the address. Indicated by: SADDR == NULL.
b. SGPR (64 bit) supplies an address, and a VGPR (32 bit) supplies an offset
• SCRATCH
a. VGPR (32 bit) supplies an offset. Indicated by SADDR==NULL.
Every mode above can also add the "instruction immediate offset" to the address.
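The modes above can be sketched with a simplified 64-bit model; the function and parameter names are illustrative, not the hardware's:

```c
#include <stdint.h>

// Sketch of per-lane address formation for FLAT and GLOBAL (64-bit
// addressing). Every mode also adds the instruction's immediate offset.
// FLAT: the VGPR pair supplies the complete address; SADDR must be NULL.
uint64_t flat_addr(uint64_t vgpr_addr, int64_t imm)
{
    return vgpr_addr + imm;
}

// GLOBAL: with SADDR==NULL the VGPR supplies the full address; otherwise
// the SGPR pair supplies a 64-bit base and the VGPR a 32-bit offset.
uint64_t global_addr(int saddr_null, uint64_t saddr,
                     uint64_t vgpr, int64_t imm)
{
    return saddr_null ? vgpr + imm : saddr + (uint32_t)vgpr + imm;
}
```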
9.4. Global
Global instructions are similar to Flat instructions, but the programmer must ensure that no
threads access LDS space; thus, no LDS bandwidth is used by global instructions.
These instructions also allow direct data movement to LDS from memory without going through
VGPRs.
Since these instructions do not access LDS, only VM_CNT/VS_CNT is used, not LGKM_CNT. If
a global instruction does attempt to access LDS, the instruction returns MEM_VIOL.
9.5. Scratch
Scratch instructions are similar to Flat, but the programmer must ensure that no threads access
LDS space, and the memory space is swizzled. Thus, no LDS bandwidth is used by scratch
instructions.
Scratch instructions also support multi-Dword access and mis-aligned access (although mis-
aligned is slower).
The size of the address component depends on the ADDRESS_MODE: 32-bit or 64-bit
pointers. The VGPR offset is 32 bits.
These instructions also allow direct data movement to LDS from memory without going through
VGPRs.
Since these instructions do not access LDS, only VM_CNT/VS_CNT is used, not LGKM_CNT. It
is not possible for a Scratch instruction to access LDS; thus, no error or aperture checking is
done.
The policy for threads with bad addresses is: writes outside this range do not write a value, and
reads return zero.
Addressing errors from either LDS or texture are returned on their respective "instruction done"
busses as MEM_VIOL. This sets the wave’s MEM_VIOL TrapStatus bit and causes an
exception (trap) if the corresponding EXCPEN bit is set.
9.7. Data
FLAT instructions can use zero to four consecutive Dwords of data in VGPRs and/or memory.
The DATA field determines which VGPR(s) supply source data (if any), and the VDST VGPRs
hold return data (if any). No data-format conversion is done.
“D16” instructions use only 16 bits of the VGPR instead of the full 32 bits. “D16_HI” instructions
read or write only the high 16 bits, while “D16” uses the low 16 bits. Scratch and Global D16 load
instructions with LDS=1 write the entire 32 bits of LDS.
The wavefront must supply the scratch size and offset (for space allocated to this wave) with
every FLAT request. Prior to issuing any FLAT or Scratch instructions, the shader program must
initialize the FLAT_SCRATCH register with the base address of the scratch space allocated to
this wave.
FLAT_SCRATCH is a 64-bit, byte address. The shader composes the value by adding together
two separate values: the base address, which can be passed in via an initialized SGPR, or
perhaps through a constant buffer, and the per-wave allocation offset (also initialized in an
SGPR).
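A minimal sketch of this composition, with names assumed for illustration:

```c
#include <stdint.h>

// Sketch: FLAT_SCRATCH is a 64-bit byte address composed by the shader
// from a scratch base (e.g. passed in via SGPRs or a constant buffer)
// plus this wave's per-wave allocation offset.
uint64_t compose_flat_scratch(uint64_t scratch_base, uint64_t wave_offset)
{
    return scratch_base + wave_offset;
}

// The 64-bit value occupies an aligned SGPR pair: low Dword, high Dword.
uint32_t fs_lo(uint64_t fs) { return (uint32_t)fs; }
uint32_t fs_hi(uint64_t fs) { return (uint32_t)(fs >> 32); }
```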
10.1. Overview
The figure below shows how the LDS is integrated into the memory model of AMD GPUs using
OpenCL.
Physically located on-chip, directly adjacent to the ALUs, the LDS is approximately one order of
magnitude faster than global memory (assuming no bank conflicts).
There is 128 kB of memory per workgroup processor, split into 64 banks of Dword-wide RAMs.
These 64 banks are further sub-divided into two sets of 32 banks each, where 32 of the banks
are affiliated with one pair of SIMD32s, and the other 32 banks are affiliated with the other pair
of SIMD32s within the WGP. Each bank is a 512x32 two-port RAM (1R/1W per clock cycle).
Dwords are placed in the banks serially, but all banks can execute a store or load
simultaneously. One work-group can request up to 64 kB of memory.
The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs,
but also through simultaneous access to its memory banks. Thus, it is possible to concurrently
execute 32 write or read instructions, each nominally 32-bits; extended instructions,
read2/write2, can be 64-bits each. If, however, more than one access attempt is made to the
same bank at the same time, a bank conflict occurs. In this case, for indexed and atomic
operations, the hardware is designed to prevent the attempted concurrent accesses to the same
bank by turning them into serial accesses. This decreases the effective bandwidth of the LDS.
For increased throughput (optimal efficiency), therefore, it is important to avoid bank conflicts. A
knowledge of request scheduling and address mapping is key to achieving this.
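Under a simple model where the bank of a byte address is (addr/4) mod 32, the worst-case serialization for one 32-lane access can be estimated as follows. This is an illustrative model, not the exact hardware arbitration:

```c
#include <stdint.h>

// Estimate LDS bank-conflict serialization for 32 lanes: each bank is
// Dword-wide, so a byte address maps to bank (addr >> 2) % 32. Lanes that
// hit the same Dword broadcast and do not conflict; the cost is the
// largest count of *distinct* Dwords mapping to one bank.
int lds_conflict_cycles(const uint32_t addr[32])
{
    int worst = 0;
    for (int b = 0; b < 32; b++) {
        uint32_t seen[32];
        int n = 0;
        for (int l = 0; l < 32; l++) {
            if (((addr[l] >> 2) % 32) != (uint32_t)b) continue;
            uint32_t dw = addr[l] >> 2;
            int dup = 0;
            for (int k = 0; k < n; k++)
                if (seen[k] == dw) dup = 1;
            if (!dup) seen[n++] = dw;
        }
        if (n > worst) worst = n;
    }
    return worst;
}
```

Stride-1 Dword access is conflict-free; a stride of 128 bytes puts every lane in bank 0 and fully serializes.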
Data can be loaded into LDS either by transferring it from VGPRs to LDS using "DS"
instructions, or by loading in from memory. When loading from memory, the data may be loaded
into VGPRs first or for some types of loads it may be loaded directly into LDS from memory. To
store data from LDS to global memory, data is read from LDS and placed into the workitem’s
VGPRs, then written out to global memory. To make effective use of the LDS, a kernel must
perform many operations on what is transferred between global memory and LDS.
LDS atomics are performed in the LDS hardware. (Thus, although ALUs are not directly used for
these operations, latency is incurred by the LDS executing this function.)
In CU mode, waves are allocated to two SIMD32s which share a texture memory unit, and are
allocated LDS space which is all local to (on the same side as) those SIMDs. This mode can
provide higher LDS memory bandwidth than WGP mode.
In WGP mode, the waves are distributed over all 4 SIMD32s and LDS space may be allocated
anywhere within the LDS memory. Waves may access data on the "near" or "far" side of LDS
equally, but performance may be lower in some cases. This mode provides more ALU and
texture memory bandwidth to a single workgroup (of at least 4 waves).
• Direct Read – reads a single dword from LDS and broadcasts the data as input to a vector
ALU op.
• Indexed Read/write and Atomic ops – read/write address comes from a VGPR and data
to/from VGPR.
◦ LDS ops require up to three inputs: two data plus one address, and an immediate return VGPR.
• Parameter Interpolation – similar to direct read but with specific addressing.
◦ Reads up to 2 parameters (P0, P1-P0) or (P2-P0) from one attribute to be supplied to a
muladd.
◦ Also supplies individual parameter read for general interpolation (or select I,J=0.0)
LDS Direct reads occur in vector ALU (VALU) instructions and allow the LDS to supply a single
DWORD value which is broadcast to all threads in the wavefront and is used as the SRC0 input
to the ALU operations. A VALU instruction indicates that its input is to be supplied by LDS by
using LDS_DIRECT as the SRC0 field.
The LDS address and data-type of the data to be read from LDS comes from the M0 register:
Pixel shaders use LDS to read vertex parameter values; the pixel shader then interpolates them
to find the per-pixel parameter values. LDS parameter reads occur when the following opcodes
are used.
The typical parameter interpolation operation involves reading three parameters: P0, P10, and
P20, and using the two barycentric coordinates, I and J, to determine the final per-pixel value:
Parameter interpolation instructions indicate the parameter attribute number (0 to 32) and the
component number (0=x, 1=y, 2=z and 3=w).
OP 2 Opcode:
0: v_interp_p1_f32 VDST = P10 * VSRC + P0
1: v_interp_p2_f32 VDST = P20 * VSRC + VDST
2: v_interp_mov_f32 VDST = (P0, P10 or P20 selected by VSRC[1:0])
P0, P10 and P20 are parameter values read from LDS
VSRC 8 Source VGPR supplies interpolation "I" or "J" value. For OP==v_interp_mov_f32: 0=P10,
1=P20, 2=P0.
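The two-instruction sequence above can be sketched in scalar form; the function name is illustrative:

```c
// Sketch of the v_interp_p1_f32 / v_interp_p2_f32 pair: together they
// compute P0 + I*(P1-P0) + J*(P2-P0), where P10 = P1-P0 and P20 = P2-P0
// are the parameter deltas read from LDS.
float interp_param(float p0, float p10, float p20, float i, float j)
{
    float tmp = p10 * i + p0;   // v_interp_p1_f32: VDST = P10 * VSRC + P0
    return p20 * j + tmp;       // v_interp_p2_f32: VDST = P20 * VSRC + VDST
}
```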
Parameter interpolation and parameter move instructions must initialize the M0 register before
using it. The lds_param_offset[15:0] is an address offset from the beginning of LDS storage
allocated to this wavefront to where parameters begin in LDS memory for this wavefront.
The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that
this quad begins a new primitive, a zero indicates it uses the same primitive as the previous
quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and
so it is not included in the mask.
The above parameter interpolation opcodes use the VINTRP microcode format, but for
interpolation on 16-bit data, the VOP3 format is used. The opcodes supported are:
V_INTERP_P1LL_F16 d.f32 = lds.f16 * vgpr.f32 + lds.f16 attr_word selects LDS high or low 16bits. “LL” is
for “two LDS arguments.”
V_INTERP_P2_F16 d.f16 = lds.f16 * vgpr.f32 + vgpr.f32 Final computation. attr_word selects LDS high or
low 16 bits. The result is written to the 16 LSBs of
the dest VGPR.
Indexed and atomic operations supply a unique address per work-item from the VGPRs to the
LDS, and supply or return unique data per work-item back to VGPRs. Due to the internal
banked structure of LDS, operations can complete in as little as one cycle (for wave32; two
cycles for wave64), or take as many as 64 cycles, depending upon the number of bank conflicts
(addresses that map to the same memory bank).
Indexed operations are simple LDS load and store operations that read data from, and return
data to, VGPRs.
Atomic operations are arithmetic operations that combine data from VGPRs and data in LDS,
and write the result back to LDS. Atomic operations have the option of returning the LDS "pre-
op" value to VGPRs.
The table below lists and briefly describes the LDS instruction fields.
OP 7 LDS opcode.
OFFSET0 8 Immediate offset, in bytes. Instructions with one address combine the offset fields into a single
16-bit unsigned offset: {offset1, offset0}. Instructions with two addresses (for example: READ2)
use the offsets separately as two 8-bit unsigned offsets.
OFFSET1 8 See OFFSET0.
VDST 8 VGPR to which result is written: either from LDS-load or atomic return value.
The M0 register is not used for most LDS-indexed operations: only the "ADD_TID" instructions
read M0, and for these it represents a byte address.
DS_READ_{B32,B64,B96,B128,U8,I8,U16,I16} Read one value per thread; sign extend to Dword, if signed.
DS_BPERMUTE_B32 Backward permute. Does not actually write any LDS memory.
LDS[thread_id] = src0
where thread_id is 0..63, and returnVal = LDS[dst].
Note that LDS_ADDR1 is used only for READ2*, WRITE2*, and WREXCHG2*.
The address comes from VGPR, and both ADDR and InstrOffset are byte addresses.
At the time of wavefront creation, LDS_BASE is assigned to the physical LDS region owned by
this wavefront or work-group.
Specify only one address by setting both offsets to the same value. This causes only one read
or write to occur and uses only the first DATA0.
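A sketch of the two-address calculation, assuming (as in common DS_*2 semantics, not stated above) that the offsets are scaled by the element size (4 bytes for B32, 8 for B64):

```c
#include <stdint.h>

// Sketch of DS_READ2/WRITE2 addressing: each 8-bit offset is scaled by
// the element size and added to the byte address supplied in ADDR.
// Setting off0 == off1 yields a single address (only DATA0 is used).
void read2_addrs(uint32_t addr, uint32_t off0, uint32_t off1,
                 uint32_t elem_size, uint32_t out[2])
{
    out[0] = addr + off0 * elem_size;
    out[1] = addr + off1 * elem_size;
}
```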
DS_{READ,WRITE}_ADD_TID Addressing
The "ADD_TID" (add thread-id) is a separate form where the base address for the instruction is
common to all threads, but then each thread has a fixed offset added in based on its thread-ID
within the wave. This allows a convenient way to quickly transfer data between VGPRs and LDS
without having to use a VGPR to supply an address.
ADDR is a Dword address. VGPRs 0, 1 and dst are VGPR pairs for 64-bit (double) data.
VGPR data sources can only be VGPRs or constant values, not SGPRs.
Note that in wave64 mode the permute operates only across 32 lanes at a time of each half of a
wave64. In other words, it executes as if were two independent wave32’s. Each half-wave can
use indices in the range 0-31 to reference lanes in that same half-wave.
These instructions use the LDS hardware but do not use any memory storage, and may be
used by waves which have not allocated any LDS space. The instructions supply a data value
from VGPRs and an index value per lane.
The EXEC mask is honored for both reading the source and writing the destination. Index
values out of range will wrap around (only index bits [6:2] are used, the other bits of the index
are ignored). Reading from disabled lanes returns zero.
In the instruction word: VDST is the dest VGPR, ADDR is the index VGPR, and DATA0 is the
source data VGPR. Note that index values are in bytes (so multiply by 4), and have the ‘offset0’
field added to them before use.
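The description above can be modeled for a wave32 as follows; the array-based representation and names are illustrative:

```c
#include <stdint.h>

// Wave32 model of DS_BPERMUTE_B32 ("backward permute"): each lane reads
// the source value from the lane selected by its index. Index values are
// byte addresses (lane*4) plus offset0; only bits [6:2] are used, so
// out-of-range indices wrap. EXEC gates both source reads (disabled
// source lanes return 0) and destination writes.
void ds_bpermute_b32(uint32_t dst[32], const uint32_t index[32],
                     const uint32_t src[32], uint32_t exec,
                     uint32_t offset0)
{
    for (int lane = 0; lane < 32; lane++) {
        if (!((exec >> lane) & 1)) continue;            // lane disabled
        uint32_t src_lane = ((index[lane] + offset0) >> 2) & 0x1F;
        dst[lane] = ((exec >> src_lane) & 1) ? src[src_lane] : 0;
    }
}
```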
M0 is used for:
• Vertex Position
• Vertex Parameter
• Pixel color
• Pixel depth (Z)
• Primitive Data
VM 1 Valid Mask. When set to 1, this indicates that the EXEC mask represents the
valid-mask for this wavefront. It can be sent multiple times per shader (the final
value is used), but must be sent at least once per pixel shader.
DONE 1 This is the final pixel shader or vertex-position export of the program. Used only
for pixel and position exports. Set to zero for parameters.
COMPR 1 Compressed data. When set, indicates that the data being exported is 16-bits
per component rather than the usual 32-bit.
VSRC0 8
11.2. Operations
Every pixel shader must have at least one export instruction. The last export instruction
executed must have the DONE bit set to one.
The EXEC mask is applied to all exports. Only pixels with the corresponding EXEC bit set to 1
export data to the output buffer. Results from multiple exports are accumulated in the output
buffer.
At least one export must have the VM bit set to 1. This export, in addition to copying data to the
color or depth output buffer, also informs the color buffer which pixels are valid and which have
been discarded. The value of the EXEC mask communicates the pixel valid mask. If multiple
exports are sent with VM set to 1, the mask from the final export is used. If the shader program
wants to only update the valid mask but not send any new data, the program can do an export
to the NULL target.
Every vertex shader must output at least one position vector (x, y, z; w is optional) to the POS0
target. The last position export must have the DONE bit set to 1. A vertex shader can export
zero or more parameters. For optimized performance, it is recommended to output all position
When access to the bus is granted, the EXEC mask is read and the VGPR data sent out. After
the last of the VGPR data is sent, the EXPCNT counter is decremented by 1.
Use S_WAITCNT on EXPCNT to prevent the shader program from overwriting EXEC or the
VGPRs holding the data to be exported before the export operation has completed.
Multiple export instructions can be outstanding at one time. Exports of the same type (for
example: position) are completed in order, but exports of different types can be completed out of
order.
If the STATUS register’s SKIP_EXPORT bit is set to one, the hardware treats all EXPORT
instructions as if they were NOPs.
If an instruction has two suffixes (for example, _I32_F32), the first suffix indicates the destination
type, the second the source type.
• D = destination
• U = unsigned integer
• S = source
• SCC = scalar condition code
• I = signed integer
• B = bitfield
Note: Rounding and Denormal modes apply to all floating-point operations unless otherwise
specified in the instruction description.
Instructions in this format may use a 32-bit literal constant which occurs immediately after the
instruction.
1 S_SUB_U32 Subtract the second unsigned integer from the first with
carry-out.
3 S_SUB_I32 Subtract the second signed integer from the first; SCC is set
on signed overflow.
5 S_SUBB_U32 Subtract the second unsigned integer from the first with
carry-in and carry-out.
D = S0 & S1;
SCC = (D != 0).
D = S0 & S1;
SCC = (D != 0).
D = S0 | S1;
SCC = (D != 0).
D = S0 | S1;
SCC = (D != 0).
D = S0 ^ S1;
SCC = (D != 0).
D = S0 ^ S1;
SCC = (D != 0).
D = S0 & ~S1;
SCC = (D != 0).
D = S0 & ~S1;
SCC = (D != 0).
D = S0 | ~S1;
SCC = (D != 0).
D = S0 | ~S1;
SCC = (D != 0).
D = ~(S0 | S1);
SCC = (D != 0).
D = ~(S0 | S1);
SCC = (D != 0).
D = ~(S0 ^ S1);
SCC = (D != 0).
D = ~(S0 ^ S1);
SCC = (D != 0).
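All of the scalar bitwise forms above share one pattern: compute D, then set SCC when D is nonzero. A minimal Python sketch of the 32-bit variants (the function and operation names are illustrative):

```python
MASK32 = 0xFFFFFFFF

def salu_bitwise_b32(op, s0, s1):
    """Compute D = op(S0, S1) and SCC = (D != 0) for the 32-bit forms."""
    results = {
        "and":   s0 & s1,
        "or":    s0 | s1,
        "xor":   s0 ^ s1,
        "andn2": s0 & ~s1,
        "orn2":  s0 | ~s1,
        "nor":   ~(s0 | s1),
        "xnor":  ~(s0 ^ s1),
    }
    d = results[op] & MASK32   # wrap to 32 bits
    return d, int(d != 0)     # (D, SCC)
```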
Functional examples:
53 S_MUL_HI_U32 Multiply two unsigned integers and store the high 32 bits.
54 S_MUL_HI_I32 Multiply two signed integers and store the high 32 bits.
Instructions in this format may not use a 32-bit literal constant which occurs immediately after
the instruction.
D.i32 = signext(SIMM16[15:0]).
if(SCC)
D.i32 = signext(SIMM16[15:0]);
endif.
int32 tmp = D.i32; // save value so we can check sign bits for
overflow later.
D.i32 = D.i32 + signext(SIMM16[15:0]);
SCC = (tmp[31] == SIMM16[15] && tmp[31] != D.i32[31]). // signed
overflow.
hardware-reg = S0.u.
21 S_SETREG_IMM32_B32 Write some or all of the LSBs of IMM32 into a hardware register;
this instruction requires a 32-bit literal constant.
hardware-reg = LITERAL.
22 S_CALL_B64 Implements a short call, where the return address (the next
instruction after the S_CALL_B64) is saved to D. Long calls
should consider S_SWAPPC_B64 instead.
D.u64 = PC + 4;
PC = PC + signext(SIMM16 * 4) + 4.
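The return-address and target arithmetic above can be checked with a small sketch (PC values are byte addresses; the helper name is illustrative):

```python
def s_call_b64(pc, simm16):
    """Sketch of S_CALL_B64: D = PC + 4, PC = PC + signext(SIMM16 * 4) + 4."""
    if simm16 & 0x8000:          # sign-extend the 16-bit immediate
        simm16 -= 0x10000
    return pc + 4, pc + simm16 * 4 + 4   # (return address, call target)
```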
23 S_WAITCNT_VSCNT Wait for the counts of outstanding vector store events -- vector
memory stores and atomics that DO NOT return data -- to be at or
below the specified level. This counter is not used in
'all-in-order' mode.
if(EXEC[63:0] == 0)
// no passes, skip entire loop
jump LABEL
elif(EXEC_LO == 0)
// execute high pass only
D0 = EXEC_LO
else
// execute low pass first, either running both passes or
running low pass only
D0 = EXEC_HI
EXEC_HI = 0
endif.
Example:
s_subvector_loop_begin s0, SKIP_ALL
LOOP_START:
// instructions
// ...
LOOP_END:
s_subvector_loop_end s0, LOOP_START
SKIP_ALL:
if(EXEC_HI != 0)
EXEC_LO = D0
elif(S0 == 0)
// done: executed low pass and skip high pass
nop
else
// execute second pass of two-pass mode
EXEC_HI = D0
D0 = EXEC_LO
EXEC_LO = 0
jump LABEL
endif.
Instructions in this format may use a 32-bit literal constant which occurs immediately after the
instruction.
D.u = S0.u.
D.u64 = S0.u64.
if(SCC) then
D.u = S0.u;
endif.
if(SCC) then
D.u64 = S0.u64;
endif.
D = ~S0;
SCC = (D != 0).
D = ~S0;
SCC = (D != 0).
9 S_WQM_B32 Computes whole quad mode for an active/valid mask. If any pixel
in a quad is active, all pixels of the quad are marked active.
10 S_WQM_B64 Computes whole quad mode for an active/valid mask. If any pixel
in a quad is active, all pixels of the quad are marked active.
D.u[31:0] = S0.u[0:31].
D.u64[63:0] = S0.u64[0:63].
D = 0;
for i in 0 ... opcode_size_in_bits - 1 do
D += (S0[i] == 0 ? 1 : 0)
endfor;
SCC = (D != 0).
Functional examples:
S_BCNT0_I32_B32(0x00000000) => 32
S_BCNT0_I32_B32(0xcccccccc) => 16
S_BCNT0_I32_B32(0xffffffff) => 0
D = 0;
for i in 0 ... opcode_size_in_bits - 1 do
D += (S0[i] == 0 ? 1 : 0)
endfor;
SCC = (D != 0).
Functional examples:
S_BCNT0_I32_B32(0x00000000) => 32
S_BCNT0_I32_B32(0xcccccccc) => 16
S_BCNT0_I32_B32(0xffffffff) => 0
D = 0;
for i in 0 ... opcode_size_in_bits - 1 do
D += (S0[i] == 1 ? 1 : 0)
endfor;
SCC = (D != 0).
Functional examples:
S_BCNT1_I32_B32(0x00000000) => 0
S_BCNT1_I32_B32(0xcccccccc) => 16
S_BCNT1_I32_B32(0xffffffff) => 32
D = 0;
for i in 0 ... opcode_size_in_bits - 1 do
D += (S0[i] == 1 ? 1 : 0)
endfor;
SCC = (D != 0).
Functional examples:
S_BCNT1_I32_B32(0x00000000) => 0
S_BCNT1_I32_B32(0xcccccccc) => 16
S_BCNT1_I32_B32(0xffffffff) => 32
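The bit-count pseudocode above reduces to a population count; a Python sketch of the 32-bit forms that reproduces the functional examples:

```python
def s_bcnt1_i32_b32(s0):
    """Count set bits in a 32-bit value; SCC = (D != 0)."""
    d = bin(s0 & 0xFFFFFFFF).count("1")
    return d, int(d != 0)

def s_bcnt0_i32_b32(s0):
    """Count clear bits in a 32-bit value; SCC = (D != 0)."""
    d = 32 - bin(s0 & 0xFFFFFFFF).count("1")
    return d, int(d != 0)
```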
17 S_FF0_I32_B32 Returns the bit position of the first zero from the LSB (least
significant bit), or -1 if there are no zeros.
Functional examples:
S_FF0_I32_B32(0xaaaaaaaa) => 0
S_FF0_I32_B32(0x55555555) => 1
S_FF0_I32_B32(0x00000000) => 0
S_FF0_I32_B32(0xffffffff) => 0xffffffff
S_FF0_I32_B32(0xfffeffff) => 16
18 S_FF0_I32_B64 Returns the bit position of the first zero from the LSB (least
significant bit), or -1 if there are no zeros.
Functional examples:
S_FF0_I32_B32(0xaaaaaaaa) => 0
S_FF0_I32_B32(0x55555555) => 1
S_FF0_I32_B32(0x00000000) => 0
S_FF0_I32_B32(0xffffffff) => 0xffffffff
S_FF0_I32_B32(0xfffeffff) => 16
19 S_FF1_I32_B32 Returns the bit position of the first one from the LSB (least
significant bit), or -1 if there are no ones.
Functional examples:
S_FF1_I32_B32(0xaaaaaaaa) => 1
S_FF1_I32_B32(0x55555555) => 0
S_FF1_I32_B32(0x00000000) => 0xffffffff
S_FF1_I32_B32(0xffffffff) => 0
S_FF1_I32_B32(0x00010000) => 16
20 S_FF1_I32_B64 Returns the bit position of the first one from the LSB (least
significant bit), or -1 if there are no ones.
Functional examples:
S_FF1_I32_B32(0xaaaaaaaa) => 1
S_FF1_I32_B32(0x55555555) => 0
S_FF1_I32_B32(0x00000000) => 0xffffffff
S_FF1_I32_B32(0xffffffff) => 0
S_FF1_I32_B32(0x00010000) => 16
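The find-first-bit semantics above can be sketched directly from the functional examples (0xffffffff is the -1 "not found" result interpreted as unsigned):

```python
def s_ff1_i32_b32(s0):
    """Bit position of the first one from the LSB; 0xffffffff if none."""
    for i in range(32):
        if (s0 >> i) & 1:
            return i
    return 0xFFFFFFFF

def s_ff0_i32_b32(s0):
    """Bit position of the first zero from the LSB; 0xffffffff if none."""
    for i in range(32):
        if not (s0 >> i) & 1:
            return i
    return 0xFFFFFFFF
```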
21 S_FLBIT_I32_B32 Counts how many zeros before the first one starting from the MSB
(most significant bit). Returns -1 if there are no ones.
Functional examples:
22 S_FLBIT_I32_B64 Counts how many zeros before the first one starting from the MSB
(most significant bit). Returns -1 if there are no ones.
Functional examples:
23 S_FLBIT_I32 Counts how many bits in a row (from MSB to LSB) are the same as
the sign bit. Returns -1 if all bits are the same.
Functional examples:
24 S_FLBIT_I32_I64 Counts how many bits in a row (from MSB to LSB) are the same as
the sign bit. Returns -1 if all bits are the same.
Functional examples:
D.i = signext(S0.i[7:0]).
D.i = signext(S0.i[15:0]).
D.u[S0.u[4:0]] = 0.
D.u64[S0.u[5:0]] = 0.
D.u[S0.u[4:0]] = 1.
D.u64[S0.u[5:0]] = 1.
D.u64 = PC + 4.
PC = S0.u64.
D.u64 = PC + 4;
PC = S0.u64.
PRIV = 0;
PC = S0.u64.
36 S_AND_SAVEEXEC_B64 Bitwise AND with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = S0.u64 & EXEC;
SCC = (EXEC != 0).
37 S_OR_SAVEEXEC_B64 Bitwise OR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = S0.u64 | EXEC;
SCC = (EXEC != 0).
38 S_XOR_SAVEEXEC_B64 Bitwise XOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = S0.u64 ^ EXEC;
SCC = (EXEC != 0).
39 S_ANDN2_SAVEEXEC_B64 Bitwise ANDN2 with EXEC mask. The original EXEC mask is saved
to the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = S0.u64 & ~EXEC;
SCC = (EXEC != 0).
40 S_ORN2_SAVEEXEC_B64 Bitwise ORN2 with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = S0.u64 | ~EXEC;
SCC = (EXEC != 0).
41 S_NAND_SAVEEXEC_B64 Bitwise NAND with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = ~(S0.u64 & EXEC);
SCC = (EXEC != 0).
42 S_NOR_SAVEEXEC_B64 Bitwise NOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = ~(S0.u64 | EXEC);
SCC = (EXEC != 0).
43 S_XNOR_SAVEEXEC_B64 Bitwise XNOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = ~(S0.u64 ^ EXEC);
SCC = (EXEC != 0).
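Every SAVEEXEC variant above follows the same three steps: copy the old EXEC to the destination, apply the bitwise operation to EXEC, then set SCC. A sketch of the AND form (the other variants differ only in the operator):

```python
def s_and_saveexec_b64(exec_mask, s0):
    """Sketch: D = old EXEC; EXEC = S0 & EXEC; SCC = (EXEC != 0)."""
    d = exec_mask                 # original EXEC goes to the destination
    exec_mask = s0 & exec_mask    # the bitwise op (AND here) updates EXEC
    return d, exec_mask, int(exec_mask != 0)
```

This is the typical entry sequence for structured control flow: the saved mask is restored at the join point.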
D = 0;
for i in 0 ... (opcode_size_in_bits / 4) - 1 do
D[i] = (S0[i * 4 + 3:i * 4] != 0);
endfor;
SCC = (D != 0).
D = 0;
for i in 0 ... (opcode_size_in_bits / 4) - 1 do
D[i] = (S0[i * 4 + 3:i * 4] != 0);
endfor;
SCC = (D != 0).
SGPR[D.addr].u32 = SGPR[S0.addr+M0[31:0]].u32
47 S_MOVRELS_B64 Move from a relative source address. The index in M0.u must be
even for this operation.
SGPR[D.addr].u64 = SGPR[S0.addr+M0[31:0]].u64
SGPR[D.addr+M0[31:0]].u32 = SGPR[S0.addr].u32
SGPR[D.addr+M0[31:0]].u64 = SGPR[S0.addr].u64
Functional examples:
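The relative moves above index the scalar register file through M0; a sketch with a plain Python list standing in for SGPRs (the helper names are illustrative):

```python
def s_movrels_b32(sgpr, d_addr, s0_addr, m0):
    """Relative source: SGPR[D] = SGPR[S0 + M0]."""
    sgpr[d_addr] = sgpr[s0_addr + m0]

def s_movreld_b32(sgpr, d_addr, s0_addr, m0):
    """Relative destination: SGPR[D + M0] = SGPR[S0]."""
    sgpr[d_addr + m0] = sgpr[s0_addr]
```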
55 S_ANDN1_SAVEEXEC_B64 Bitwise ANDN1 with EXEC mask. The original EXEC mask is saved
to the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = ~S0.u64 & EXEC;
SCC = (EXEC != 0).
56 S_ORN1_SAVEEXEC_B64 Bitwise ORN1 with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u64 = EXEC;
EXEC = ~S0.u64 | EXEC;
SCC = (EXEC != 0).
57 S_ANDN1_WREXEC_B64 Bitwise ANDN1 with EXEC mask. Unlike the SAVEEXEC series of
opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this
instruction. This instruction is intended to accelerate waterfalling.
58 S_ANDN2_WREXEC_B64 Bitwise ANDN2 with EXEC mask. Unlike the SAVEEXEC series of
opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this
instruction. This instruction is intended to accelerate waterfalling.
60 S_AND_SAVEEXEC_B32 Bitwise AND with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = S0.u32 & EXEC_LO;
SCC = (EXEC_LO != 0).
61 S_OR_SAVEEXEC_B32 Bitwise OR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = S0.u32 | EXEC_LO;
SCC = (EXEC_LO != 0).
62 S_XOR_SAVEEXEC_B32 Bitwise XOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = S0.u32 ^ EXEC_LO;
SCC = (EXEC_LO != 0).
63 S_ANDN2_SAVEEXEC_B32 Bitwise ANDN2 with EXEC mask. The original EXEC mask is saved
to the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = S0.u32 & ~EXEC_LO;
SCC = (EXEC_LO != 0).
64 S_ORN2_SAVEEXEC_B32 Bitwise ORN2 with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = S0.u32 | ~EXEC_LO;
SCC = (EXEC_LO != 0).
65 S_NAND_SAVEEXEC_B32 Bitwise NAND with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = ~(S0.u32 & EXEC_LO);
SCC = (EXEC_LO != 0).
66 S_NOR_SAVEEXEC_B32 Bitwise NOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = ~(S0.u32 | EXEC_LO);
SCC = (EXEC_LO != 0).
67 S_XNOR_SAVEEXEC_B32 Bitwise XNOR with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = ~(S0.u32 ^ EXEC_LO);
SCC = (EXEC_LO != 0).
68 S_ANDN1_SAVEEXEC_B32 Bitwise ANDN1 with EXEC mask. The original EXEC mask is saved
to the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = ~S0.u32 & EXEC_LO;
SCC = (EXEC_LO != 0).
69 S_ORN1_SAVEEXEC_B32 Bitwise ORN1 with EXEC mask. The original EXEC mask is saved to
the destination SGPRs before the bitwise operation is performed.
D.u32 = EXEC_LO;
EXEC_LO = ~S0.u32 | EXEC_LO;
SCC = (EXEC_LO != 0).
70 S_ANDN1_WREXEC_B32 Bitwise ANDN1 with EXEC mask. Unlike the SAVEEXEC series of
opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this
instruction. This instruction is intended to accelerate waterfalling.
71 S_ANDN2_WREXEC_B32 Bitwise ANDN2 with EXEC mask. Unlike the SAVEEXEC series of
opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this
instruction. This instruction is intended to accelerate waterfalling. See
S_ANDN2_WREXEC_B64 for example code.
SGPR[D.addr+M0[25:16]].u32 = SGPR[S0.addr+M0[9:0]].u32
Instructions in this format may use a 32-bit literal constant which occurs immediately after the
instruction.
0 S_CMP_EQ_I32 Compare two integers for equality. Note that S_CMP_EQ_I32 and
S_CMP_EQ_U32 are identical opcodes, but both are provided for
symmetry.
1 S_CMP_LG_I32 Compare two integers for inequality. Note that S_CMP_LG_I32 and
S_CMP_LG_U32 are identical opcodes, but both are provided for
symmetry.
6 S_CMP_EQ_U32 Compare two integers for equality. Note that S_CMP_EQ_I32 and
S_CMP_EQ_U32 are identical opcodes, but both are provided for
symmetry.
7 S_CMP_LG_U32 Compare two integers for inequality. Note that S_CMP_LG_I32 and
S_CMP_LG_U32 are identical opcodes, but both are provided for
symmetry.
Examples:
s_nop 0 // Wait 1 cycle.
s_nop 0xf // Wait 16 cycles.
Examples:
s_branch label // Set SIMM16 = +4 = 0x0004
s_nop 0 // 4 bytes
label:
s_nop 0 // 4 bytes
s_branch label // Set SIMM16 = -8 = 0xfff8
3 S_WAKEUP Allow a wave to 'ping' all the other waves in its threadgroup to
force them to wake up early from an S_SLEEP instruction.
The ping is ignored if the waves are not sleeping. This allows
for efficient polling on a memory location. The waves which
are polling can sit in a long S_SLEEP between memory reads, but
the wave which writes the value can tell them all to wake up
early now that the data is available. This is useful for
fBarrier implementations (speedup). This method is also safe
from races because if any wave misses the ping, everything still
works fine (waves which missed it just complete their S_SLEEP).
if(SCC == 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
if(SCC == 1) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
if(VCC == 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
if(VCC != 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
if(EXEC == 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
if(EXEC != 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
13 S_SETHALT S_SETHALT can set/clear the HALT or FATAL_HALT status bits. The
particular status bit is chosen by halt type control as
indicated in SIMM16[2]; 0 = HALT bit select; 1 = FATAL_HALT bit
select.
Examples:
s_sleep 0 // Wait for 0 clocks.
s_sleep 1 // Wait for 1-64 clocks.
s_sleep 2 // Wait for 65-128 clocks.
17 S_SENDMSGHALT Send a message and then HALT the wavefront; see S_SENDMSG for
details.
TrapID = SIMM16[7:0];
Wait for all instructions to complete;
{TTMP1, TTMP0} = {1'h0, PCRewind[5:0], HT[0], TrapID[7:0],
PC[47:0]};
PC = TBA; // trap base address
PRIV = 1.
23 S_CBRANCH_CDBGSYS Perform a conditional short jump when the system debug flag is
set.
if(conditional_debug_system != 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
24 S_CBRANCH_CDBGUSER Perform a conditional short jump when the user debug flag is
set.
if(conditional_debug_user != 0) then
PC = PC + signext(SIMM16 * 4) + 4;
endif.
25 S_CBRANCH_CDBGSYS_OR_USER Perform a conditional short jump when either the system
or the user debug flag is set.
26 S_CBRANCH_CDBGSYS_AND_USER Perform a conditional short jump when both the system
and the user debug flags are set.
27 S_ENDPGM_SAVED End of program; signal that a wave has been saved by the
context-switch trap handler and terminate wavefront. The
hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
before executing this instruction. See S_ENDPGM for additional
variants.
30 S_ENDPGM_ORDERED_PS_DONE End of program; signal that a wave has exited its POPS
critical section and terminate wavefront. The hardware implicitly
executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this
instruction. This instruction is an optimization that combines
S_SENDMSG(MSG_ORDERED_PS_DONE) and S_ENDPGM; there may be cases
where you still need to send the message separately, in which
case the shader must end with a regular S_ENDPGM instruction.
See S_ENDPGM for additional variants.
Example:
...
s_endpgm // last real instruction in shader buffer
s_code_end // 1
s_code_end // 2
s_code_end // 3
s_code_end // 4
s_code_end // done!
0: reserved
1: SQ_PREFETCH_1_LINE -- prefetch 1 line
2: SQ_PREFETCH_2_LINES -- prefetch 2 lines
3: SQ_PREFETCH_3_LINES -- prefetch 3 lines
SALU
Export
Branch
Message
GDS
Some wait values are smaller than the counters: the max "wait"
value means "don't wait on this counter". For example, VM_VSRC
is 4 bits, but the wait field for VM_VSRC is only 3 bits. The
value 7 means don't wait on VM_VSRC, 6 means wait for VM_VSRC <=
6, etc. The wait value for VA_VCC is just 1 bit even though the
counter is 3 bits: 0 = wait for va_vcc==0, 1 = don't wait on
va_vcc.
none 0 - illegal
Instructions in this format may use a 32-bit literal constant, DPP or SDWA which occurs
immediately after the instruction.
1 V_CNDMASK_B32 Conditional mask on each thread. In VOP3 the VCC source may be a
scalar GPR specified in S2.u.
D.f32 =
S0.f16[0] * S1.f16[0] +
S0.f16[1] * S1.f16[1] + D.f32.
9 V_MUL_I32_I24 Multiply two signed 24-bit integers and store the result as a
signed 32-bit integer. This opcode is as efficient as basic
single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_I32_I24.
10 V_MUL_HI_I32_I24 Multiply two signed 24-bit integers and store the high 32 bits
of the result as a signed 32-bit integer. See also
V_MUL_I32_I24.
11 V_MUL_U32_U24 Multiply two unsigned 24-bit integers and store the result as an
unsigned 32-bit integer. This opcode is as efficient as basic
single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_U32_U24.
12 V_MUL_HI_U32_U24 Multiply two unsigned 24-bit integers and store the high 32 bits
of the result as an unsigned 32-bit integer. See also
V_MUL_U32_U24.
D.i32 =
S0.i8[0] * S1.i8[0] +
S0.i8[1] * S1.i8[1] +
S0.i8[2] * S1.i8[2] +
S0.i8[3] * S1.i8[3] + D.i32.
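The packed dot product above (four signed bytes per dword, accumulated into a 32-bit integer) can be sketched as follows; the function name is illustrative:

```python
def v_dot4_i32_i8(s0, s1, acc):
    """Multiply the four signed bytes of S0 and S1 pairwise and accumulate."""
    def sbyte(x, n):                      # extract byte n as a signed value
        b = (x >> (8 * n)) & 0xFF
        return b - 0x100 if b & 0x80 else b
    for n in range(4):
        acc += sbyte(s0, n) * sbyte(s1, n)
    return acc & 0xFFFFFFFF               # wrap to 32 bits
```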
D.f32 = min(S0.f32,S1.f32);
if (IEEE_MODE && S0.f == sNaN)
D.f = Quiet(S0.f);
else if (IEEE_MODE && S1.f == sNaN)
D.f = Quiet(S1.f);
else if (S0.f == NaN)
D.f = S1.f;
else if (S1.f == NaN)
D.f = S0.f;
else if (S0.f == +0.0 && S1.f == -0.0)
D.f = S1.f;
else if (S0.f == -0.0 && S1.f == +0.0)
D.f = S0.f;
else
// Note: there's no IEEE special case here like there is for
V_MAX_F32.
D.f = (S0.f < S1.f ? S0.f : S1.f);
endif.
D.f32 = max(S0.f32,S1.f32);
if (IEEE_MODE && S0.f == sNaN)
D.f = Quiet(S0.f);
else if (IEEE_MODE && S1.f == sNaN)
D.f = Quiet(S1.f);
else if (S0.f == NaN)
D.f = S1.f;
else if (S1.f == NaN)
D.f = S0.f;
else if (S0.f == +0.0 && S1.f == -0.0)
D.f = S0.f;
else if (S0.f == -0.0 && S1.f == +0.0)
D.f = S1.f;
else if (IEEE_MODE)
D.f = (S0.f >= S1.f ? S0.f : S1.f);
else
D.f = (S0.f > S1.f ? S0.f : S1.f);
endif.
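The V_MAX_F32 special-case ladder above can be sketched in Python. Python has no signaling NaNs, so the sNaN-quieting arms are folded into the generic NaN arms here; this is a simplification of the documented behavior, not a bit-exact model:

```python
import math

def v_max_f32(s0, s1, ieee_mode=True):
    """Sketch of V_MAX_F32: NaN inputs yield the other operand, and
    max(+0.0, -0.0) prefers the positive zero."""
    if math.isnan(s0):
        return s1
    if math.isnan(s1):
        return s0
    if s0 == 0.0 and s1 == 0.0:
        # +0.0 vs -0.0 compare equal in Python; pick the positive zero
        return s0 if math.copysign(1.0, s0) > 0.0 else s1
    if ieee_mode:
        return s0 if s0 >= s1 else s1
    return s0 if s0 > s1 else s1
```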
22 V_LSHRREV_B32 Logical shift right with shift count in the first operand.
24 V_ASHRREV_I32 Arithmetic shift right (preserve sign bit) with shift count in
the first operand.
26 V_LSHLREV_B32 Logical shift left with shift count in the first operand.
38 V_SUB_NC_U32 Subtract the second unsigned integer from the first unsigned
integer. No carry-in or carry-out.
39 V_SUBREV_NC_U32 Subtract the first unsigned integer from the second unsigned
integer. No carry-in or carry-out.
40 V_ADD_CO_CI_U32 Add two unsigned integers and a carry-in from VCC. Store the
result and also save the carry-out to VCC. In VOP3 the VCC
destination may be an arbitrary SGPR-pair, and the VCC source
comes from the SGPR-pair at S2.u.
41 V_SUB_CO_CI_U32 Subtract the second unsigned integer from the first unsigned
integer and then subtract a carry-in from VCC. Store the result
and also save the carry-out to VCC. In VOP3 the VCC destination
may be an arbitrary SGPR-pair, and the VCC source comes from the
SGPR-pair at S2.u.
42 V_SUBREV_CO_CI_U32 Subtract the first unsigned integer from the second unsigned
integer and then subtract a carry-in from VCC. Store the result
and also save the carry-out to VCC. In VOP3 the VCC destination
may be an arbitrary SGPR-pair, and the VCC source comes from the
SGPR-pair at S2.u.
D.f16_lo = f32_to_f16(S0.f32);
D.f16_hi = f32_to_f16(S1.f32).
// Round-toward-zero regardless of current round mode setting in
hardware.
51 V_SUB_F16 Subtract the second FP16 value from the first. 0.5ULP
precision. Supports denormals, round mode, exception flags and
saturation.
52 V_SUBREV_F16 Subtract the first FP16 value from the second. 0.5ULP
precision. Supports denormals, round mode, exception flags and
saturation.
55 V_FMAMK_F16 Multiply a FP16 value with a literal constant and add a second
FP16 value using fused multiply-add. This opcode cannot use the
VOP3 encoding and cannot use input/output modifiers. Supports
round mode, exception flags, saturation.
56 V_FMAAK_F16 Multiply two FP16 values and add a literal constant using fused
multiply-add. This opcode cannot use the VOP3 encoding and
cannot use input/output modifiers. Supports round mode,
exception flags, saturation.
D.f16 = max(S0.f16,S1.f16);
if (IEEE_MODE && S0.f16 == sNaN)
D.f16 = Quiet(S0.f16);
else if (IEEE_MODE && S1.f16 == sNaN)
D.f16 = Quiet(S1.f16);
else if (S0.f16 == NaN)
D.f16 = S1.f16;
else if (S1.f16 == NaN)
D.f16 = S0.f16;
else if (S0.f16 == +0.0 && S1.f16 == -0.0)
D.f16 = S0.f16;
else if (S0.f16 == -0.0 && S1.f16 == +0.0)
D.f16 = S1.f16;
else if (IEEE_MODE)
D.f16 = (S0.f16 >= S1.f16 ? S0.f16 : S1.f16);
else
D.f16 = (S0.f16 > S1.f16 ? S0.f16 : S1.f16);
endif.
D.f16 = min(S0.f16,S1.f16);
if (IEEE_MODE && S0.f16 == sNaN)
D.f16 = Quiet(S0.f16);
else if (IEEE_MODE && S1.f16 == sNaN)
D.f16 = Quiet(S1.f16);
else if (S0.f16 == NaN)
D.f16 = S1.f16;
else if (S1.f16 == NaN)
D.f16 = S0.f16;
else if (S0.f16 == +0.0 && S1.f16 == -0.0)
D.f16 = S1.f16;
else if (S0.f16 == -0.0 && S1.f16 == +0.0)
D.f16 = S0.f16;
else
// Note: there's no IEEE special case here like there is for
V_MAX_F16.
D.f16 = (S0.f16 < S1.f16 ? S0.f16 : S1.f16);
endif.
Instructions in this format may use a 32-bit literal constant, DPP or SDWA which occurs
immediately after the instruction.
D.u = S0.u.
Examples:
v_mov_b32 v0, v1 // Move v1 to v0
v_mov_b32 v0, -v1 // Set v0 to the negation of v1
v_mov_b32 v0, abs(v1) // Set v0 to the absolute value of v1
D.i = (int)S0.d.
D.d = (double)S0.i.
D.f = (float)S0.i.
D.f = (float)S0.u.
D.u = (unsigned)S0.f.
D.i = (int)S0.f.
D.f16 = flt32_to_flt16(S0.f).
D.f = flt16_to_flt32(S0.f16).
D.i = (int)floor(S0.f).
S0      Result
1000 -0.5000f
1001 -0.4375f
1010 -0.3750f
1011 -0.3125f
1100 -0.2500f
1101 -0.1875f
1110 -0.1250f
1111 -0.0625f
0000 +0.0000f
0001 +0.0625f
0010 +0.1250f
0011 +0.1875f
0100 +0.2500f
0101 +0.3125f
0110 +0.3750f
0111 +0.4375f
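The table above is exactly a signed 4-bit value scaled by 1/16; the arithmetic below comes from the table itself (the helper name is illustrative, and attributing the table to a specific convert opcode is left to the surrounding text):

```python
def nibble_to_float(code):
    """Decode a 4-bit code from the table above: sign-extend, scale by 1/16."""
    n = code & 0xF
    if n & 0x8:
        n -= 16          # sign-extend the nibble
    return n / 16.0
```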
D.f = (float)S0.d.
D.d = (double)S0.f.
D.f = (float)(S0.u[7:0]).
D.f = (float)(S0.u[15:8]).
D.f = (float)(S0.u[23:16]).
D.f = (float)(S0.u[31:24]).
D.u = (unsigned)S0.d.
D.d = (double)S0.u.
D.d = trunc(S0.d).
D.d = trunc(S0.d);
if(S0.d > 0.0 && S0.d != D.d) then
D.d += 1.0;
endif.
D.d = trunc(S0.d);
if(S0.d < 0.0 && S0.d != D.d) then
D.d += -1.0;
endif.
D.f = trunc(S0.f).
D.f = trunc(S0.f);
if(S0.f > 0.0 && S0.f != D.f) then
D.f += 1.0;
endif.
D.f = trunc(S0.f);
if(S0.f < 0.0 && S0.f != D.f) then
D.f += -1.0;
endif.
Functional examples:
D.f = log2(S0.f).
Functional examples:
Functional examples:
Unsigned usage:
CVT_F32_U32
RCP_IFLAG_F32
MUL_F32 (2**32 - 1)
CVT_U32_F32
Signed usage:
CVT_F32_I32
RCP_IFLAG_F32
MUL_F32 (2**31 - 1)
CVT_I32_F32
Functional examples:
47 V_RCP_F64 Reciprocal with IEEE rules. Precision is (2**29) ULP, and supports
denormals.
49 V_RSQ_F64 Reciprocal square root with IEEE rules. Precision is (2**29) ULP, and
supports denormals.
D.f = sqrt(S0.f).
Functional examples:
D.d = sqrt(S0.d).
Functional examples:
Functional examples:
D.u = ~S0.u.
D.u[31:0] = S0.u[0:31].
57 V_FFBH_U32 Counts how many zeros before the first one starting from the
MSB. Returns -1 if there are no ones.
Functional examples:
58 V_FFBL_B32 Returns the bit position of the first one from the LSB, or -1 if
there are no ones.
Functional examples:
59 V_FFBH_I32 Counts how many bits in a row (from MSB to LSB) are the same as
the sign bit. Returns -1 if all bits are the same.
Functional examples:
60 V_FREXP_EXP_I32_F64 Returns the exponent of a double-precision float input, such
that S0.d = significand * (2 ** exponent). See also V_FREXP_MANT_F64,
which returns the significand. See the C library function
frexp() for more information.
63 V_FREXP_EXP_I32_F32 Returns the exponent of a single-precision float input, such
that S0.f = significand * (2 ** exponent). See also V_FREXP_MANT_F32,
which returns the significand. See the C library function
frexp() for more information.
D.f16 = uint16_to_flt16(S.u16).
D.f16 = int16_to_flt16(S.i16).
D.u16 = flt16_to_uint16(S.f16).
D.i16 = flt16_to_int16(S.f16).
Functional examples:
D.f16 = sqrt(S0.f16).
Functional examples:
Functional examples:
D.f16 = log2(S0.f16).
Functional examples:
Functional examples:
90 V_FREXP_EXP_I16_F16 Returns the exponent of a half-precision float input, such
that S0.f16 = significand * (2 ** exponent). See also V_FREXP_MANT_F16,
which returns the significand. See the C library function
frexp() for more information.
D.f16 = trunc(S0.f16);
if(S0.f16 < 0.0f && S0.f16 != D.f16) then
D.f16 -= 1.0;
endif.
D.f16 = trunc(S0.f16);
if(S0.f16 > 0.0f && S0.f16 != D.f16) then
D.f16 += 1.0;
endif.
D.f16 = trunc(S0.f16).
Functional examples:
Functional examples:
D.i16 = flt16_to_snorm16(S.f16).
D.u16 = flt16_to_unorm16(S.f16).
101 V_SWAP_B32 Swap operands. Input and output modifiers not supported; this
is an untyped operation.
tmp = D.u;
D.u = S0.u;
S0.u = tmp.
104 V_SWAPREL_B32 Swap operands. Input and output modifiers not supported; this
is an untyped operation. The two addresses are relatively
indexed using M0.
where:
Compare instructions perform the same compare operation on each lane (workItem or thread)
using that lane’s private data, and producing a 1 bit result per lane into VCC or EXEC.
Instructions in this format may use a 32-bit literal constant or SDWA which occurs immediately
after the instruction.
• Those which can use one of 16 compare operations (floating point types). "{COMPF}"
• Those which can use one of 8 compare operations (integer types). "{COMPI}"
The opcode numbers are arranged so that each opcode can be calculated from a base
opcode number for the data type, plus an offset for the specific compare operation.
F 0 D.u = 0
TRU 15 D.u = 1
F 0 D.u = 0
TRU 7 D.u = 1
V_CMP_{COMPI}_U16 16-bit unsigned integer compare. 0xA8 - 0xAF
V_CMPX_{COMPI}_U16 16-bit unsigned integer compare. Also writes EXEC. 0xB8 - 0xBF
V_CMP_{COMPI}_U32 32-bit unsigned integer compare. 0xC8 - 0xCF
V_CMPX_{COMPI}_U32 32-bit unsigned integer compare. Also writes EXEC. 0xD8 - 0xDF
V_CMP_{COMPI}_U64 64-bit unsigned integer compare. 0xE8 - 0xEF
V_CMPX_{COMPI}_U64 64-bit unsigned integer compare. Also writes EXEC. 0xF8 - 0xFF
0 V_CMP_F_F32 D[threadId] = 0.
// D = VCC in VOPC encoding.
15 V_CMP_TRU_F32 D[threadId] = 1.
// D = VCC in VOPC encoding.
16 V_CMPX_F_F32 EXEC[threadId] = 0.
31 V_CMPX_TRU_F32 EXEC[threadId] = 1.
32 V_CMP_F_F64 D[threadId] = 0.
// D = VCC in VOPC encoding.
47 V_CMP_TRU_F64 D[threadId] = 1.
// D = VCC in VOPC encoding.
48 V_CMPX_F_F64 EXEC[threadId] = 0.
63 V_CMPX_TRU_F64 EXEC[threadId] = 1.
136 V_CMP_CLASS_F32 VCC = IEEE numeric class function specified in S1.u, performed on S0.f.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
143 V_CMP_CLASS_F16 VCC = IEEE numeric class function specified in S1.u, performed on
S0.f16.
Note that S1 has an f16 format, since floating-point literal constants are
interpreted as 16-bit values for this opcode.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
152 V_CMPX_CLASS_F32 EXEC = IEEE numeric class function specified in S1.u, performed on
S0.f.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
159 V_CMPX_CLASS_F16 EXEC = IEEE numeric class function specified in S1.u, performed on
S0.f16.
Note that S1 has an f16 format, since floating-point literal constants are
interpreted as 16-bit values for this opcode.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
168 V_CMP_CLASS_F64 VCC = IEEE numeric class function specified in S1.u, performed on S0.d.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
184 V_CMPX_CLASS_F64 EXEC = IEEE numeric class function specified in S1.u, performed on
S0.d.
The function reports true if the floating point value is *any* of the
numeric types selected in S1.u according to the following list:
When the CLAMP microcode bit is set to 1, these compare instructions signal an exception
when either of the inputs is NaN. When CLAMP is set to zero, NaN does not signal an
exception. The second eight VOPC instructions have {OP8} embedded in them. This refers to
each of the compare operations listed below.
where:
4 V_PK_LSHLREV_B16 Packed logical shift left. The shift count is in the first
operand.
5 V_PK_LSHRREV_B16 Packed logical shift right. The shift count is in the first
operand.
6 V_PK_ASHRREV_I16 Packed arithmetic shift right (preserve sign bit). The shift
count is in the first operand.
D.f32 =
S0.f16[0] * S1.f16[0] +
S0.f16[1] * S1.f16[1] + S2.f32.
D.i32 =
S0.i16[0] * S1.i16[0] +
S0.i16[1] * S1.i16[1] + S2.i32.
D.u32 =
S0.u16[0] * S1.u16[0] +
S0.u16[1] * S1.u16[1] + S2.u32.
D.i32 =
S0.i8[0] * S1.i8[0] +
S0.i8[1] * S1.i8[1] +
S0.i8[2] * S1.i8[2] +
S0.i8[3] * S1.i8[3] + S2.i32.
D.u32 =
S0.u8[0] * S1.u8[0] +
S0.u8[1] * S1.u8[1] +
S0.u8[2] * S1.u8[2] +
S0.u8[3] * S1.u8[3] + S2.u32.
D.i32 =
S0.i4[0] * S1.i4[0] +
S0.i4[1] * S1.i4[1] +
S0.i4[2] * S1.i4[2] +
S0.i4[3] * S1.i4[3] +
S0.i4[4] * S1.i4[4] +
S0.i4[5] * S1.i4[5] +
S0.i4[6] * S1.i4[6] +
S0.i4[7] * S1.i4[7] + S2.i32.
D.u32 =
S0.u4[0] * S1.u4[0] +
S0.u4[1] * S1.u4[1] +
S0.u4[2] * S1.u4[2] +
S0.u4[3] * S1.u4[3] +
S0.u4[4] * S1.u4[4] +
S0.u4[5] * S1.u4[5] +
S0.u4[6] * S1.u4[6] +
S0.u4[7] * S1.u4[7] + S2.u32.
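The packed dot-product definitions above share one shape: multiply each packed field, accumulate into S2, and wrap the sum to 32 bits. A minimal Python model of that shape (the function and helper names are my own, not ISA terms; 32-bit wraparound on the accumulator is assumed):

```python
def sext(v, bits):
    # Sign-extend a bits-wide field to a Python int.
    mask = (1 << bits) - 1
    v &= mask
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def dot_packed(s0, s1, s2, elems, bits, signed):
    # Emulates the V_DOT{2,4,8} pattern: elems fields of width bits from
    # s0 and s1 are multiplied pairwise and accumulated into s2; the
    # result wraps to 32 bits.
    acc = sext(s2, 32) if signed else s2 & 0xFFFFFFFF
    for i in range(elems):
        a = (s0 >> (i * bits)) & ((1 << bits) - 1)
        b = (s1 >> (i * bits)) & ((1 << bits) - 1)
        if signed:
            a, b = sext(a, bits), sext(b, bits)
        acc += a * b
    return acc & 0xFFFFFFFF
```

For example, the u8 variant with S0 = 0x01020304 and S1 = 0x01010101 accumulates 1+2+3+4 onto S2.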
D.f = {P10,P20,P0}[S1.u].
VOP3B: this encoding allows specifying a unique scalar destination, and is used only for:
V_ADD_CO_U32
V_SUB_CO_U32
V_SUBREV_CO_U32
V_ADDC_CO_U32
V_SUBB_CO_U32
V_SUBBREV_CO_U32
V_DIV_SCALE_F32
V_DIV_SCALE_F64
V_MAD_U64_U32
V_MAD_I64_I32
320 V_FMA_LEGACY_F32 Multiply and add single-precision values. Follows DX9 rules
where 0.0 times anything produces 0.0 (this is not IEEE
compliant).
322 V_MAD_I32_I24 Multiply two signed 24-bit integers, add a signed 32-bit integer
and store the result as a signed 32-bit integer. This opcode is
as efficient as basic single-precision opcodes since it utilizes
the single-precision floating point multiplier.
323 V_MAD_U32_U24 Multiply two unsigned 24-bit integers, add an unsigned 32-bit
integer and store the result as an unsigned 32-bit integer.
This opcode is as efficient as basic single-precision opcodes
since it utilizes the single-precision floating point
multiplier.
331 V_FMA_F32 Fused single precision multiply add. 0.5ULP accuracy, denormals
are supported.
333 V_LERP_U8 Unsigned 8-bit pixel average on packed unsigned bytes (linear
interpolation). S2 acts as a round mode; if set, 0.5 rounds up,
otherwise 0.5 truncates.
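The per-byte averaging can be sketched in Python; this is a hedged reading of the description, assuming bit 0 of each byte of S2 supplies that byte's rounding bit:

```python
def v_lerp_u8(s0, s1, s2):
    # Average each of the four packed unsigned bytes; the LSB of the
    # matching byte of s2 decides whether a 0.5 result rounds up (1)
    # or truncates (0).
    d = 0
    for i in range(4):
        a = (s0 >> (i * 8)) & 0xFF
        b = (s1 >> (i * 8)) & 0xFF
        r = (s2 >> (i * 8)) & 1
        d |= (((a + b + r) >> 1) & 0xFF) << (i * 8)
    return d
```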
336 V_MULLIT_F32 Multiply for lighting. Specific rules apply: 0.0 * x = 0.0;
Specific INF, NaN, overflow rules.
350 V_CVT_PK_U8_F32 Convert floating point value S0 to 8-bit unsigned integer and
pack the result into byte S1 of dword S2.
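The byte-insert behavior can be modeled as follows (a sketch; the clamp to 0..255 and round-toward-zero conversion are assumptions, as the entry above does not spell out the rounding mode):

```python
def v_cvt_pk_u8_f32(s0, s1, s2):
    # Clamp s0 to [0, 255], convert to an unsigned byte (truncation
    # assumed), and insert it into byte lane s1[1:0] of s2, leaving
    # the other three bytes untouched.
    val = min(255, max(0, int(s0)))
    lane = s1 & 3
    shift = lane * 8
    return (s2 & ~(0xFF << shift)) | (val << shift)
```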
sign_out = sign(S1.f)^sign(S2.f);
if (S2.f == NAN)
D.f = Quiet(S2.f);
else if (S1.f == NAN)
D.f = Quiet(S1.f);
else if (S1.f == S2.f == 0)
// 0/0
D.f = 0xffc0_0000;
else if (abs(S1.f) == abs(S2.f) == +-INF)
// inf/inf
D.f = 0xffc0_0000;
else if (S1.f == 0 || abs(S2.f) == +-INF)
// x/0, or inf/y
D.f = sign_out ? -INF : +INF;
else if (abs(S1.f) == +-INF || S2.f == 0)
// x/inf, 0/y
D.f = sign_out ? -0 : 0;
else if ((exponent(S2.f) - exponent(S1.f)) < -150)
D.f = sign_out ? -underflow : underflow;
else if (exponent(S1.f) == 255)
D.f = sign_out ? -overflow : overflow;
else
D.f = sign_out ? -abs(S0.f) : abs(S0.f);
endif.
sign_out = sign(S1.d)^sign(S2.d);
if (S2.d == NAN)
D.d = Quiet(S2.d);
else if (S1.d == NAN)
D.d = Quiet(S1.d);
else if (S1.d == S2.d == 0)
// 0/0
D.d = 0xfff8_0000_0000_0000;
else if (abs(S1.d) == abs(S2.d) == +-INF)
// inf/inf
D.d = 0xfff8_0000_0000_0000;
else if (S1.d == 0 || abs(S2.d) == +-INF)
// x/0, or inf/y
D.d = sign_out ? -INF : +INF;
else if (abs(S1.d) == +-INF || S2.d == 0)
// x/inf, 0/y
D.d = sign_out ? -0 : 0;
else if ((exponent(S2.d) - exponent(S1.d)) < -1075)
D.d = sign_out ? -underflow : underflow;
else if (exponent(S1.d) == 2047)
D.d = sign_out ? -overflow : overflow;
else
D.d = sign_out ? -abs(S0.d) : abs(S0.d);
endif.
361 V_MUL_LO_U32 Multiply two unsigned integers. If you only need to multiply
integers with small magnitudes consider V_MUL_U32_U24, which is
a faster implementation.
362 V_MUL_HI_U32 Multiply two unsigned integers and store the high 32 bits of the
result. If you only need to multiply integers with small
magnitudes consider V_MUL_HI_U32_U24, which is a faster
implementation.
364 V_MUL_HI_I32 Multiply two signed integers and store the high 32 bits of the
result. If you only need to multiply integers with small
magnitudes consider V_MUL_HI_I32_I24, which is a faster
implementation.
VCC = 0;
if (S2.f == 0 || S1.f == 0)
D.f = NAN
else if (exponent(S2.f) - exponent(S1.f) >= 96)
// N/D near MAX_FLOAT
VCC = 1;
if (S0.f == S1.f)
// Only scale the denominator
D.f = ldexp(S0.f, 64);
end if
else if (S1.f == DENORM)
D.f = ldexp(S0.f, 64);
else if (1 / S1.f == DENORM && S2.f / S1.f == DENORM)
VCC = 1;
if (S0.f == S1.f)
// Only scale the denominator
D.f = ldexp(S0.f, 64);
end if
else if (1 / S1.f == DENORM)
D.f = ldexp(S0.f, -64);
else if (S2.f / S1.f == DENORM)
VCC = 1;
if (S0.f == S2.f)
// Only scale the numerator
D.f = ldexp(S0.f, 64);
end if
else if (exponent(S2.f) <= 23)
// Numerator is tiny
D.f = ldexp(S0.f, 64);
end if.
VCC = 0;
if (S2.d == 0 || S1.d == 0)
D.d = NAN
else if (exponent(S2.d) - exponent(S1.d) >= 768)
// N/D near MAX_FLOAT
VCC = 1;
if (S0.d == S1.d)
// Only scale the denominator
D.d = ldexp(S0.d, 128);
end if
else if (S1.d == DENORM)
D.d = ldexp(S0.d, 128);
else if (1 / S1.d == DENORM && S2.d / S1.d == DENORM)
VCC = 1;
if (S0.d == S1.d)
// Only scale the denominator
D.d = ldexp(S0.d, 128);
end if
else if (1 / S1.d == DENORM)
D.d = ldexp(S0.d, -128);
else if (S2.d / S1.d == DENORM)
VCC = 1;
if (S0.d == S2.d)
// Only scale the numerator
D.d = ldexp(S0.d, 128);
end if
else if (exponent(S2.d) <= 53)
// Numerator is tiny
D.d = ldexp(S0.d, 128);
end if.
if (VCC[threadId])
D.f = 2**32 * (S0.f * S1.f + S2.f);
else
D.f = S0.f * S1.f + S2.f;
end if.
if (VCC[threadId])
D.d = 2**64 * (S0.d * S1.d + S2.d);
else
D.d = S0.d * S1.d + S2.d;
end if.
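The conditional post-scale shown above (used by the divide macro sequence to undo earlier exponent scaling) reduces to an fma plus an optional ldexp. A Python sketch of the f32 case (Python floats are doubles, so this only illustrates the control flow, not f32 rounding):

```python
import math

def div_fmas_f32(s0, s1, s2, vcc_bit):
    # Fused multiply-add, with a 2**32 post-scale applied when the
    # lane's VCC bit is set.
    r = s0 * s1 + s2
    return math.ldexp(r, 32) if vcc_bit else r
```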
372 V_TRIG_PREOP_F64 Look Up 2/PI (S0.d) with segment select S1.u[4:0]. This
operation returns an aligned, double precision segment of 2/PI
needed to do range reduction on S0.d (double-precision value).
Multiple segments can be specified through S1.u[4:0]. Rounding
is round-to-zero. Large inputs (exp > 1968) are scaled to avoid
loss of precision through denormalization.
374 V_MAD_U64_U32 Multiply and add unsigned integers and produce a 64-bit
result.
375 V_MAD_I64_I32 Multiply and add signed integers and produce a 64-bit result.
376 V_XOR3_B32 Bitwise XOR of three inputs. Input and output modifiers not
supported.
767 V_LSHLREV_B64 Logical shift left, count is in the first operand. Only one
scalar broadcast constant is allowed.
768 V_LSHRREV_B64 Logical shift right, count is in the first operand. Only one
scalar broadcast constant is allowed.
769 V_ASHRREV_I64 Arithmetic shift right (preserve sign bit), count is in the
first operand. Only one scalar broadcast constant is allowed.
771 V_ADD_NC_U16 Add two unsigned shorts. Supports saturation (unsigned 16-bit
integer domain). No carry-in or carry-out.
772 V_SUB_NC_U16 Subtract the second unsigned short from the first. Supports
saturation (unsigned 16-bit integer domain). No carry-in or
carry-out.
776 V_ASHRREV_I16 Arithmetic shift right (preserve sign bit), count is in the
first operand.
781 V_ADD_NC_I16 Add two signed shorts. Supports saturation (signed 16-bit
integer domain). No carry-in or carry-out.
782 V_SUB_NC_I16 Subtract the second signed short from the first. Supports
saturation (signed 16-bit integer domain). No carry-in or
carry-out.
783 V_ADD_CO_U32 Add two unsigned integers with carry-out. In VOP3 the VCC
destination may be an arbitrary SGPR-pair.
784 V_SUB_CO_U32 Subtract the second unsigned integer from the first with
carry-out. In VOP3 the VCC destination may be an arbitrary
SGPR-pair.
D[31:16].f16 = S1.f16;
D[15:0].f16 = S0.f16.
786 V_CVT_PKNORM_I16_F16 Convert two FP16 values into packed signed normalized
shorts.
D = {(snorm)S1.f16, (snorm)S0.f16}.
787 V_CVT_PKNORM_U16_F16 Convert two FP16 values into packed unsigned normalized
shorts.
D = {(unorm)S1.f16, (unorm)S0.f16}.
793 V_SUBREV_CO_U32 Subtract the first unsigned integer from the second with
carry-out. In VOP3 the VCC destination may be an arbitrary
SGPR-pair.
832 V_MAD_U16 Multiply and add unsigned shorts. Supports saturation (unsigned
16-bit integer domain).
834 V_INTERP_P1LL_F16 FP16 parameter interpolation. `LL' stands for `two LDS
arguments'. attr_word selects the high or low half 16 bits of
each LDS dword accessed. This opcode is available for 32-bank
LDS only.
835 V_INTERP_P1LV_F16 FP16 parameter interpolation. `LV' stands for `One LDS and one
VGPR argument'. S2 holds two parameters, attr_word selects the
high or low word of the VGPR for this calculation, as well as
the high or low half of the LDS data. Meant for use with
16-bank LDS.
839 V_ADD_LSHL_U32 Add and then logical shift left the result.
843 V_FMA_F16 Fused half precision multiply add of FP16 values. 0.5ULP
accuracy, denormals are supported.
862 V_MAD_I16 Multiply and add signed short values. Supports saturation
(signed 16-bit integer domain).
sign_out = sign(S1.f16)^sign(S2.f16);
if (S2.f16 == NAN)
D.f16 = Quiet(S2.f16);
else if (S1.f16 == NAN)
D.f16 = Quiet(S1.f16);
else if (S1.f16 == S2.f16 == 0)
// 0/0
D.f16 = 0xfe00;
else if (abs(S1.f16) == abs(S2.f16) == +-INF)
// inf/inf
D.f16 = 0xfe00;
else if (S1.f16 == 0 || abs(S2.f16) == +-INF)
// x/0, or inf/y
D.f16 = sign_out ? -INF : +INF;
else if (abs(S1.f16) == +-INF || S2.f16 == 0)
// x/inf, 0/y
D.f16 = sign_out ? -0 : 0;
else
D.f16 = sign_out ? -abs(S0.f16) : abs(S0.f16);
end if.
864 V_READLANE_B32 Copy one VGPR value to one SGPR. D = SGPR-dest, S0 = Source
Data (VGPR# or M0(lds-direct)), S1 = Lane Select (SGPR or M0).
Lane is S1 % (32 if wave32, 64 if wave64). Ignores exec mask.
if(wave32)
SMEM[D_ADDR] = VMEM[S0_ADDR][S1[4:0]]; // For wave32
else
SMEM[D_ADDR] = VMEM[S0_ADDR][S1[5:0]]; // For wave64
endif.
865 V_WRITELANE_B32 Write value into one VGPR in one lane. D = VGPR-dest, S0 =
Source Data (sgpr, m0, exec or constants), S1 = Lane Select
(SGPR or M0). Lane is S1 % (32 if wave32, 64 if wave64).
Ignores exec mask.
if(wave32)
VMEM[D_ADDR][S1[4:0]] = SMEM[S0_ADDR]; // For wave32
else
VMEM[D_ADDR][S1[5:0]] = SMEM[S0_ADDR]; // For wave64
endif.
867 V_BFM_B32 Bitfield modify. S0 is the bitfield width and S1 is the bitfield
offset.
868 V_BCNT_U32_B32 Bit count.
D.u = S1.u;
for i in 0 .. 31 do
D.u += S0.u[i]; // count i'th bit
endfor.
869 V_MBCNT_LO_U32_B32 Masked bit count, ThreadPosition is the position of this
thread in the wavefront (in 0..63). See also V_MBCNT_HI_U32_B32.
870 V_MBCNT_HI_U32_B32 Masked bit count, ThreadPosition is the position of this
thread in the wavefront (in 0..63). See also V_MBCNT_LO_U32_B32.
Note that in Wave32 mode ThreadMask[63:32] == 0 and this
instruction simply performs a move from S1 to D.
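Chaining the two opcodes yields a lane's prefix count of set mask bits below its own position (the usual way to compute a compacted output index from EXEC). A Python model, assuming ThreadMask masks S0 to the bits strictly below the lane:

```python
def v_mbcnt_lo(s0, s1, lane):
    # Count set bits of s0 at positions below the lane (capped at 32),
    # then add the accumulator s1.
    thread_mask = (1 << min(lane, 32)) - 1
    return bin(s0 & thread_mask).count("1") + s1

def v_mbcnt_hi(s0, s1, lane):
    # Same for bits 63:32 of the wave; adds nothing for lanes below 32.
    thread_mask = (1 << max(lane - 32, 0)) - 1
    return bin(s0 & thread_mask).count("1") + s1

# Prefix count of active lanes below `lane` in a wave64 exec mask:
#   v_mbcnt_hi(exec_hi, v_mbcnt_lo(exec_lo, 0, lane), lane)
```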
D.i16_lo = (snorm)S0.f32;
D.i16_hi = (snorm)S1.f32.
D.u16_lo = (unorm)S0.f32;
D.u16_hi = (unorm)S1.f32.
874 V_CVT_PK_U16_U32 Convert two unsigned integers into a packed unsigned short.
D.u16_lo = u32_to_u16(S0.u32);
D.u16_hi = u32_to_u16(S1.u32).
875 V_CVT_PK_I16_I32 Convert two signed integers into a packed signed short.
D.i16_lo = i32_to_i16(S0.i32);
D.i16_hi = i32_to_i16(S1.i32).
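Assuming i32_to_i16 saturates to the signed 16-bit range (a hedged reading; a truncating variant would instead drop the high bits), the signed pack can be modeled as:

```python
def v_cvt_pk_i16_i32(s0, s1):
    # Saturate each signed 32-bit input to [-0x8000, 0x7FFF] and pack:
    # s0 lands in the low word, s1 in the high word.
    def sat16(v):
        return max(-0x8000, min(0x7FFF, v))
    return ((sat16(s1) & 0xFFFF) << 16) | (sat16(s0) & 0xFFFF)
```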
886 V_SUB_NC_I32 Subtract the second signed integer from the first. No carry-in
or carry-out. Supports saturation (signed 32-bit integer
domain).
The first source must be a VGPR and the second and third
sources must be scalar values; the second and third source
are combined into a single 64-bit value representing lane
selects used to swizzle within each row.
ABS, NEG and OMOD modifiers should all be zeroed for this
instruction.
888 V_PERMLANEX16_B32 Perform arbitrary gather-style operation across two rows
(each row is 16 contiguous lanes).
The first source must be a VGPR and the second and third
sources must be scalar values; the second and third source
are combined into a single 64-bit value representing lane
selects used to swizzle within each row.
ABS, NEG and OMOD modifiers should all be zeroed for this
instruction.
s_mov_b32 exec_lo, 0x7fff7fff; // Lanes getting data from their own row
s_mov_b32 s0, 0x87654321;
s_mov_b32 s1, 0x0fedcba9;
v_permlane16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 14 and 30
s_mov_b32 exec_lo, 0x80008000; // Lanes getting data from the other row
v_permlanex16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 15 and 31
where:
OFFSET0 = Unsigned byte offset added to the address from the ADDR VGPR.
OFFSET1 = Unsigned byte offset added to the address from the ADDR VGPR.
GDS = Set if GDS, cleared if LDS.
OP = DS instruction opcode
ADDR = Source LDS address VGPR 0 - 255.
DATA0 = Source data0 VGPR 0 - 255.
DATA1 = Source data1 VGPR 0 - 255.
VDST = Destination VGPR 0 - 255.
All instructions with RTN in the name return the value that was in memory
before the operation was performed.
0 DS_ADD_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
1 DS_SUB_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
2 DS_RSUB_U32 // 32bit
addr = VGPR[ADDR]+{INST1,INST0};
tmp = DS[addr].u32;
DS[addr].u32 = VGPR[DATA0].u32-DS[addr].u32;
VGPR[VDST].u32 = tmp.
3 DS_INC_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned compare
RETURN_DATA = tmp.
4 DS_DEC_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1; //
unsigned compare
RETURN_DATA = tmp.
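Note that DS_INC_U32 and DS_DEC_U32 wrap against DATA rather than against 2^32. A small Python model of the update rule, returning the new memory value along with the pre-op value (as the RTN forms do):

```python
def ds_inc_u32(mem, data):
    # Increment, resetting to 0 once the stored value reaches data.
    return (0 if mem >= data else mem + 1), mem

def ds_dec_u32(mem, data):
    # Decrement, reloading data when the value is 0 or exceeds data.
    return (data if mem == 0 or mem > data else mem - 1), mem
```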
5 DS_MIN_I32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
6 DS_MAX_I32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
7 DS_MIN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
8 DS_MAX_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
9 DS_AND_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
10 DS_OR_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
11 DS_XOR_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
12 DS_MSKOR_B32 Masked dword OR, D0 contains the mask and D1 contains the new
value.
// 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (MEM[ADDR] & ~DATA) | DATA2;
RETURN_DATA = tmp.
13 DS_WRITE_B32 // 32bit
MEM[ADDR] = DATA.
14 DS_WRITE2_B32 // 32bit
MEM[ADDR + OFFSET0 * 4] = DATA;
MEM[ADDR + OFFSET1 * 4] = DATA2.
15 DS_WRITE2ST64_B32 // 32bit
MEM[ADDR + OFFSET0 * 4 * 64] = DATA;
MEM[ADDR + OFFSET1 * 4 * 64] = DATA2.
16 DS_CMPST_B32 Compare and store. Caution, the order of src and cmp is the
*opposite* of the BUFFER_ATOMIC_CMPSWAP opcode.
// 32bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
17 DS_CMPST_F32 // 32bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
18 DS_MIN_F32 // 32bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (cmp < tmp) ? src : tmp.
19 DS_MAX_F32 // 32bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (tmp > cmp) ? src : tmp.
20 DS_NOP Do nothing.
24 DS_GWS_SEMA_RELEASE_ALL GDS Only: The GWS resource (rid) indicated will process
this opcode by updating the counter and labeling the
specified resource as a semaphore.
26 DS_GWS_SEMA_V GDS Only: The GWS resource indicated will process this opcode by
updating the counter and labeling the resource as a
semaphore.
This action will release one wave if any are queued in this
resource.
27 DS_GWS_SEMA_BR GDS Only: The GWS resource indicated will process this opcode by
updating the counter by the bulk release delivered count and
labeling the resource as a semaphore.
28 DS_GWS_SEMA_P GDS Only: The GWS resource indicated will process this opcode by
queueing it until counter enables a release and then
decrementing the counter of the resource as a semaphore.
29 DS_GWS_BARRIER GDS Only: The GWS resource indicated will process this opcode by
queueing it until barrier is satisfied. The number of waves
needed is passed in as DATA of first valid thread.
Since the waves deliver the count for the next barrier, this
function can have a different size barrier for each
occurrence.
// Release Machine
if(state.type == BARRIER) then
if(state.flag != thread.flag) then
return rd_done;
endif;
endif.
30 DS_WRITE_B8 MEM[ADDR] = DATA[7:0].
31 DS_WRITE_B16 MEM[ADDR] = DATA[15:0].
32 DS_ADD_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
33 DS_SUB_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
34 DS_RSUB_RTN_U32 // 32bit
addr = VGPR[ADDR]+{INST1,INST0};
tmp = DS[addr].u32;
DS[addr].u32 = VGPR[DATA0].u32-DS[addr].u32;
VGPR[VDST].u32 = tmp.
35 DS_INC_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned compare
RETURN_DATA = tmp.
36 DS_DEC_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1; //
unsigned compare
RETURN_DATA = tmp.
37 DS_MIN_RTN_I32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
38 DS_MAX_RTN_I32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
39 DS_MIN_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
40 DS_MAX_RTN_U32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
41 DS_AND_RTN_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
42 DS_OR_RTN_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
43 DS_XOR_RTN_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
44 DS_MSKOR_RTN_B32 Masked dword OR, D0 contains the mask and D1 contains the new
value.
// 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (MEM[ADDR] & ~DATA) | DATA2;
RETURN_DATA = tmp.
45 DS_WRXCHG_RTN_B32 // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
48 DS_CMPST_RTN_B32 Compare and store. Caution, the order of src and cmp is the
*opposite* of the BUFFER_ATOMIC_CMPSWAP opcode.
// 32bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
49 DS_CMPST_RTN_F32 // 32bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
50 DS_MIN_RTN_F32 // 32bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (cmp < tmp) ? src : tmp.
51 DS_MAX_RTN_F32 // 32bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (tmp > cmp) ? src : tmp.
53 DS_SWIZZLE_B32 Dword swizzle, no data is written to LDS memory. See next section
for details.
54 DS_READ_B32 RETURN_DATA = MEM[ADDR].
57 DS_READ_I8 RETURN_DATA = signext(MEM[ADDR][7:0]).
58 DS_READ_U8 RETURN_DATA = {24'h0,MEM[ADDR][7:0]}.
59 DS_READ_I16 RETURN_DATA = signext(MEM[ADDR][15:0]).
60 DS_READ_U16 RETURN_DATA = {16'h0,MEM[ADDR][15:0]}.
Inside DS --- Do one atomic add for first valid lane and
broadcast result to all valid lanes. Offset = offset1:offset0,
interpreted as byte offset --- For 10xx LDS designs only aligned
atomics are supported, so the 2 LSBs of the offset must be set to zero.
Inside DS --- Do one atomic add for first valid lane and
broadcast result to all valid lanes. Offset = offset1:offset0,
interpreted as byte offset --- For 10xx LDS designs only aligned
atomics are supported, so the 2 LSBs of the offset must be set to zero.
The shader will send the following pipeline_ID to the ordered count
unit, to be used to select the correct pipeline's tracking data.
Additionally, pixel waves will use 4 counters depending on
the packer sourcing the pixel waves and generating the launch order.
If this control is not set, hold the crawler until the wave does an
additional access with the wave_release set to release the wave. This
feature allows one wavefront to issue serial accesses to any of the
64 DS_ADD_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA[0:1];
RETURN_DATA[0:1] = tmp.
65 DS_SUB_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA[0:1];
RETURN_DATA[0:1] = tmp.
66 DS_RSUB_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA - MEM[ADDR];
RETURN_DATA = tmp.
67 DS_INC_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA[0:1]) ? 0 : tmp + 1; // unsigned compare
RETURN_DATA[0:1] = tmp.
68 DS_DEC_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA[0:1]) ? DATA[0:1] : tmp - 1;
// unsigned compare
RETURN_DATA[0:1] = tmp.
69 DS_MIN_I64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
70 DS_MAX_I64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
71 DS_MIN_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; // unsigned
compare
RETURN_DATA[0:1] = tmp.
72 DS_MAX_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; // unsigned
compare
RETURN_DATA[0:1] = tmp.
73 DS_AND_B64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA[0:1];
RETURN_DATA[0:1] = tmp.
74 DS_OR_B64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA[0:1];
RETURN_DATA[0:1] = tmp.
75 DS_XOR_B64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA[0:1];
RETURN_DATA[0:1] = tmp.
76 DS_MSKOR_B64 Masked dword OR, D0 contains the mask and D1 contains the new
value.
// 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (MEM[ADDR] & ~DATA) | DATA2;
RETURN_DATA = tmp.
77 DS_WRITE_B64 // 64bit
MEM[ADDR] = DATA.
78 DS_WRITE2_B64 // 64bit
MEM[ADDR + OFFSET0 * 8] = DATA;
MEM[ADDR + OFFSET1 * 8] = DATA2.
79 DS_WRITE2ST64_B64 // 64bit
MEM[ADDR + OFFSET0 * 8 * 64] = DATA;
MEM[ADDR + OFFSET1 * 8 * 64] = DATA2.
80 DS_CMPST_B64 Compare and store. Caution, the order of src and cmp is the
*opposite* of the BUFFER_ATOMIC_CMPSWAP_X2 opcode.
// 64bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
81 DS_CMPST_F64 // 64bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
82 DS_MIN_F64 // 64bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (cmp < tmp) ? src : tmp.
83 DS_MAX_F64 // 64bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (tmp > cmp) ? src : tmp.
96 DS_ADD_RTN_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA[0:1];
RETURN_DATA[0:1] = tmp.
97 DS_SUB_RTN_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA[0:1];
RETURN_DATA[0:1] = tmp.
98 DS_RSUB_RTN_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA - MEM[ADDR];
RETURN_DATA = tmp.
99 DS_INC_RTN_U64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA[0:1]) ? 0 : tmp + 1; // unsigned compare
RETURN_DATA[0:1] = tmp.
108 DS_MSKOR_RTN_B64 Masked dword OR, D0 contains the mask and D1 contains the new
value.
// 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (MEM[ADDR] & ~DATA) | DATA2;
RETURN_DATA = tmp.
109 DS_WRXCHG_RTN_B64 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
112 DS_CMPST_RTN_B64 Compare and store. Caution, the order of src and cmp is the
*opposite* of the BUFFER_ATOMIC_CMPSWAP_X2 opcode.
// 64bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
113 DS_CMPST_RTN_F64 Floating point compare and store that handles NaN/INF/denormal
values. Caution, the order of src and cmp is the *opposite* of
the BUFFER_ATOMIC_FCMPSWAP_X2 opcode.
// 64bit
tmp = MEM[ADDR];
src = DATA2;
cmp = DATA;
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
114 DS_MIN_RTN_F64 // 64bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (cmp < tmp) ? src : tmp.
115 DS_MAX_RTN_F64 // 64bit
tmp = MEM[ADDR];
src = DATA;
cmp = DATA2;
MEM[ADDR] = (tmp > cmp) ? src : tmp.
RETURN_DATA = MEM[ADDR].
160 DS_WRITE_B8_D16_HI MEM[ADDR] = DATA[23:16].
161 DS_WRITE_B16_D16_HI MEM[ADDR] = DATA[31:16].
162 DS_READ_U8_D16 Unsigned byte read with masked return to lower word.
RETURN_DATA[15:0] = {8'h0,MEM[ADDR][7:0]}.
163 DS_READ_U8_D16_HI Unsigned byte read with masked return to upper word.
RETURN_DATA[31:16] = {8'h0,MEM[ADDR][7:0]}.
164 DS_READ_I8_D16 Signed byte read with masked return to lower word.
RETURN_DATA[15:0] = signext(MEM[ADDR][7:0]).
165 DS_READ_I8_D16_HI Signed byte read with masked return to upper word.
RETURN_DATA[31:16] = signext(MEM[ADDR][7:0]).
166 DS_READ_U16_D16 Unsigned short read with masked return to lower word.
RETURN_DATA[15:0] = MEM[ADDR][15:0].
167 DS_READ_U16_D16_HI Unsigned short read with masked return to upper word.
RETURN_DATA[31:16] = MEM[ADDR][15:0].
Forward permute. This does not access LDS memory and may be
called even if no LDS memory is allocated to the wave. It uses
LDS hardware to implement an arbitrary swizzle across threads in
a wavefront.
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xF, OFFSET = 0
VGPR[VDST] := { B, D, 0, C }
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xA, OFFSET = 0
VGPR[VDST] := { -, D, -, 0 }
Backward permute. This does not access LDS memory and may be
called even if no LDS memory is allocated to the wave. It uses
LDS hardware to implement an arbitrary swizzle across threads in
a wavefront.
Note that EXEC mask is applied to both VGPR read and write. If
src_lane selects a disabled thread, zero will be returned.
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xF, OFFSET = 0
VGPR[VDST] := { A, A, D, B }
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xA, OFFSET = 0
VGPR[VDST] := { -, 0, -, B }
Swizzles input thread data based on an offset mask and returns it; note that this
opcode does not read or write the DS memory banks.
This opcode supports two specific modes, FFT and rotate, plus two basic modes which swizzle
in groups of 4 or 32 consecutive threads.
The FFT mode (offset >= 0xe000) swizzles the input based on offset[4:0] to support FFT
calculation. Example swizzles using input {1, 2, … 20} are:
The rotate mode (offset >= 0xc000 and offset < 0xe000) rotates the input either left (offset[10]
== 0) or right (offset[10] == 1) a number of threads equal to offset[9:5]. The rotate mode also
uses a mask value which can alter the rotate result. For example, mask == 1 will swap the odd
threads across every other even thread (rotate left), or even threads across every other odd
thread (rotate right).
If offset < 0xc000, one of the basic swizzle modes is used based on offset[15]. If offset[15] == 1,
groups of 4 consecutive threads are swizzled together. If offset[15] == 0, all 32 threads are
swizzled together.
The first basic swizzle mode (when offset[15] == 1) allows full data sharing between a group of 4
consecutive threads. Any thread within the group of 4 can get data from any other thread within
the group of 4, specified by the corresponding offset bits --- [1:0] for the first thread, [3:2] for the
second thread, [5:4] for the third thread, [7:6] for the fourth thread. Note that the offset bits apply
to all groups of 4 within a wavefront; thus if offset[1:0] == 1, then thread0 will grab thread1,
thread4 will grab thread5, etc.
The second basic swizzle mode (when offset[15] == 0) allows limited data sharing between 32
consecutive threads. In this case, the offset is used to specify a 5-bit xor-mask, 5-bit or-mask,
and 5-bit and-mask used to generate a thread mapping. Note that the offset bits apply to each
group of 32 within a wavefront. The details of the thread mapping are listed below. Some
example usages:
Pseudocode follows:
offset = offset1:offset0;
} else if (offset >= 0xc000) { // rotate
    rotate = offset[9:5];
    mask = offset[4:0];
    if (offset[10]) {
        rotate = -rotate;
    }
    for (i = 0; i < 64; i++) {
        j = (i & mask) | ((i + rotate) & ~mask);
        j |= i & 0x20;
        thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
    }
} else if (offset[15]) { // full data sharing within 4 consecutive threads
    for (i = 0; i < 64; i += 4) {
        thread_out[i+0] = thread_valid[i+offset[1:0]] ? thread_in[i+offset[1:0]] : 0;
        thread_out[i+1] = thread_valid[i+offset[3:2]] ? thread_in[i+offset[3:2]] : 0;
        thread_out[i+2] = thread_valid[i+offset[5:4]] ? thread_in[i+offset[5:4]] : 0;
        thread_out[i+3] = thread_valid[i+offset[7:6]] ? thread_in[i+offset[7:6]] : 0;
    }
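The two basic swizzle modes described above can be modeled in Python (a sketch assuming every lane is valid; invalid source lanes would return 0, and the bitmask field layout used here --- and = offset[4:0], or = offset[9:5], xor = offset[14:10] --- is an assumption, since the text only names the three 5-bit masks):

```python
def swizzle_quad(values, offset):
    # offset[15] == 1: lane i reads lane (i & ~3) + offset[2k+1:2k],
    # where k = i % 4, so the same 2-bit selects apply to every quad.
    return [values[(i & ~3) + ((offset >> (2 * (i % 4))) & 3)]
            for i in range(len(values))]

def swizzle_bitmask(values, offset):
    # offset[15] == 0: source lane j = ((i & and_m) | or_m) ^ xor_m
    # within each group of 32 lanes.
    and_m = offset & 0x1F
    or_m = (offset >> 5) & 0x1F
    xor_m = (offset >> 10) & 0x1F
    return [values[((i & and_m) | or_m) ^ xor_m] for i in range(len(values))]
```

For example, offset[1:0] == offset[3:2] == offset[5:4] == offset[7:6] == 1 broadcasts element 1 of each quad, and an xor-mask of 1 swaps adjacent lanes.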
• DS_GWS_SEMA_RELEASE_ALL
• DS_GWS_INIT
• DS_GWS_SEMA_V
• DS_GWS_SEMA_BR
• DS_GWS_SEMA_P
• DS_GWS_BARRIER
• DS_ORDERED_COUNT
where:
48 BUFFER_ATOMIC_SWAP // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
49 BUFFER_ATOMIC_CMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
50 BUFFER_ATOMIC_ADD // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
51 BUFFER_ATOMIC_SUB // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
52 BUFFER_ATOMIC_CSUB // 32bit
old_value = MEM[ADDR];
if old_value < DATA then
new_value = 0;
else
new_value = old_value - DATA;
endif;
MEM[addr] = new_value;
RETURN_DATA = old_value.
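CSUB's clamp-at-zero behavior (unlike the wrapping SUB above) can be stated compactly:

```python
def buffer_atomic_csub(mem, data):
    # Subtract with clamp: the stored value floors at 0 instead of
    # wrapping below zero; the pre-op value is returned, as with the
    # other atomics.
    new = 0 if mem < data else mem - data
    return new, mem
```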
53 BUFFER_ATOMIC_SMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
54 BUFFER_ATOMIC_UMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned
compare
RETURN_DATA = tmp.
55 BUFFER_ATOMIC_SMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
56 BUFFER_ATOMIC_UMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned
compare
RETURN_DATA = tmp.
57 BUFFER_ATOMIC_AND // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
58 BUFFER_ATOMIC_OR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
59 BUFFER_ATOMIC_XOR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
60 BUFFER_ATOMIC_INC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA = tmp.
61 BUFFER_ATOMIC_DEC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1;
// unsigned compare
RETURN_DATA = tmp.
62 BUFFER_ATOMIC_FCMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
63 BUFFER_ATOMIC_FMIN // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
64 BUFFER_ATOMIC_FMAX // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
80 BUFFER_ATOMIC_SWAP_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA[0:1];
RETURN_DATA[0:1] = tmp.
81 BUFFER_ATOMIC_CMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0:1];
cmp = DATA[2:3];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0:1] = tmp.
82 BUFFER_ATOMIC_ADD_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA[0:1];
RETURN_DATA[0:1] = tmp.
83 BUFFER_ATOMIC_SUB_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA[0:1];
RETURN_DATA[0:1] = tmp.
85 BUFFER_ATOMIC_SMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; //
signed compare
RETURN_DATA[0:1] = tmp.
86 BUFFER_ATOMIC_UMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
87 BUFFER_ATOMIC_SMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; //
signed compare
RETURN_DATA[0:1] = tmp.
88 BUFFER_ATOMIC_UMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
89 BUFFER_ATOMIC_AND_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA[0:1];
RETURN_DATA[0:1] = tmp.
90 BUFFER_ATOMIC_OR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA[0:1];
RETURN_DATA[0:1] = tmp.
91 BUFFER_ATOMIC_XOR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA[0:1];
RETURN_DATA[0:1] = tmp.
92 BUFFER_ATOMIC_INC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA[0:1]) ? 0 : tmp + 1; //
unsigned compare
RETURN_DATA[0:1] = tmp.
93 BUFFER_ATOMIC_DEC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA[0:1]) ? DATA[0:1] :
tmp - 1; // unsigned compare
RETURN_DATA[0:1] = tmp.
94 BUFFER_ATOMIC_FCMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
95 BUFFER_ATOMIC_FMIN_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
96 BUFFER_ATOMIC_FMAX_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
113 BUFFER_GL0_INV Write back and invalidate the shader L0. Returns ACK to
shader.
114 BUFFER_GL1_INV Invalidate the GL1 cache only. Returns ACK to shader.
where:
14 IMAGE_GET_RESINFO Return resource info for a given mip level specified in the
address vgpr. No sampler. Returns 4 integer values into VGPRs
3-0: {num_mip_levels, depth, height, width}.
15 IMAGE_ATOMIC_SWAP // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
16 IMAGE_ATOMIC_CMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
17 IMAGE_ATOMIC_ADD // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
18 IMAGE_ATOMIC_SUB // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
20 IMAGE_ATOMIC_SMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
21 IMAGE_ATOMIC_UMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
22 IMAGE_ATOMIC_SMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
23 IMAGE_ATOMIC_UMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
24 IMAGE_ATOMIC_AND // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
25 IMAGE_ATOMIC_OR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
26 IMAGE_ATOMIC_XOR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
27 IMAGE_ATOMIC_INC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA = tmp.
28 IMAGE_ATOMIC_DEC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1; //
unsigned compare
RETURN_DATA = tmp.
29 IMAGE_ATOMIC_FCMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
30 IMAGE_ATOMIC_FMIN // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
31 IMAGE_ATOMIC_FMAX // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
35 IMAGE_SAMPLE_D_CL sample texture map, with LOD clamp specified in shader, with
user derivatives.
38 IMAGE_SAMPLE_B_CL sample texture map, with LOD clamp specified in shader, with
lod bias.
46 IMAGE_SAMPLE_C_B_CL SAMPLE_C, with LOD clamp specified in shader, with lod bias.
54 IMAGE_SAMPLE_B_CL_O SAMPLE_O, with LOD clamp specified in shader, with lod bias.
62 IMAGE_SAMPLE_C_B_CL_O SAMPLE_C_O, with LOD clamp specified in shader, with lod bias.
65 IMAGE_GATHER4_CL gather 4 single component elements (2x2) with user LOD clamp.
70 IMAGE_GATHER4_B_CL gather 4 single component elements (2x2) with user bias and
clamp.
73 IMAGE_GATHER4_C_CL gather 4 single component elements (2x2) with user LOD clamp
and PCF.
76 IMAGE_GATHER4_C_L gather 4 single component elements (2x2) with user LOD and
PCF.
77 IMAGE_GATHER4_C_B gather 4 single component elements (2x2) with user bias and
PCF.
78 IMAGE_GATHER4_C_B_CL gather 4 single component elements (2x2) with user bias, clamp
and PCF.
where:
48 FLAT_ATOMIC_SWAP // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
49 FLAT_ATOMIC_CMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
50 FLAT_ATOMIC_ADD // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
51 FLAT_ATOMIC_SUB // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
53 FLAT_ATOMIC_SMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
54 FLAT_ATOMIC_UMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
55 FLAT_ATOMIC_SMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
56 FLAT_ATOMIC_UMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
57 FLAT_ATOMIC_AND // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
58 FLAT_ATOMIC_OR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
59 FLAT_ATOMIC_XOR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
60 FLAT_ATOMIC_INC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA = tmp.
61 FLAT_ATOMIC_DEC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1; //
unsigned compare
RETURN_DATA = tmp.
62 FLAT_ATOMIC_FCMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
63 FLAT_ATOMIC_FMIN // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
64 FLAT_ATOMIC_FMAX // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
80 FLAT_ATOMIC_SWAP_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA[0:1];
RETURN_DATA[0:1] = tmp.
81 FLAT_ATOMIC_CMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0:1];
cmp = DATA[2:3];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0:1] = tmp.
82 FLAT_ATOMIC_ADD_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA[0:1];
RETURN_DATA[0:1] = tmp.
83 FLAT_ATOMIC_SUB_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA[0:1];
RETURN_DATA[0:1] = tmp.
85 FLAT_ATOMIC_SMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
86 FLAT_ATOMIC_UMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
87 FLAT_ATOMIC_SMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
88 FLAT_ATOMIC_UMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
89 FLAT_ATOMIC_AND_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA[0:1];
RETURN_DATA[0:1] = tmp.
90 FLAT_ATOMIC_OR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA[0:1];
RETURN_DATA[0:1] = tmp.
91 FLAT_ATOMIC_XOR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA[0:1];
RETURN_DATA[0:1] = tmp.
92 FLAT_ATOMIC_INC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA[0:1]) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA[0:1] = tmp.
93 FLAT_ATOMIC_DEC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA[0:1]) ? DATA[0:1] : tmp
- 1; // unsigned compare
RETURN_DATA[0:1] = tmp.
94 FLAT_ATOMIC_FCMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
95 FLAT_ATOMIC_FMIN_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
96 FLAT_ATOMIC_FMAX_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
48 GLOBAL_ATOMIC_SWAP // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA;
RETURN_DATA = tmp.
49 GLOBAL_ATOMIC_CMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
50 GLOBAL_ATOMIC_ADD // 32bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA;
RETURN_DATA = tmp.
51 GLOBAL_ATOMIC_SUB // 32bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA;
RETURN_DATA = tmp.
52 GLOBAL_ATOMIC_CSUB // 32bit
old_value = MEM[ADDR];
if old_value < DATA then
new_value = 0;
else
new_value = old_value - DATA;
endif;
MEM[addr] = new_value;
RETURN_DATA = old_value.
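GLOBAL_ATOMIC_CSUB differs from the plain SUB atomic: the subtraction clamps at zero instead of wrapping around. The memory-update step can be sketched as (helper name is illustrative):

```c
#include <stdint.h>

/* Memory update of GLOBAL_ATOMIC_CSUB: unsigned subtract with a
   clamp at zero rather than modular wraparound. */
static uint32_t atomic_csub_update(uint32_t old_value, uint32_t data) {
    return (old_value < data) ? 0u : old_value - data;
}
```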
53 GLOBAL_ATOMIC_SMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
54 GLOBAL_ATOMIC_UMIN // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA < tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
55 GLOBAL_ATOMIC_SMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // signed compare
RETURN_DATA = tmp.
56 GLOBAL_ATOMIC_UMAX // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA > tmp) ? DATA : tmp; // unsigned compare
RETURN_DATA = tmp.
57 GLOBAL_ATOMIC_AND // 32bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA;
RETURN_DATA = tmp.
58 GLOBAL_ATOMIC_OR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA;
RETURN_DATA = tmp.
59 GLOBAL_ATOMIC_XOR // 32bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA;
RETURN_DATA = tmp.
60 GLOBAL_ATOMIC_INC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA = tmp.
61 GLOBAL_ATOMIC_DEC // 32bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA) ? DATA : tmp - 1; //
unsigned compare
RETURN_DATA = tmp.
62 GLOBAL_ATOMIC_FCMPSWAP // 32bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
63 GLOBAL_ATOMIC_FMIN // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
64 GLOBAL_ATOMIC_FMAX // 32bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
80 GLOBAL_ATOMIC_SWAP_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = DATA[0:1];
RETURN_DATA[0:1] = tmp.
81 GLOBAL_ATOMIC_CMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0:1];
cmp = DATA[2:3];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0:1] = tmp.
82 GLOBAL_ATOMIC_ADD_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] += DATA[0:1];
RETURN_DATA[0:1] = tmp.
83 GLOBAL_ATOMIC_SUB_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] -= DATA[0:1];
RETURN_DATA[0:1] = tmp.
85 GLOBAL_ATOMIC_SMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
86 GLOBAL_ATOMIC_UMIN_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] < tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
87 GLOBAL_ATOMIC_SMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; // signed
compare
RETURN_DATA[0:1] = tmp.
88 GLOBAL_ATOMIC_UMAX_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (DATA[0:1] > tmp) ? DATA[0:1] : tmp; //
unsigned compare
RETURN_DATA[0:1] = tmp.
89 GLOBAL_ATOMIC_AND_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] &= DATA[0:1];
RETURN_DATA[0:1] = tmp.
90 GLOBAL_ATOMIC_OR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] |= DATA[0:1];
RETURN_DATA[0:1] = tmp.
91 GLOBAL_ATOMIC_XOR_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] ^= DATA[0:1];
RETURN_DATA[0:1] = tmp.
92 GLOBAL_ATOMIC_INC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp >= DATA[0:1]) ? 0 : tmp + 1; // unsigned
compare
RETURN_DATA[0:1] = tmp.
93 GLOBAL_ATOMIC_DEC_X2 // 64bit
tmp = MEM[ADDR];
MEM[ADDR] = (tmp == 0 || tmp > DATA[0:1]) ? DATA[0:1] : tmp
- 1; // unsigned compare
RETURN_DATA[0:1] = tmp.
94 GLOBAL_ATOMIC_FCMPSWAP_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
cmp = DATA[1];
MEM[ADDR] = (tmp == cmp) ? src : tmp;
RETURN_DATA[0] = tmp.
95 GLOBAL_ATOMIC_FMIN_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src < tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
96 GLOBAL_ATOMIC_FMAX_X2 // 64bit
tmp = MEM[ADDR];
src = DATA[0];
MEM[ADDR] = (src > tmp) ? src : tmp;
RETURN_DATA[0] = tmp.
12.19.1. DPP
The following instructions cannot use DPP:
• V_FMAMK_F32
• V_FMAAK_F32
• V_FMAMK_F16
• V_FMAAK_F16
• V_READFIRSTLANE_B32
• V_CVT_I32_F64
• V_CVT_F64_I32
• V_CVT_F32_F64
• V_CVT_F64_F32
• V_CVT_U32_F64
• V_CVT_F64_U32
• V_TRUNC_F64
• V_CEIL_F64
• V_RNDNE_F64
• V_FLOOR_F64
• V_RCP_F64
• V_RSQ_F64
• V_SQRT_F64
• V_FREXP_EXP_I32_F64
• V_FREXP_MANT_F64
• V_FRACT_F64
• V_CLREXCP
• V_SWAP_B32
• V_CMP_CLASS_F64
• V_CMPX_CLASS_F64
• V_CMP_*_F64
• V_CMPX_*_F64
• V_CMP_*_I64
• V_CMP_*_U64
• V_CMPX_*_I64
• V_CMPX_*_U64
12.19.2. SDWA
The following instructions cannot use SDWA:
• V_FMAC_F32
• V_FMAMK_F32
• V_FMAAK_F32
• V_FMAC_F16
• V_FMAMK_F16
• V_FMAAK_F16
• V_READFIRSTLANE_B32
• V_CLREXCP
• V_SWAP_B32
Endian Order - The RDNA architecture addresses memory and registers using little-endian byte-
ordering and bit-ordering. Multi-byte values are stored with their least-significant (low-order) byte
(LSB) at the lowest byte address, and they are illustrated with their LSB at the right side. Byte
values are stored with their least-significant (low-order) bit (lsb) at the lowest bit address, and
they are illustrated with their lsb at the right side.
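The byte-ordering rule can be sketched in C: byte address 0 of a multi-byte value holds its least-significant byte (`byte_at` is an illustrative helper, not part of any AMD toolchain):

```c
#include <stdint.h>

/* Returns the byte stored at byte address `i` of a 32-bit value
   under little-endian ordering: address 0 holds the LSB. */
static uint8_t byte_at(uint32_t value, int i) {
    return (uint8_t)(value >> (8 * i));
}
```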
The table below summarizes the microcode formats and their widths in bits. The sections that follow
provide details.
SOP2 SOP2 32
SOP1 SOP1 32
SOPK SOPK 32
SOPP SOPP 32
SOPC SOPC 32
SMEM SMEM 64
VOP1 VOP1 32
VOP2 VOP2 32
VOPC VOPC 32
VOP3A VOP3A 64
VOP3B VOP3B 64
VOP3P VOP3P 64
DPP DPP 32
SDWA VOP2 32
VINTRP VINTRP 32
LDS/GDS Format
DS DS 64
MTBUF MTBUF 64
MUBUF MUBUF 64
MIMG MIMG 64
Export Format
EXP EXP 64
Flat Formats
FLAT FLAT 64
GLOBAL GLOBAL 64
SCRATCH SCRATCH 64
The field-definition tables that accompany the descriptions in the sections below use the
following notation.
The default value of all fields is zero. Any bitfield not identified is assumed to be reserved.
Instruction Suffixes
Most instructions include a suffix which indicates the data type the instruction handles. This
suffix may also include a number which indicates the size of the data.
For example: "F32" indicates "32-bit floating point data", and "B16" indicates "16-bit binary data".
• B = binary
• F = floating point
• U = unsigned integer
• S = signed integer
When more than one data-type specifier occurs in an instruction, the first one is the result type
and size, and the later one(s) give the input data type(s) and size. For example, V_CVT_F32_U32
reads a U32 input and produces an F32 result.
13.1.1. SOP2
Scalar format with Two inputs, one output
Format SOP2
Description This is a scalar instruction with two inputs and one output. Can be followed
by a 32-bit literal constant.
ENCODING [31:30] 10
0 S_ADD_U32 28 S_XNOR_B32
1 S_SUB_U32 29 S_XNOR_B64
2 S_ADD_I32 30 S_LSHL_B32
3 S_SUB_I32 31 S_LSHL_B64
4 S_ADDC_U32 32 S_LSHR_B32
5 S_SUBB_U32 33 S_LSHR_B64
6 S_MIN_I32 34 S_ASHR_I32
7 S_MIN_U32 35 S_ASHR_I64
8 S_MAX_I32 36 S_BFM_B32
9 S_MAX_U32 37 S_BFM_B64
10 S_CSELECT_B32 38 S_MUL_I32
11 S_CSELECT_B64 39 S_BFE_U32
14 S_AND_B32 40 S_BFE_I32
15 S_AND_B64 41 S_BFE_U64
16 S_OR_B32 42 S_BFE_I64
17 S_OR_B64 44 S_ABSDIFF_I32
18 S_XOR_B32 46 S_LSHL1_ADD_U32
19 S_XOR_B64 47 S_LSHL2_ADD_U32
20 S_ANDN2_B32 48 S_LSHL3_ADD_U32
21 S_ANDN2_B64 49 S_LSHL4_ADD_U32
22 S_ORN2_B32 50 S_PACK_LL_B32_B16
23 S_ORN2_B64 51 S_PACK_LH_B32_B16
24 S_NAND_B32 52 S_PACK_HH_B32_B16
25 S_NAND_B64 53 S_MUL_HI_U32
26 S_NOR_B32 54 S_MUL_HI_I32
27 S_NOR_B64
13.1.2. SOPK
Format SOPK
Description This is a scalar instruction with one 16-bit signed immediate (SIMM16)
input and a single destination. Instructions which take 2 inputs use the
destination as the second input.
SDST [22:16] Scalar destination, and can provide second source operand.
0 - 105 SGPR0 to SGPR105: Scalar general-purpose registers.
106 VCC_LO: vcc[31:0].
107 VCC_HI: vcc[63:32].
108-123 TTMP0 - TTMP15: Trap handler temporary registers.
124 M0. Memory register 0.
125 NULL
126 EXEC_LO: exec[31:0].
127 EXEC_HI: exec[63:32].
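The SDST encoding table above can be decoded with a small lookup, sketched here (the function name and returned strings are illustrative helpers, not part of the ISA):

```c
/* Decodes the 7-bit SDST field per the table above into a register
   class name. Codes 0-105 select SGPR0-SGPR105; 108-123 select
   TTMP0-TTMP15. */
static const char *sdst_name(unsigned code) {
    if (code <= 105) return "SGPR";
    switch (code) {
    case 106: return "VCC_LO";
    case 107: return "VCC_HI";
    case 124: return "M0";
    case 125: return "NULL";
    case 126: return "EXEC_LO";
    case 127: return "EXEC_HI";
    default:  return (code >= 108 && code <= 123) ? "TTMP" : "reserved";
    }
}
```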
0 S_MOVK_I32 14 S_CMPK_LE_U32
1 S_VERSION 15 S_ADDK_I32
2 S_CMOVK_I32 16 S_MULK_I32
3 S_CMPK_EQ_I32 18 S_GETREG_B32
4 S_CMPK_LG_I32 19 S_SETREG_B32
5 S_CMPK_GT_I32 21 S_SETREG_IMM32_B32
6 S_CMPK_GE_I32 22 S_CALL_B64
7 S_CMPK_LT_I32 23 S_WAITCNT_VSCNT
8 S_CMPK_LE_I32 24 S_WAITCNT_VMCNT
9 S_CMPK_EQ_U32 25 S_WAITCNT_EXPCNT
10 S_CMPK_LG_U32 26 S_WAITCNT_LGKMCNT
11 S_CMPK_GT_U32 27 S_SUBVECTOR_LOOP_BEGIN
12 S_CMPK_GE_U32 28 S_SUBVECTOR_LOOP_END
13 S_CMPK_LT_U32
13.1.3. SOP1
Format SOP1
Description This is a scalar instruction with one input and one output. Can be followed
by a 32-bit literal constant.
3 S_MOV_B32 37 S_OR_SAVEEXEC_B64
4 S_MOV_B64 38 S_XOR_SAVEEXEC_B64
5 S_CMOV_B32 39 S_ANDN2_SAVEEXEC_B64
6 S_CMOV_B64 40 S_ORN2_SAVEEXEC_B64
7 S_NOT_B32 41 S_NAND_SAVEEXEC_B64
8 S_NOT_B64 42 S_NOR_SAVEEXEC_B64
9 S_WQM_B32 43 S_XNOR_SAVEEXEC_B64
10 S_WQM_B64 44 S_QUADMASK_B32
11 S_BREV_B32 45 S_QUADMASK_B64
12 S_BREV_B64 46 S_MOVRELS_B32
13 S_BCNT0_I32_B32 47 S_MOVRELS_B64
14 S_BCNT0_I32_B64 48 S_MOVRELD_B32
15 S_BCNT1_I32_B32 49 S_MOVRELD_B64
16 S_BCNT1_I32_B64 52 S_ABS_I32
17 S_FF0_I32_B32 55 S_ANDN1_SAVEEXEC_B64
18 S_FF0_I32_B64 56 S_ORN1_SAVEEXEC_B64
19 S_FF1_I32_B32 57 S_ANDN1_WREXEC_B64
20 S_FF1_I32_B64 58 S_ANDN2_WREXEC_B64
21 S_FLBIT_I32_B32 59 S_BITREPLICATE_B64_B32
22 S_FLBIT_I32_B64 60 S_AND_SAVEEXEC_B32
23 S_FLBIT_I32 61 S_OR_SAVEEXEC_B32
24 S_FLBIT_I32_I64 62 S_XOR_SAVEEXEC_B32
25 S_SEXT_I32_I8 63 S_ANDN2_SAVEEXEC_B32
26 S_SEXT_I32_I16 64 S_ORN2_SAVEEXEC_B32
27 S_BITSET0_B32 65 S_NAND_SAVEEXEC_B32
28 S_BITSET0_B64 66 S_NOR_SAVEEXEC_B32
29 S_BITSET1_B32 67 S_XNOR_SAVEEXEC_B32
30 S_BITSET1_B64 68 S_ANDN1_SAVEEXEC_B32
31 S_GETPC_B64 69 S_ORN1_SAVEEXEC_B32
32 S_SETPC_B64 70 S_ANDN1_WREXEC_B32
33 S_SWAPPC_B64 71 S_ANDN2_WREXEC_B32
34 S_RFE_B64 73 S_MOVRELSD_2_B32
36 S_AND_SAVEEXEC_B64
13.1.4. SOPC
Format SOPC
Description This is a scalar instruction with two inputs which are compared and
produce SCC as a result. Can be followed by a 32-bit literal constant.
0 S_CMP_EQ_I32 9 S_CMP_GE_U32
1 S_CMP_LG_I32 10 S_CMP_LT_U32
2 S_CMP_GT_I32 11 S_CMP_LE_U32
3 S_CMP_GE_I32 12 S_BITCMP0_B32
4 S_CMP_LT_I32 13 S_BITCMP1_B32
5 S_CMP_LE_I32 14 S_BITCMP0_B64
6 S_CMP_EQ_U32 15 S_BITCMP1_B64
7 S_CMP_LG_U32 18 S_CMP_EQ_U64
8 S_CMP_GT_U32 19 S_CMP_LG_U64
13.1.5. SOPP
Format SOPP
Description This is a scalar instruction with one 16-bit signed immediate (SIMM16)
input.
0 S_NOP 18 S_TRAP
1 S_ENDPGM 19 S_ICACHE_INV
2 S_BRANCH 20 S_INCPERFLEVEL
3 S_WAKEUP 21 S_DECPERFLEVEL
4 S_CBRANCH_SCC0 22 S_TTRACEDATA
5 S_CBRANCH_SCC1 23 S_CBRANCH_CDBGSYS
6 S_CBRANCH_VCCZ 24 S_CBRANCH_CDBGUSER
7 S_CBRANCH_VCCNZ 25 S_CBRANCH_CDBGSYS_OR_USER
8 S_CBRANCH_EXECZ 26 S_CBRANCH_CDBGSYS_AND_USER
9 S_CBRANCH_EXECNZ 27 S_ENDPGM_SAVED
10 S_BARRIER 30 S_ENDPGM_ORDERED_PS_DONE
11 S_SETKILL 31 S_CODE_END
12 S_WAITCNT 32 S_INST_PREFETCH
13 S_SETHALT 33 S_CLAUSE
14 S_SLEEP 35 S_WAITCNT_DEPCTR
15 S_SETPRIO 36 S_ROUND_MODE
16 S_SENDMSG 37 S_DENORM_MODE
17 S_SENDMSGHALT 40 S_TTRACEDATA_IMM
13.2.1. SMEM
Format SMEM
SBASE [5:0] SGPR-pair which provides base address or SGPR-quad which provides V#.
(LSB of SGPR address is omitted).
SDATA [12:6] SGPR which provides write data or accepts return data.
GLC [16] Globally Coherent. Forces a bypass of the L1 cache or, for atomics, causes
the pre-op value to be returned.
OFFSET [52:32] An immediate signed byte offset. Signed offsets only work with
S_LOAD/STORE.
SOFFSET [63:57] SGPR which supplies an unsigned byte offset. Disabled if set to NULL.
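Effective SMEM address formation can be sketched as the sum of the three components above, assuming (as in earlier GCN/RDNA generations) that the immediate and SGPR offsets are simply added to the base:

```c
#include <stdint.h>

/* Sketch of SMEM address formation: 64-bit base from the SBASE SGPR
   pair, plus the sign-extended 21-bit immediate OFFSET, plus the
   unsigned SOFFSET SGPR (treated as 0 when SOFFSET is NULL). */
static uint64_t smem_address(uint64_t sgpr_base, int32_t imm_offset,
                             uint32_t soffset) {
    return sgpr_base + (int64_t)imm_offset + soffset;
}
```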
0 S_LOAD_DWORD 11 S_BUFFER_LOAD_DWORDX8
1 S_LOAD_DWORDX2 12 S_BUFFER_LOAD_DWORDX16
2 S_LOAD_DWORDX4 31 S_GL1_INV
3 S_LOAD_DWORDX8 32 S_DCACHE_INV
4 S_LOAD_DWORDX16 36 S_MEMTIME
8 S_BUFFER_LOAD_DWORD 37 S_MEMREALTIME
9 S_BUFFER_LOAD_DWORDX2 38 S_ATC_PROBE
10 S_BUFFER_LOAD_DWORDX4 39 S_ATC_PROBE_BUFFER
13.3.1. VOP2
Format VOP2
ENCODING [31] 0
1 V_CNDMASK_B32 29 V_XOR_B32
2 V_DOT2C_F32_F16 30 V_XNOR_B32
3 V_ADD_F32 37 V_ADD_NC_U32
4 V_SUB_F32 38 V_SUB_NC_U32
5 V_SUBREV_F32 39 V_SUBREV_NC_U32
6 V_FMAC_LEGACY_F32 40 V_ADD_CO_CI_U32
7 V_MUL_LEGACY_F32 41 V_SUB_CO_CI_U32
8 V_MUL_F32 42 V_SUBREV_CO_CI_U32
9 V_MUL_I32_I24 43 V_FMAC_F32
10 V_MUL_HI_I32_I24 44 V_FMAMK_F32
11 V_MUL_U32_U24 45 V_FMAAK_F32
12 V_MUL_HI_U32_U24 47 V_CVT_PKRTZ_F16_F32
13 V_DOT4C_I32_I8 50 V_ADD_F16
15 V_MIN_F32 51 V_SUB_F16
16 V_MAX_F32 52 V_SUBREV_F16
17 V_MIN_I32 53 V_MUL_F16
18 V_MAX_I32 54 V_FMAC_F16
19 V_MIN_U32 55 V_FMAMK_F16
20 V_MAX_U32 56 V_FMAAK_F16
22 V_LSHRREV_B32 57 V_MAX_F16
24 V_ASHRREV_I32 58 V_MIN_F16
26 V_LSHLREV_B32 59 V_LDEXP_F16
27 V_AND_B32 60 V_PK_FMAC_F16
28 V_OR_B32
13.3.2. VOP1
Format VOP1
0 V_NOP 53 V_SIN_F32
1 V_MOV_B32 54 V_COS_F32
2 V_READFIRSTLANE_B32 55 V_NOT_B32
3 V_CVT_I32_F64 56 V_BFREV_B32
4 V_CVT_F64_I32 57 V_FFBH_U32
5 V_CVT_F32_I32 58 V_FFBL_B32
6 V_CVT_F32_U32 59 V_FFBH_I32
7 V_CVT_U32_F32 60 V_FREXP_EXP_I32_F64
8 V_CVT_I32_F32 61 V_FREXP_MANT_F64
10 V_CVT_F16_F32 62 V_FRACT_F64
11 V_CVT_F32_F16 63 V_FREXP_EXP_I32_F32
12 V_CVT_RPI_I32_F32 64 V_FREXP_MANT_F32
13 V_CVT_FLR_I32_F32 65 V_CLREXCP
14 V_CVT_OFF_F32_I4 66 V_MOVRELD_B32
15 V_CVT_F32_F64 67 V_MOVRELS_B32
16 V_CVT_F64_F32 68 V_MOVRELSD_B32
17 V_CVT_F32_UBYTE0 72 V_MOVRELSD_2_B32
18 V_CVT_F32_UBYTE1 80 V_CVT_F16_U16
19 V_CVT_F32_UBYTE2 81 V_CVT_F16_I16
20 V_CVT_F32_UBYTE3 82 V_CVT_U16_F16
21 V_CVT_U32_F64 83 V_CVT_I16_F16
22 V_CVT_F64_U32 84 V_RCP_F16
23 V_TRUNC_F64 85 V_SQRT_F16
24 V_CEIL_F64 86 V_RSQ_F16
25 V_RNDNE_F64 87 V_LOG_F16
26 V_FLOOR_F64 88 V_EXP_F16
27 V_PIPEFLUSH 89 V_FREXP_MANT_F16
32 V_FRACT_F32 90 V_FREXP_EXP_I16_F16
33 V_TRUNC_F32 91 V_FLOOR_F16
34 V_CEIL_F32 92 V_CEIL_F16
35 V_RNDNE_F32 93 V_TRUNC_F16
36 V_FLOOR_F32 94 V_RNDNE_F16
37 V_EXP_F32 95 V_FRACT_F16
39 V_LOG_F32 96 V_SIN_F16
42 V_RCP_F32 97 V_COS_F16
43 V_RCP_IFLAG_F32 98 V_SAT_PK_U8_I16
46 V_RSQ_F32 99 V_CVT_NORM_I16_F16
52 V_SQRT_F64
13.3.3. VOPC
Format VOPC
Description Vector instruction taking two inputs and producing a comparison result. Can
be followed by a 32-bit literal constant. Vector comparison operations are
divided into three groups: floating-point compares, integer compares, and CLASS tests.
The final opcode number is determined by adding the base for the opcode family plus the offset
from the compare op. Compare instructions write a result to VCC (for VOPC) or an SGPR (for
VOP3). Additionally, every compare instruction has a variant that writes to the EXEC mask
instead of VCC or SGPR. The destination of the compare result is VCC or EXEC when encoded
using the VOPC format, and can be an arbitrary SGPR when only encoded in the VOP3 format.
Comparison Operations
Floating-point compare ops range from F (offset 0, D.u = 0) through TRU (offset 15, D.u = 1).
Integer compare ops range from F (offset 0, D.u = 0) through TRU (offset 7, D.u = 1).
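A compare instruction writes a per-lane result bit into VCC (or EXEC, or an SGPR in VOP3 form). The following sketch emulates one wave32 VOPC compare, using LT as the example operation; lanes with their EXEC bit clear produce 0 (function name is illustrative):

```c
#include <stdint.h>

/* Wave32 emulation of a VOPC "less than" float compare: each active
   lane's result lands in the matching bit of a VCC-style mask. */
static uint32_t vopc_cmp_lt_f32(const float *src0, const float *src1,
                                uint32_t exec) {
    uint32_t vcc = 0;
    for (int lane = 0; lane < 32; ++lane) {
        if ((exec >> lane) & 1u) {
            if (src0[lane] < src1[lane])
                vcc |= 1u << lane;
        }
    }
    return vcc;
}
```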
13.3.4. VOP3A
Format VOP3A
ABS [10:8] Absolute value of input. [8] = src0, [9] = src1, [10] = src2
OPSEL [14:11] Operand select for 16-bit data. 0 = select low half, 1 = select high half. [11] =
src0, [12] = src1, [13] = src2, [14] = dest.
NEG [63:61] Negate input. [61] = src0, [62] = src1, [63] = src2
773 V_MUL_LO_U16
13.3.5. VOP3B
Format VOP3B
Description Vector ALU format with three operands and a scalar result. This encoding
is used only for a few opcodes.
This encoding allows specifying a unique scalar destination, and is used only for the opcodes
listed below. All other opcodes use VOP3A.
• V_ADD_CO_U32
• V_SUB_CO_U32
• V_SUBREV_CO_U32
• V_ADDC_CO_U32
• V_SUBB_CO_U32
• V_SUBBREV_CO_U32
• V_DIV_SCALE_F32
• V_DIV_SCALE_F64
• V_MAD_U64_U32
• V_MAD_I64_I32
NEG [63:61] Negate input. [61] = src0, [62] = src1, [63] = src2
375 V_MAD_I64_I32
13.3.6. VOP3P
Format VOP3P
Description Vector ALU format taking one, two or three pairs of 16 bit inputs and
producing two 16-bit outputs (packed into 1 dword).
OPSEL [13:11] Select low or high half of the low sources: src0 = [11], src1 = [12], src2 = [13].
OPSEL_HI [14],[60:59] Select low or high half of the high sources: src0 = [14], src1 = [60], src2 = [59].
NEG [63:61] Negate input for low 16-bits of sources. [61] = src0, [62] = src1, [63] = src2
0 V_PK_MAD_I16 15 V_PK_ADD_F16
1 V_PK_MUL_LO_U16 16 V_PK_MUL_F16
2 V_PK_ADD_I16 17 V_PK_MIN_F16
3 V_PK_SUB_I16 18 V_PK_MAX_F16
4 V_PK_LSHLREV_B16 19 V_DOT2_F32_F16
5 V_PK_LSHRREV_B16 20 V_DOT2_I32_I16
6 V_PK_ASHRREV_I16 21 V_DOT2_U32_U16
7 V_PK_MAX_I16 22 V_DOT4_I32_I8
8 V_PK_MIN_I16 23 V_DOT4_U32_U8
9 V_PK_MAD_U16 24 V_DOT8_I32_I4
10 V_PK_ADD_U16 25 V_DOT8_U32_U4
11 V_PK_SUB_U16 32 V_FMA_MIX_F32
12 V_PK_MAX_U16 33 V_FMA_MIXLO_F16
13 V_PK_MIN_U16 34 V_FMA_MIXHI_F16
14 V_PK_FMA_F16
13.3.7. SDWA
Format SDWA
Description Sub-Dword Addressing. This is a second dword which can follow VOP1 or
VOP2 instructions (in place of a literal constant) to control selection of sub-
dword (8-bit and 16-bit) operands. Use of SDWA is indicated by assigning
the SRC0 field to SDWA, and then the actual VGPR used as source-zero is
determined in SDWA instruction word.
DST_U [44:43] Destination format: what to do with the bits in the VGPR that are not selected by
DST_SEL:
0 = pad with zeros
1 = sign-extend upper / zero lower
2 = preserve (don't modify)
3 = reserved
OMOD [47:46] Output modifiers (see VOP3). [46] = low half, [47] = high half
13.3.8. SDWAB
Format SDWAB
Description Sub-Dword Addressing. This is a second dword which can follow VOPC
instructions (in place of a literal constant) to control selection of sub-dword
(8-bit and 16-bit) operands. Use of SDWA is indicated by assigning the
SRC0 field to SDWA, and then the actual VGPR used as source-zero is
determined in SDWA instruction word. This version has a scalar
destination.
13.3.9. DPP16
Format DPP16
Description Data Parallel Primitives over 16 lanes. This is a second dword which can
follow VOP1, VOP2 or VOPC instructions (in place of a literal constant) to
control selection of data from other lanes.
FI [50] Fetch invalid data: 0 = read zero for any inactive lanes; 1 = read VGPRs even
for invalid lanes.
BC [51] Bounds Control: 0 = do not write when source is out of range, 1 = write.
BANK_MASK [59:56] Bank Mask. Applies to the VGPR destination write only; does not affect the
thread mask when fetching source VGPR data.
27==0: lanes[12:15, 28:31, 44:47, 60:63] are disabled
26==0: lanes[8:11, 24:27, 40:43, 56:59] are disabled
25==0: lanes[4:7, 20:23, 36:39, 52:55] are disabled
24==0: lanes[0:3, 16:19, 32:35, 48:51] are disabled
Notice: the term "bank" here is not the same as was used for the VGPR bank.
ROW_MASK [63:60] Row Mask. Applies to the VGPR destination write only; does not affect the
thread mask when fetching source VGPR data.
31==0: lanes[63:48] are disabled (wave 64 only)
30==0: lanes[47:32] are disabled (wave 64 only)
29==0: lanes[31:16] are disabled
28==0: lanes[15:0] are disabled
DPP_ROW_SL* 101-10F Row shift left by 1-15 threads: if ((n&0xf) < (16 - cntl[3:0])) pix[n].srca = pix[n + cntl[3:0]].srca; else use bound_cntl.
DPP_ROW_SR* 111-11F Row shift right by 1-15 threads: if ((n&0xf) >= cntl[3:0]) pix[n].srca = pix[n - cntl[3:0]].srca; else use bound_cntl.
DPP_ROW_RR* 121-12F Row rotate right by 1-15 threads: if ((n&0xf) >= cntl[3:0]) pix[n].srca = pix[n - cntl[3:0]].srca; else pix[n].srca = pix[n + 16 - cntl[3:0]].srca.
13.3.10. DPP8
Format DPP8
Description Data Parallel Primitives over 8 lanes. This is a second dword which can
follow VOP1, VOP2 or VOPC instructions (in place of a literal constant) to
control selection of data from other lanes.
LANE_SEL0 42:40 Which lane to read for 1st output lane per 8-lane group
LANE_SEL1 45:43 Which lane to read for 2nd output lane per 8-lane group
LANE_SEL2 48:46 Which lane to read for 3rd output lane per 8-lane group
LANE_SEL3 51:49 Which lane to read for 4th output lane per 8-lane group
LANE_SEL4 54:52 Which lane to read for 5th output lane per 8-lane group
LANE_SEL5 57:55 Which lane to read for 6th output lane per 8-lane group
LANE_SEL6 60:58 Which lane to read for 7th output lane per 8-lane group
LANE_SEL7 63:61 Which lane to read for 8th output lane per 8-lane group
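The eight 3-bit selectors occupy bits [63:40] of the DPP8 dword. Packing them can be sketched as (helper name is illustrative):

```c
#include <stdint.h>

/* Packs eight 3-bit DPP8 lane selectors into their field positions:
   LANE_SEL0 occupies bits [42:40], ..., LANE_SEL7 bits [63:61]. */
static uint64_t dpp8_pack(const unsigned sel[8]) {
    uint64_t field = 0;
    for (int i = 0; i < 8; ++i)
        field |= (uint64_t)(sel[i] & 0x7u) << (40 + 3 * i);
    return field;
}
```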
13.4.1. VINTRP
Format VINTRP
OP [17:16] Opcode:
0: v_interp_p1_f32 : VDST = P10 * VSRC + P0
1: v_interp_p2_f32: VDST = P20 * VSRC + VDST
2: v_interp_mov_f32: VDST = (P0, P10 or P20 selected by VSRC[1:0])
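The two interpolation passes compose to the usual barycentric evaluation: the attribute at (i, j) is P0 + i*P10 + j*P20, where P10 and P20 are per-primitive attribute deltas. A sketch of the two-pass sequence (function names are illustrative):

```c
/* Pass 1 (v_interp_p1_f32): VDST = P10 * VSRC + P0, with VSRC = i. */
static float interp_p1(float p10, float i, float p0) {
    return p10 * i + p0;
}

/* Pass 2 (v_interp_p2_f32): VDST = P20 * VSRC + VDST, with VSRC = j. */
static float interp_p2(float p20, float j, float acc) {
    return p20 * j + acc;
}
```

A full interpolation is then `interp_p2(p20, j, interp_p1(p10, i, p0))`.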
13.5.1. DS
OFFSET1 [15:8] Second address offset. For some opcodes this is concatenated with OFFSET0.
0 DS_ADD_U32 64 DS_ADD_U64
1 DS_SUB_U32 65 DS_SUB_U64
2 DS_RSUB_U32 66 DS_RSUB_U64
3 DS_INC_U32 67 DS_INC_U64
4 DS_DEC_U32 68 DS_DEC_U64
5 DS_MIN_I32 69 DS_MIN_I64
6 DS_MAX_I32 70 DS_MAX_I64
7 DS_MIN_U32 71 DS_MIN_U64
8 DS_MAX_U32 72 DS_MAX_U64
9 DS_AND_B32 73 DS_AND_B64
10 DS_OR_B32 74 DS_OR_B64
11 DS_XOR_B32 75 DS_XOR_B64
12 DS_MSKOR_B32 76 DS_MSKOR_B64
13 DS_WRITE_B32 77 DS_WRITE_B64
14 DS_WRITE2_B32 78 DS_WRITE2_B64
15 DS_WRITE2ST64_B32 79 DS_WRITE2ST64_B64
16 DS_CMPST_B32 80 DS_CMPST_B64
17 DS_CMPST_F32 81 DS_CMPST_F64
18 DS_MIN_F32 82 DS_MIN_F64
19 DS_MAX_F32 83 DS_MAX_F64
20 DS_NOP 85 DS_ADD_RTN_F32
21 DS_ADD_F32 96 DS_ADD_RTN_U64
24 DS_GWS_SEMA_RELEASE_ALL 97 DS_SUB_RTN_U64
25 DS_GWS_INIT 98 DS_RSUB_RTN_U64
26 DS_GWS_SEMA_V 99 DS_INC_RTN_U64
63 DS_ORDERED_COUNT
MTBUF
typed buffer access (data type is defined by the instruction)
MUBUF
untyped buffer access (data type is defined by the buffer / resource-constant)
13.6.1. MTBUF
Format MTBUF
OFFEN [12] 1 = enable offset VGPR, 0 = use zero for address offset
IDXEN [13] 1 = enable index VGPR, 0 = use zero for address index
GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-
op value to VGPR.
OP [53],[18:16] Opcode. See table below. (Bit 53 combines with bits 18-16 to form the opcode.)
DFMT [25:19] Data format of the data in the memory buffer. See the Buffer Image Format
table in chapter 8 for the encoding.
VADDR [39:32] Address of VGPR to supply first component of address (offset or index). When
both index and offset are used, index is in the first VGPR and offset in the
second.
VDATA [47:40] Address of VGPR to supply first component of write data or receive first
component of read-data.
SLC [54] System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.
0 TBUFFER_LOAD_FORMAT_X 8 TBUFFER_LOAD_FORMAT_D16_X
1 TBUFFER_LOAD_FORMAT_XY 9 TBUFFER_LOAD_FORMAT_D16_XY
2 TBUFFER_LOAD_FORMAT_XYZ 10 TBUFFER_LOAD_FORMAT_D16_XYZ
3 TBUFFER_LOAD_FORMAT_XYZW 11 TBUFFER_LOAD_FORMAT_D16_XYZW
4 TBUFFER_STORE_FORMAT_X 12 TBUFFER_STORE_FORMAT_D16_X
5 TBUFFER_STORE_FORMAT_XY 13 TBUFFER_STORE_FORMAT_D16_XY
6 TBUFFER_STORE_FORMAT_XYZ 14 TBUFFER_STORE_FORMAT_D16_XYZ
7 TBUFFER_STORE_FORMAT_XYZW 15 TBUFFER_STORE_FORMAT_D16_XYZW
13.6.2. MUBUF
Format MUBUF
OFFEN [12] 1 = enable offset VGPR, 0 = use zero for address offset
IDXEN [13] 1 = enable index VGPR, 0 = use zero for address index
GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-
op value to VGPR.
LDS [16] 0 = normal, 1 = transfer data between LDS and memory instead of VGPRs and
memory.
VADDR [39:32] Address of VGPR to supply first component of address (offset or index). When
both index and offset are used, index is in the first VGPR and offset in the
second.
VDATA [47:40] Address of VGPR to supply first component of write data or receive first
component of read-data.
SLC [54] System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.
0 BUFFER_LOAD_FORMAT_X 54 BUFFER_ATOMIC_UMIN
1 BUFFER_LOAD_FORMAT_XY 55 BUFFER_ATOMIC_SMAX
2 BUFFER_LOAD_FORMAT_XYZ 56 BUFFER_ATOMIC_UMAX
3 BUFFER_LOAD_FORMAT_XYZW 57 BUFFER_ATOMIC_AND
4 BUFFER_STORE_FORMAT_X 58 BUFFER_ATOMIC_OR
5 BUFFER_STORE_FORMAT_XY 59 BUFFER_ATOMIC_XOR
6 BUFFER_STORE_FORMAT_XYZ 60 BUFFER_ATOMIC_INC
7 BUFFER_STORE_FORMAT_XYZW 61 BUFFER_ATOMIC_DEC
8 BUFFER_LOAD_UBYTE 62 BUFFER_ATOMIC_FCMPSWAP
9 BUFFER_LOAD_SBYTE 63 BUFFER_ATOMIC_FMIN
10 BUFFER_LOAD_USHORT 64 BUFFER_ATOMIC_FMAX
11 BUFFER_LOAD_SSHORT 80 BUFFER_ATOMIC_SWAP_X2
12 BUFFER_LOAD_DWORD 81 BUFFER_ATOMIC_CMPSWAP_X2
13 BUFFER_LOAD_DWORDX2 82 BUFFER_ATOMIC_ADD_X2
14 BUFFER_LOAD_DWORDX4 83 BUFFER_ATOMIC_SUB_X2
15 BUFFER_LOAD_DWORDX3 85 BUFFER_ATOMIC_SMIN_X2
24 BUFFER_STORE_BYTE 86 BUFFER_ATOMIC_UMIN_X2
25 BUFFER_STORE_BYTE_D16_HI 87 BUFFER_ATOMIC_SMAX_X2
26 BUFFER_STORE_SHORT 88 BUFFER_ATOMIC_UMAX_X2
27 BUFFER_STORE_SHORT_D16_HI 89 BUFFER_ATOMIC_AND_X2
28 BUFFER_STORE_DWORD 90 BUFFER_ATOMIC_OR_X2
29 BUFFER_STORE_DWORDX2 91 BUFFER_ATOMIC_XOR_X2
30 BUFFER_STORE_DWORDX4 92 BUFFER_ATOMIC_INC_X2
31 BUFFER_STORE_DWORDX3 93 BUFFER_ATOMIC_DEC_X2
32 BUFFER_LOAD_UBYTE_D16 94 BUFFER_ATOMIC_FCMPSWAP_X2
33 BUFFER_LOAD_UBYTE_D16_HI 95 BUFFER_ATOMIC_FMIN_X2
34 BUFFER_LOAD_SBYTE_D16 96 BUFFER_ATOMIC_FMAX_X2
53 BUFFER_ATOMIC_SMIN
13.7.1. MIMG
Format MIMG
Memory Image instructions (MIMG format) can be between 2 and 5 dwords. There are two
variations of the instruction:
• Normal, where the address VGPRs are specified in the "ADDR" field, and are a contiguous
set of VGPRs. This is a 2-dword instruction.
• Non-Sequential-Address (NSA), where each address VGPR is specified individually and the
address VGPRs can be scattered. This version uses 1-3 extra dwords to specify the
individual address VGPRs.
NSA [2:1] Non-sequential address. Specifies how many additional instruction dwords
exist (0-3).
DIM [5:3] Dimensionality of the resource constant. Set to bits [3:1] of the resource type
field.
UNRM [12] Force address to be un-normalized. User must set to 1 for Image stores &
atomics.
GLC [13] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-
op value to VGPR.
LWE [17] LOD Warning Enable. When set to 1, a texture fetch may return
"LOD_CLAMPED = 1".
OP [0],[24:18] Opcode. See table below. (Bit 0 combines with bits 24-18 to form the opcode.)
SLC [25] System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.
VADDR [39:32] Address of VGPR to supply first component of address (offset or index). When
both index and offset are used, index is in the first VGPR and offset in the
second.
VDATA [47:40] Address of VGPR to supply first component of write data or receive first
component of read-data.
A16 [62] Address components are 16-bits (instead of the usual 32 bits).
When set, all address components are 16 bits (packed into 2 per dword),
except:
Texel offsets (three 6-bit UINTs packed into one dword)
PCF reference (for "_C" instructions)
Address components are 16b uint for image ops without sampler; 16b float with
sampler.
0 IMAGE_LOAD 53 IMAGE_SAMPLE_B_O
1 IMAGE_LOAD_MIP 54 IMAGE_SAMPLE_B_CL_O
2 IMAGE_LOAD_PCK 55 IMAGE_SAMPLE_LZ_O
3 IMAGE_LOAD_PCK_SGN 56 IMAGE_SAMPLE_C_O
4 IMAGE_LOAD_MIP_PCK 57 IMAGE_SAMPLE_C_CL_O
5 IMAGE_LOAD_MIP_PCK_SGN 58 IMAGE_SAMPLE_C_D_O
8 IMAGE_STORE 59 IMAGE_SAMPLE_C_D_CL_O
9 IMAGE_STORE_MIP 60 IMAGE_SAMPLE_C_L_O
10 IMAGE_STORE_PCK 61 IMAGE_SAMPLE_C_B_O
11 IMAGE_STORE_MIP_PCK 62 IMAGE_SAMPLE_C_B_CL_O
14 IMAGE_GET_RESINFO 63 IMAGE_SAMPLE_C_LZ_O
15 IMAGE_ATOMIC_SWAP 64 IMAGE_GATHER4
16 IMAGE_ATOMIC_CMPSWAP 65 IMAGE_GATHER4_CL
17 IMAGE_ATOMIC_ADD 68 IMAGE_GATHER4_L
18 IMAGE_ATOMIC_SUB 69 IMAGE_GATHER4_B
20 IMAGE_ATOMIC_SMIN 70 IMAGE_GATHER4_B_CL
21 IMAGE_ATOMIC_UMIN 71 IMAGE_GATHER4_LZ
22 IMAGE_ATOMIC_SMAX 72 IMAGE_GATHER4_C
23 IMAGE_ATOMIC_UMAX 73 IMAGE_GATHER4_C_CL
24 IMAGE_ATOMIC_AND 76 IMAGE_GATHER4_C_L
25 IMAGE_ATOMIC_OR 77 IMAGE_GATHER4_C_B
26 IMAGE_ATOMIC_XOR 78 IMAGE_GATHER4_C_B_CL
27 IMAGE_ATOMIC_INC 79 IMAGE_GATHER4_C_LZ
28 IMAGE_ATOMIC_DEC 80 IMAGE_GATHER4_O
29 IMAGE_ATOMIC_FCMPSWAP 81 IMAGE_GATHER4_CL_O
30 IMAGE_ATOMIC_FMIN 84 IMAGE_GATHER4_L_O
31 IMAGE_ATOMIC_FMAX 85 IMAGE_GATHER4_B_O
32 IMAGE_SAMPLE 86 IMAGE_GATHER4_B_CL_O
33 IMAGE_SAMPLE_CL 87 IMAGE_GATHER4_LZ_O
34 IMAGE_SAMPLE_D 88 IMAGE_GATHER4_C_O
35 IMAGE_SAMPLE_D_CL 89 IMAGE_GATHER4_C_CL_O
36 IMAGE_SAMPLE_L 92 IMAGE_GATHER4_C_L_O
37 IMAGE_SAMPLE_B 93 IMAGE_GATHER4_C_B_O
38 IMAGE_SAMPLE_B_CL 94 IMAGE_GATHER4_C_B_CL_O
39 IMAGE_SAMPLE_LZ 95 IMAGE_GATHER4_C_LZ_O
40 IMAGE_SAMPLE_C 96 IMAGE_GET_LOD
41 IMAGE_SAMPLE_C_CL 97 IMAGE_GATHER4H
The microcode format is identical for FLAT, GLOBAL and SCRATCH instructions; only the value
of the SEG (segment) field differs.
13.8.1. FLAT
Format FLAT
LDS [13] 0 = normal, 1 = transfer data between LDS and memory instead of VGPRs and
memory.
GLC [16] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-
op value to VGPR.
SLC [17] System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.
OP [24:18] Opcode. See tables below for FLAT, SCRATCH and GLOBAL opcodes.
ADDR [39:32] VGPR which holds the address or offset. For 64-bit addresses, ADDR holds
the LSBs and ADDR+1 holds the MSBs. For an offset, a single VGPR holds a
32-bit unsigned offset.
For FLAT_*: specifies an address.
For GLOBAL_* and SCRATCH_* when SADDR is NULL or 0x7f: specifies an
address.
For GLOBAL_* and SCRATCH_* when SADDR is not NULL or 0x7f: specifies
an offset.
SADDR [54:48] Scalar SGPR which provides an address or offset (unsigned). Set this field to
NULL or 0x7f to disable use.
Meaning of this field is different for Scratch and Global:
FLAT: Unused
Scratch: use an SGPR for the address instead of a VGPR
Global: use the SGPR to provide a base address and the VGPR provides a 32-
bit byte offset.
VDST [63:56] Destination VGPR for data returned from memory to VGPRs.
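The ADDR/SADDR rules above can be sketched as an effective-address computation. This is an illustrative model, not the hardware's exact datapath; the function and parameter names are hypothetical, and "SADDR disabled" is modeled here by passing `None`.

```python
# Sketch: forming a GLOBAL_* effective address from the ADDR/SADDR fields
# described above. When SADDR is NULL/0x7f (modeled as None), a VGPR pair
# holds the full 64-bit address, LSBs in ADDR and MSBs in ADDR+1.
# Otherwise the SGPR provides the base and one VGPR a 32-bit byte offset.

def global_effective_address(vgpr, vgpr_idx, sgpr_base):
    if sgpr_base is None:
        # SADDR disabled: 64-bit address in a VGPR pair.
        return (vgpr[vgpr_idx + 1] << 32) | vgpr[vgpr_idx]
    # SADDR enabled: SGPR base plus 32-bit unsigned VGPR byte offset.
    return (sgpr_base + vgpr[vgpr_idx]) & 0xFFFFFFFFFFFFFFFF
```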
Opcode # Name Opcode # Name
8 FLAT_LOAD_UBYTE 54 FLAT_ATOMIC_UMIN
9 FLAT_LOAD_SBYTE 55 FLAT_ATOMIC_SMAX
10 FLAT_LOAD_USHORT 56 FLAT_ATOMIC_UMAX
11 FLAT_LOAD_SSHORT 57 FLAT_ATOMIC_AND
12 FLAT_LOAD_DWORD 58 FLAT_ATOMIC_OR
13 FLAT_LOAD_DWORDX2 59 FLAT_ATOMIC_XOR
14 FLAT_LOAD_DWORDX4 60 FLAT_ATOMIC_INC
15 FLAT_LOAD_DWORDX3 61 FLAT_ATOMIC_DEC
24 FLAT_STORE_BYTE 62 FLAT_ATOMIC_FCMPSWAP
25 FLAT_STORE_BYTE_D16_HI 63 FLAT_ATOMIC_FMIN
26 FLAT_STORE_SHORT 64 FLAT_ATOMIC_FMAX
27 FLAT_STORE_SHORT_D16_HI 80 FLAT_ATOMIC_SWAP_X2
28 FLAT_STORE_DWORD 81 FLAT_ATOMIC_CMPSWAP_X2
29 FLAT_STORE_DWORDX2 82 FLAT_ATOMIC_ADD_X2
30 FLAT_STORE_DWORDX4 83 FLAT_ATOMIC_SUB_X2
31 FLAT_STORE_DWORDX3 85 FLAT_ATOMIC_SMIN_X2
32 FLAT_LOAD_UBYTE_D16 86 FLAT_ATOMIC_UMIN_X2
33 FLAT_LOAD_UBYTE_D16_HI 87 FLAT_ATOMIC_SMAX_X2
34 FLAT_LOAD_SBYTE_D16 88 FLAT_ATOMIC_UMAX_X2
35 FLAT_LOAD_SBYTE_D16_HI 89 FLAT_ATOMIC_AND_X2
36 FLAT_LOAD_SHORT_D16 90 FLAT_ATOMIC_OR_X2
37 FLAT_LOAD_SHORT_D16_HI 91 FLAT_ATOMIC_XOR_X2
48 FLAT_ATOMIC_SWAP 92 FLAT_ATOMIC_INC_X2
49 FLAT_ATOMIC_CMPSWAP 93 FLAT_ATOMIC_DEC_X2
50 FLAT_ATOMIC_ADD 94 FLAT_ATOMIC_FCMPSWAP_X2
51 FLAT_ATOMIC_SUB 95 FLAT_ATOMIC_FMIN_X2
53 FLAT_ATOMIC_SMIN 96 FLAT_ATOMIC_FMAX_X2
13.8.2. GLOBAL
Table 104. GLOBAL Opcodes
Opcode # Name Opcode # Name
8 GLOBAL_LOAD_UBYTE 53 GLOBAL_ATOMIC_SMIN
9 GLOBAL_LOAD_SBYTE 54 GLOBAL_ATOMIC_UMIN
10 GLOBAL_LOAD_USHORT 55 GLOBAL_ATOMIC_SMAX
11 GLOBAL_LOAD_SSHORT 56 GLOBAL_ATOMIC_UMAX
12 GLOBAL_LOAD_DWORD 57 GLOBAL_ATOMIC_AND
13 GLOBAL_LOAD_DWORDX2 58 GLOBAL_ATOMIC_OR
14 GLOBAL_LOAD_DWORDX4 59 GLOBAL_ATOMIC_XOR
15 GLOBAL_LOAD_DWORDX3 60 GLOBAL_ATOMIC_INC
22 GLOBAL_LOAD_DWORD_ADDTID 61 GLOBAL_ATOMIC_DEC
23 GLOBAL_STORE_DWORD_ADDTID 62 GLOBAL_ATOMIC_FCMPSWAP
24 GLOBAL_STORE_BYTE 63 GLOBAL_ATOMIC_FMIN
25 GLOBAL_STORE_BYTE_D16_HI 64 GLOBAL_ATOMIC_FMAX
26 GLOBAL_STORE_SHORT 80 GLOBAL_ATOMIC_SWAP_X2
27 GLOBAL_STORE_SHORT_D16_HI 81 GLOBAL_ATOMIC_CMPSWAP_X2
28 GLOBAL_STORE_DWORD 82 GLOBAL_ATOMIC_ADD_X2
29 GLOBAL_STORE_DWORDX2 83 GLOBAL_ATOMIC_SUB_X2
30 GLOBAL_STORE_DWORDX4 85 GLOBAL_ATOMIC_SMIN_X2
31 GLOBAL_STORE_DWORDX3 86 GLOBAL_ATOMIC_UMIN_X2
32 GLOBAL_LOAD_UBYTE_D16 87 GLOBAL_ATOMIC_SMAX_X2
33 GLOBAL_LOAD_UBYTE_D16_HI 88 GLOBAL_ATOMIC_UMAX_X2
34 GLOBAL_LOAD_SBYTE_D16 89 GLOBAL_ATOMIC_AND_X2
35 GLOBAL_LOAD_SBYTE_D16_HI 90 GLOBAL_ATOMIC_OR_X2
36 GLOBAL_LOAD_SHORT_D16 91 GLOBAL_ATOMIC_XOR_X2
37 GLOBAL_LOAD_SHORT_D16_HI 92 GLOBAL_ATOMIC_INC_X2
48 GLOBAL_ATOMIC_SWAP 93 GLOBAL_ATOMIC_DEC_X2
49 GLOBAL_ATOMIC_CMPSWAP 94 GLOBAL_ATOMIC_FCMPSWAP_X2
50 GLOBAL_ATOMIC_ADD 95 GLOBAL_ATOMIC_FMIN_X2
51 GLOBAL_ATOMIC_SUB 96 GLOBAL_ATOMIC_FMAX_X2
52 GLOBAL_ATOMIC_CSUB
13.8.3. SCRATCH
Table 105. SCRATCH Opcodes
Opcode # Name Opcode # Name
8 SCRATCH_LOAD_UBYTE 27 SCRATCH_STORE_SHORT_D16_HI
9 SCRATCH_LOAD_SBYTE 28 SCRATCH_STORE_DWORD
10 SCRATCH_LOAD_USHORT 29 SCRATCH_STORE_DWORDX2
11 SCRATCH_LOAD_SSHORT 30 SCRATCH_STORE_DWORDX4
12 SCRATCH_LOAD_DWORD 31 SCRATCH_STORE_DWORDX3
13 SCRATCH_LOAD_DWORDX2 32 SCRATCH_LOAD_UBYTE_D16
14 SCRATCH_LOAD_DWORDX4 33 SCRATCH_LOAD_UBYTE_D16_HI
15 SCRATCH_LOAD_DWORDX3 34 SCRATCH_LOAD_SBYTE_D16
24 SCRATCH_STORE_BYTE 35 SCRATCH_LOAD_SBYTE_D16_HI
25 SCRATCH_STORE_BYTE_D16_HI 36 SCRATCH_LOAD_SHORT_D16
26 SCRATCH_STORE_SHORT 37 SCRATCH_LOAD_SHORT_D16_HI
13.9.1. EXP
Format EXP
DONE [11] Indicates that this is the last export from the shader. Used only for Position and
Pixel/color data.
VM [12] 1 = the exec mask IS the valid mask for this export. May be sent multiple
times, but must be sent at least once per pixel shader. This bit is only used
for pixel shaders.
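The DONE and VM flags above occupy fixed bit positions in the export instruction word. A minimal sketch, using only the two bit positions listed here ([11] and [12]); the rest of the EXP field layout is omitted and the helper name is hypothetical.

```python
# Sketch: setting the DONE [11] and VM [12] bits of an EXP instruction
# word. Only these two fields are modeled; all other EXP fields are
# left at zero for illustration.

DONE_BIT = 1 << 11   # last export from the shader
VM_BIT   = 1 << 12   # exec mask is the valid mask (pixel shaders only)

def exp_flags(done: bool, vm: bool) -> int:
    word = 0
    if done:
        word |= DONE_BIT
    if vm:
        word |= VM_BIT
    return word
```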