CUBLAS Library
User Guide
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top
of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of
NVIDIA Graphics Processing Units (GPUs).
The cuBLAS Library exposes three sets of APIs:
‣ The cuBLAS API, which is simply called cuBLAS API in this document (starting with CUDA
6.0),
case, the array index of a matrix element in row “i” and column “j” can be computed via the
following macro
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
Here, ld refers to the leading dimension of the matrix, which in the case of column-major
storage is the number of rows of the allocated matrix (even if only a submatrix of it is being
used). For natively written C and C++ code, one would most likely choose 0-based indexing, in
which case the array index of a matrix element in row “i” and column “j” can be computed via
the following macro
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
WARNING: The legacy cuBLAS API is deprecated and will be removed in a future release.
The new cuBLAS library API can be used by including the header file “cublas_v2.h”. It has
the following features that the legacy cuBLAS API does not have:
‣ The handle to the cuBLAS library context is initialized using the cublasCreate() function
and is explicitly passed to every subsequent library function call. This allows the user
to have more control over the library setup when using multiple host threads and multiple
GPUs. It also allows the cuBLAS APIs to be reentrant.
‣ The scalars alpha and beta can be passed by reference on the host or the device, instead
of only being allowed to be passed by value on the host. This change allows library
functions to execute asynchronously using streams even when alpha and beta are generated
by a previous kernel.
‣ When a library routine returns a scalar result, it can be returned by reference on the
host or the device, instead of only being allowed to be returned by value on the host.
This change allows library routines to be called asynchronously when the scalar result is
generated and returned by reference on the device resulting in maximum parallelism.
‣ The error status cublasStatus_t is returned by all cuBLAS library function calls.
This change facilitates debugging and simplifies software development. Note that
cublasStatus was renamed cublasStatus_t to be more consistent with other types in
the cuBLAS library.
‣ The cublasAlloc() and cublasFree() functions have been deprecated. This change
removes these unnecessary wrappers around cudaMalloc() and cudaFree(),
respectively.
The legacy cuBLAS API, explained in more detail in Appendix A, can be used by including
the header file “cublas.h”. Since the legacy API is identical to the previously released cuBLAS
library API, existing applications will work out of the box and automatically use this legacy API
without any source code changes.
The current and the legacy cuBLAS APIs cannot be used simultaneously in a single translation
unit: including both “cublas.h” and “cublas_v2.h” header files will lead to compilation
errors due to incompatible symbol redeclarations.
In general, new applications should not use the legacy cuBLAS API, and existing applications
should convert to the new API if they require sophisticated and optimal stream
parallelism, or if they call cuBLAS routines concurrently from multiple threads.
For the rest of the document, the new cuBLAS Library API will simply be referred to as the
cuBLAS Library API.
As mentioned earlier, the interfaces to the legacy and the cuBLAS library APIs are the header
files “cublas.h” and “cublas_v2.h”, respectively. In addition, applications using the cuBLAS
library need to link against the cuBLAS dynamic library.
Note: The same dynamic library implements both the new and legacy cuBLAS APIs.
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n, int
p, int q, float alpha, float beta){
cublasSscal (handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}
cudaError_t cudaStat;
cublasStatus_t stat;
cublasHandle_t handle;
int i, j;
float* devPtrA;
float* a = 0;
a = (float *)malloc (M * N * sizeof (*a));
if (!a) {
printf ("host memory allocation failed");
return EXIT_FAILURE;
}
for (j = 1; j <= N; j++) {
for (i = 1; i <= M; i++) {
a[IDX2F(i,j,M)] = (float)((i-1) * N + j);
}
}
cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
if (cudaStat != cudaSuccess) {
printf ("device memory allocation failed");
return EXIT_FAILURE;
}
stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("CUBLAS initialization failed\n");
return EXIT_FAILURE;
}
stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("data download failed");
cudaFree (devPtrA);
cublasDestroy(handle);
return EXIT_FAILURE;
}
modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("data upload failed");
cudaFree (devPtrA);
cublasDestroy(handle);
return EXIT_FAILURE;
}
cudaFree (devPtrA);
cublasDestroy(handle);
for (j = 1; j <= N; j++) {
for (i = 1; i <= M; i++) {
printf ("%7.0f", a[IDX2F(i,j,M)]);
}
printf ("\n");
}
free(a);
return EXIT_SUCCESS;
}
//-----------------------------------------------------------
//Example 2. Application Using C and cuBLAS: 0-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n, int
p, int q, float alpha, float beta){
cublasSscal (handle, n-q, &alpha, &m[IDX2C(p,q,ldm)], ldm);
subsequent cuBLAS calls in all threads. This is even more true for the destruction of the handle.
Therefore it is not recommended that multiple threads share the same cuBLAS handle.
‣ provide a separate workspace for each used stream using the cublasSetWorkspace()
function, or
‣ Functions that take alpha and/or beta parameters by reference on the host or the device
as scaling factors, such as gemm.
‣ Functions that return a scalar result on the host or the device, such as amax(), amin(),
asum(), rotg(), rotmg(), dot() and nrm2().
For the functions of the first category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, the scalar parameters alpha and/or beta can be on the
stack or allocated on the heap, but should not be placed in managed memory. Underneath, the
CUDA kernels related to those functions will be launched with the values of alpha and/or
beta. Therefore, if they were allocated on the heap, they can be freed just after the return
of the call, even though the kernel launch is asynchronous. When the pointer mode is set to
CUBLAS_POINTER_MODE_DEVICE, alpha and/or beta must be accessible on the device and
their values should not be modified until the kernel is done. Note that since cudaFree() does
an implicit cudaDeviceSynchronize(), cudaFree() can still be called on alpha and/or beta
just after the call, but doing so would defeat the purpose of using this pointer mode.
For the functions of the second category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, these functions block the CPU, until the GPU has completed its
computation and the results have been copied back to the Host. When the pointer mode is set
to CUBLAS_POINTER_MODE_DEVICE, these functions return immediately. In this case, similar to
matrix and vector results, the scalar result is ready only when execution of the routine on the
GPU has completed. This requires proper synchronization in order to read the result from the
host.
In either case, the pointer mode CUBLAS_POINTER_MODE_DEVICE allows the library functions
to execute completely asynchronously from the Host even when alpha and/or beta are
generated by a previous kernel. For example, this situation can arise when iterative methods
for solution of linear systems and eigenvalue problems are implemented using the cuBLAS
library.
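The device pointer mode described above can be sketched as follows. This is an illustrative sketch only (it assumes a CUDA-capable device is present and abbreviates error handling); it keeps the scalar result of a dot product in device memory so the call does not block the host:

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main(void) {
    const int n = 1024;
    float *x, *y, *d_result;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMalloc((void**)&d_result, sizeof(float));
    /* ... fill x and y, e.g. with cublasSetVector() or a kernel ... */

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    /* Returns immediately; d_result is valid once the stream has completed. */
    cublasSdot(handle, n, x, 1, y, 1, d_result);

    /* Synchronous cudaMemcpy provides the needed synchronization before
       reading the scalar back on the host. */
    float result;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f\n", result);

    cublasDestroy(handle);
    cudaFree(x); cudaFree(y); cudaFree(d_result);
    return 0;
}
```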
matrix multiplication performs n^3 operations for n^2 input size, while 1024 small
Whereas to compile against the static cuBLAS library, the following command must be used:
It is also possible to use the native Host C++ compiler. Depending on the Host operating
system, some additional libraries like pthread or dl might be needed on the linking line. The
following command on Linux is suggested:
Note that in the latter case, the library cuda is not needed. The CUDA Runtime will try to
explicitly open the cuda library if needed. In the case of a system which does not have the CUDA
driver installed, this allows the application to gracefully manage this issue and potentially run
if a CPU-only path is available.
Starting with release 11.2, using the typed functions instead of the extension functions
(cublas**Ex()) helps in reducing the binary size when linking to the static cuBLAS Library.
‣ m % 8 == 0
‣ k % 8 == 0
‣ intptr_t(A) % 16 == 0
‣ intptr_t(B) % 16 == 0
‣ intptr_t(C) % 16 == 0
‣ intptr_t(A+lda) % 16 == 0
‣ intptr_t(B+ldb) % 16 == 0
‣ intptr_t(C+ldc) % 16 == 0
‣ In the case of pointer modes with device pointers - coefficient value is accessed using the
device pointer at the time of graph execution.
NOTE: Every time cuBLAS routines are captured in a new CUDA Graph, cuBLAS will allocate
workspace memory on the device. This memory is only freed when the cuBLAS handle used
during capture is deleted. To avoid this, use the cublasSetWorkspace() function to provide
user-owned workspace memory.
2.2.2. cublasStatus_t
The type is used for function status returns. All cuBLAS library functions return their status,
which can have the following values.
Value Meaning
CUBLAS_STATUS_SUCCESS The operation completed successfully.
CUBLAS_STATUS_NOT_INITIALIZED The cuBLAS library was not initialized. This
is usually caused by the lack of a prior
cublasCreate() call, an error in the CUDA
Runtime API called by the cuBLAS routine, or an
error in the hardware setup.
To correct: call cublasCreate() prior to the
function call; and check that the hardware, an
appropriate version of the driver, and the cuBLAS
library are correctly installed.
CUBLAS_STATUS_LICENSE_ERROR The functionality requested requires some
license and an error was detected when trying to
check the current licensing. This error can happen
if the license is not present or is expired or if the
environment variable NVIDIA_LICENSE_FILE is
not set properly.
2.2.3. cublasOperation_t
The cublasOperation_t type indicates which operation needs to be performed with the
dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-transpose), ‘T’ or
‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to
legacy BLAS implementations.
Value Meaning
CUBLAS_OP_N the non-transpose operation is selected
2.2.4. cublasFillMode_t
The type indicates which part (lower or upper) of the dense matrix was filled and consequently
should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower)
and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.
Value Meaning
CUBLAS_FILL_MODE_LOWER the lower part of the matrix is filled
2.2.5. cublasDiagType_t
The type indicates whether the main diagonal of the dense matrix is unity and consequently
should not be touched or modified by the function. Its values correspond to Fortran characters
‘N’ or ‘n’ (non-unit) and ‘U’ or ‘u’ (unit) that are often used as parameters to legacy BLAS
implementations.
Value Meaning
CUBLAS_DIAG_NON_UNIT the matrix diagonal has non-unit elements
2.2.6. cublasSideMode_t
The type indicates whether the dense matrix is on the left or right side in the matrix equation
solved by a particular function. Its values correspond to Fortran characters ‘L’ or ‘l’ (left)
and ‘R’ or ‘r’ (right) that are often used as parameters to legacy BLAS implementations.
Value Meaning
CUBLAS_SIDE_LEFT the matrix is on the left side in the equation
2.2.7. cublasPointerMode_t
The cublasPointerMode_t type indicates whether the scalar values are passed by reference
on the host or device. It is important to point out that if several scalar values are present in the
function call, all of them must conform to the same single pointer mode. The pointer mode
can be set and retrieved using cublasSetPointerMode() and cublasGetPointerMode()
routines, respectively.
Value Meaning
CUBLAS_POINTER_MODE_HOST the scalars are passed by reference on the host
2.2.8. cublasAtomicsMode_t
The type indicates whether cuBLAS routines which have an alternate implementation
using atomics can be used. The atomics mode can be set and queried using the
cublasSetAtomicsMode() and cublasGetAtomicsMode() routines, respectively.
Value Meaning
CUBLAS_ATOMICS_NOT_ALLOWED the usage of atomics is not allowed
2.2.9. cublasGemmAlgo_t
The cublasGemmAlgo_t type is an enumerant to specify the algorithm for matrix-matrix
multiplication on GPU architectures up to sm_75. On sm_80 and newer GPU architectures, this
enumerant has no effect. cuBLAS has the following algorithm options:
Value Meaning
CUBLAS_GEMM_DEFAULT Apply Heuristics to select the GEMM algorithm
CUBLAS_GEMM_DEFAULT_TENSOR_OP [DEPRECATED] This mode is deprecated and will be removed in
a future release. Apply heuristics to select the
GEMM algorithm, while allowing use of reduced
precision CUBLAS_COMPUTE_32F_FAST_16F
kernels (for backward compatibility).
2.2.10. cublasMath_t
The cublasMath_t enumerated type is used in cublasSetMathMode() to choose compute
precision modes as defined below. Since this setting does not directly control the use of
Tensor Cores, the mode CUBLAS_TENSOR_OP_MATH is being deprecated and will be removed in
a future release.
Value Meaning
CUBLAS_DEFAULT_MATH This is the default and highest-performance mode
that uses compute and intermediate storage
precisions with at least the same number of
mantissa and exponent bits as requested. Tensor
Cores will be used whenever possible.
CUBLAS_TENSOR_OP_MATH [DEPRECATED] This mode is deprecated and will be
removed in a future release. For single precision
routines it uses the
CUBLAS_COMPUTE_32F_FAST_16F compute
type.
2.2.11. cublasComputeType_t
The cublasComputeType_t enumerated type is used in cublasGemmEx and cublasLtMatmul
(including all batched and strided batched variants) to choose compute precision modes as
defined below.
Value Meaning
CUBLAS_COMPUTE_16F This is the default and highest-performance mode
for 16-bit half precision floating point and all
compute and intermediate storage precisions with
at least 16-bit half precision. Tensor Cores will be
used whenever possible.
CUBLAS_COMPUTE_64F_PEDANTIC Uses 64-bit double precision floating-point
arithmetic for all phases of calculations and
also disables algorithmic optimizations such as
Gaussian complexity reduction (3M).
2.3.1. cudaDataType_t
The cudaDataType_t type is an enumerant to specify the data precision. It is used when the
data reference does not carry the type itself (e.g., void *).
For example, it is used in the routine cublasSgemmEx.
Value Meaning
CUDA_R_16F the data type is 16-bit real half precision floating-
point
CUDA_C_64F the data type is 64-bit complex double precision
floating-point
2.3.2. libraryPropertyType_t
The libraryPropertyType_t is used as a parameter to specify which property is requested
when using the routine cublasGetProperty().
Value Meaning
MAJOR_VERSION enumerant to query the major version
2.4.1. cublasCreate()
cublasStatus_t
cublasCreate(cublasHandle_t *handle)
This function initializes the cuBLAS library and creates a handle to an opaque structure
holding the cuBLAS library context. It allocates hardware resources on the host and device
and must be called prior to making any other cuBLAS library calls. The cuBLAS library
context is tied to the current CUDA device. To use the library on multiple devices, one cuBLAS
handle needs to be created for each device. Furthermore, for a given device, multiple cuBLAS
handles with different configurations can be created. Because cublasCreate() allocates
some internal resources and the release of those resources by calling cublasDestroy() will
implicitly call cudaDeviceSynchronize(), it is recommended to minimize the number of
cublasCreate()/cublasDestroy() occurrences. For multi-threaded applications that use
the same device from different threads, the recommended programming model is to create
one cuBLAS handle per thread and use that cuBLAS handle for the entire life of the thread.
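The recommended handle lifecycle can be sketched as follows (an illustrative sketch assuming a CUDA-capable device; one handle is created once, used for all calls, and destroyed once):

```c
#include <stdio.h>
#include "cublas_v2.h"

int main(void) {
    cublasHandle_t handle;
    /* Allocates host and device resources once, for the current CUDA device. */
    cublasStatus_t stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("cuBLAS initialization failed\n");
        return 1;
    }
    /* ... issue all cuBLAS calls for this thread/device with this handle ... */

    /* Implicitly calls cudaDeviceSynchronize(); keep create/destroy pairs rare. */
    cublasDestroy(handle);
    return 0;
}
```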
2.4.2. cublasDestroy()
cublasStatus_t
cublasDestroy(cublasHandle_t handle)
This function releases hardware resources used by the cuBLAS library. This function is
usually the last call with a particular handle to the cuBLAS library. Because cublasCreate()
allocates some internal resources and the release of those resources by calling
cublasDestroy() will implicitly call cublasDeviceSynchronize(), it is recommended to
minimize the number of cublasCreate()/cublasDestroy() occurrences.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the shut down succeeded
2.4.3. cublasGetVersion()
cublasStatus_t
cublasGetVersion(cublasHandle_t handle, int *version)
This function returns the version number of the cuBLAS library.
2.4.4. cublasGetProperty()
cublasStatus_t
cublasGetProperty(libraryPropertyType type, int *value)
This function returns the value of the requested property in memory pointed to by value. Refer
to libraryPropertyType for supported types.
Return Value Meaning
CUBLAS_STATUS_SUCCESS The operation completed successfully
2.4.5. cublasGetStatusName()
const char* cublasGetStatusName(cublasStatus_t status)
This function returns the string representation of a given status.
2.4.6. cublasGetStatusString()
const char* cublasGetStatusString(cublasStatus_t status)
This function returns the description string for a given status.
2.4.7. cublasSetStream()
cublasStatus_t
cublasSetStream(cublasHandle_t handle, cudaStream_t streamId)
This function sets the cuBLAS library stream, which will be used to execute all subsequent
calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use
the default NULL stream. In particular, this routine can be used to change the stream between
kernel launches and then to reset the cuBLAS library stream back to NULL. Additionally this
function unconditionally resets the cuBLAS library workspace back to the default workspace
pool (see cublasSetWorkspace()).
Return Value Meaning
CUBLAS_STATUS_SUCCESS the stream was set successfully
2.4.8. cublasSetWorkspace()
cublasStatus_t
cublasSetWorkspace(cublasHandle_t handle, void *workspace, size_t
workspaceSizeInBytes)
This function sets the cuBLAS library workspace to a user-owned device buffer, which will
be used to execute all subsequent calls to the cuBLAS library functions (on the currently
set stream). If the cuBLAS library workspace is not set, all kernels will use the default
workspace pool allocated during the cuBLAS context creation. In particular, this routine
can be used to change the workspace between kernel launches. The workspace pointer
has to be aligned to at least 256 bytes, otherwise CUBLAS_STATUS_INVALID_VALUE error
is returned. The cublasSetStream() function unconditionally resets the cuBLAS library
workspace back to the default workspace pool. Too small workspaceSizeInBytes may
cause some routines to fail with CUBLAS_STATUS_ALLOC_FAILED error returned or cause
large regressions in performance. Workspace size equal to or larger than 16KiB is enough
to prevent CUBLAS_STATUS_ALLOC_FAILED error, while a larger workspace can provide
performance benefits for some routines. Recommended size of user-provided workspace is at
least 4MiB (to match cuBLAS’ default workspace pool).
Return Value Meaning
CUBLAS_STATUS_SUCCESS the workspace was set successfully
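The sizing and alignment rules above can be sketched as follows. This is an illustrative sketch (function name is hypothetical, and a CUDA device is assumed); it allocates the recommended 4 MiB and hands it to the library:

```c
#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Sketch: give cuBLAS a user-owned 4 MiB workspace for the currently set
   stream. cudaMalloc returns pointers aligned to at least 256 bytes, so the
   alignment requirement is satisfied automatically. */
static cublasStatus_t use_private_workspace(cublasHandle_t handle,
                                            void **workspace) {
    const size_t workspaceSize = 4 * 1024 * 1024; /* 4 MiB, the recommended minimum */
    cudaError_t err = cudaMalloc(workspace, workspaceSize);
    if (err != cudaSuccess) return CUBLAS_STATUS_ALLOC_FAILED;
    /* Remember: a later cublasSetStream() resets the handle back to the
       default workspace pool, so the workspace must be set again after it. */
    return cublasSetWorkspace(handle, *workspace, workspaceSize);
}
```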
2.4.9. cublasGetStream()
cublasStatus_t
cublasGetStream(cublasHandle_t handle, cudaStream_t *streamId)
This function gets the cuBLAS library stream, which is being used to execute all calls to the
cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default
NULL stream.
2.4.10. cublasGetPointerMode()
cublasStatus_t
cublasGetPointerMode(cublasHandle_t handle, cublasPointerMode_t *mode)
This function obtains the pointer mode used by the cuBLAS library. Please see the section on
the cublasPointerMode_t type for more details.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the pointer mode was obtained successfully
2.4.11. cublasSetPointerMode()
cublasStatus_t
cublasSetPointerMode(cublasHandle_t handle, cublasPointerMode_t mode)
This function sets the pointer mode used by the cuBLAS library. The default is for the values to
be passed by reference on the host. Please see the section on the cublasPointerMode_t type
for more details.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the pointer mode was set successfully
2.4.12. cublasSetVector()
cublasStatus_t
cublasSetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)
This function copies n elements from a vector x in host memory space to a vector y in GPU
memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The
storage spacing between consecutive elements is given by incx for the source vector x and by
incy for the destination vector y.
Since column-major format for two-dimensional matrices is assumed, if a vector is part of
a matrix, a vector increment equal to 1 accesses a (partial) column of that matrix. Similarly,
using an increment equal to the leading dimension of the matrix results in accesses to a
(partial) row of that matrix.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.13. cublasGetVector()
cublasStatus_t
cublasGetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)
This function copies n elements from a vector x in GPU memory space to a vector y in host
memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The
storage spacing between consecutive elements is given by incx for the source vector and
incy for the destination vector y.
2.4.14. cublasSetMatrix()
cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)
This function copies a tile of rows x cols elements from a matrix A in host memory space
to a matrix B in GPU memory space. It is assumed that each element requires storage of
elemSize bytes and that both matrices are stored in column-major format, with the leading
dimension of the source matrix A and destination matrix B given in lda and ldb, respectively.
The leading dimension indicates the number of rows of the allocated matrix, even if only a
submatrix of it is being used.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.15. cublasGetMatrix()
cublasStatus_t
cublasGetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)
This function copies a tile of rows x cols elements from a matrix A in GPU memory space
to a matrix B in host memory space. It is assumed that each element requires storage of
elemSize bytes and that both matrices are stored in column-major format, with the leading
dimension of the source matrix A and destination matrix B given in lda and ldb, respectively.
The leading dimension indicates the number of rows of the allocated matrix, even if only a
submatrix of it is being used.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.16. cublasSetVectorAsync()
cublasStatus_t
cublasSetVectorAsync(int n, int elemSize, const void *hostPtr, int incx,
void *devicePtr, int incy, cudaStream_t stream)
This function has the same functionality as cublasSetVector(), with the exception that the
data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream
parameter.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.17. cublasGetVectorAsync()
cublasStatus_t
cublasGetVectorAsync(int n, int elemSize, const void *devicePtr, int incx,
void *hostPtr, int incy, cudaStream_t stream)
This function has the same functionality as cublasGetVector(), with the exception that the
data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream
parameter.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.18. cublasSetMatrixAsync()
cublasStatus_t
cublasSetMatrixAsync(int rows, int cols, int elemSize, const void *A,
int lda, void *B, int ldb, cudaStream_t stream)
This function has the same functionality as cublasSetMatrix(), with the exception that the
data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream
parameter.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.19. cublasGetMatrixAsync()
cublasStatus_t
cublasGetMatrixAsync(int rows, int cols, int elemSize, const void *A,
int lda, void *B, int ldb, cudaStream_t stream)
This function has the same functionality as cublasGetMatrix(), with the exception that the
data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream
parameter.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.4.20. cublasSetAtomicsMode()
cublasStatus_t cublasSetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t mode)
Some routines like cublas<t>symv and cublas<t>hemv have an alternate implementation that
uses atomics to accumulate results. This implementation is generally significantly faster but
can generate results that are not strictly identical from one run to another. Mathematically,
those different results are not significant, but they can be problematic when debugging.
This function allows or disallows the usage of atomics in the cuBLAS library for all routines
which have an alternate implementation. When not explicitly specified in the documentation
of a cuBLAS routine, it means that the routine does not have an alternate implementation
that uses atomics. When atomics mode is disabled, each cuBLAS routine should produce the
same results from one run to another when called with identical parameters on the same
hardware.
The default atomics mode of a default-initialized cublasHandle_t object is
CUBLAS_ATOMICS_NOT_ALLOWED. Please see the section on the cublasAtomicsMode_t type for
more details.
2.4.21. cublasGetAtomicsMode()
cublasStatus_t cublasGetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t
*mode)
2.4.22. cublasSetMathMode()
cublasStatus_t cublasSetMathMode(cublasHandle_t handle, cublasMath_t mode)
2.4.23. cublasGetMathMode()
cublasStatus_t cublasGetMathMode(cublasHandle_t handle, cublasMath_t *mode)
This function returns the math mode used by the library routines.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the math type was returned successfully.
2.4.24. cublasSetSmCountTarget()
cublasStatus_t cublasSetSmCountTarget(cublasHandle_t handle, int smCountTarget)
2.4.25. cublasGetSmCountTarget()
cublasStatus_t cublasGetSmCountTarget(cublasHandle_t handle, int *smCountTarget)
This function obtains the value previously programmed to the library handle.
Return Value Meaning
CUBLAS_STATUS_SUCCESS SM count target was set successfully.
2.4.26. cublasLoggerConfigure()
cublasStatus_t cublasLoggerConfigure(
int logIsOn,
int logToStdOut,
int logToStdErr,
const char* logFileName)
This function configures logging during runtime. Besides this type of configuration, it is
possible to configure logging with special environment variables which will be checked by
libcublas:
‣ CUBLAS_LOGINFO_DBG - Set the environment variable to "1" to turn on logging (by default
logging is off).
2.4.27. cublasGetLoggerCallback()
cublasStatus_t cublasGetLoggerCallback(
cublasLogCallback* userCallback)
This function retrieves the function pointer to the custom user-defined callback function
previously installed via cublasSetLoggerCallback(), or zero otherwise.
Parameters
userCallback
Output. Pointer to user defined callback function.
Returns
CUBLAS_STATUS_SUCCESS
Success.
2.4.28. cublasSetLoggerCallback()
cublasStatus_t cublasSetLoggerCallback(
cublasLogCallback userCallback)
This function installs a custom user-defined callback function via the cuBLAS C public API.
Parameters
userCallback
Input. Pointer to user defined callback function.
Returns
CUBLAS_STATUS_SUCCESS
Success.
When the parameters and returned values of the function differ, which sometimes happens for
complex input, the <t> can also have the following meanings ‘Sc’, ‘Cs’, ‘Dz’ and ‘Zd’.
The abbreviations Re(.) and Im(.) will stand for the real and imaginary part of a number,
respectively. Since the imaginary part of a real number does not exist, we will consider it to be
zero and can usually simply discard it from the equation where it is being used. Also, conj(alpha)
will denote the complex conjugate of alpha.
In general throughout the documentation, the lower case Greek symbols alpha and beta will
denote scalars, lower case English letters in bold type x and y will denote vectors, and capital
English letters A, B and C will denote matrices.
2.5.1. cublasI<t>amax()
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
cublasStatus_t cublasIdamax(cublasHandle_t handle, int n,
const double *x, int incx, int *result)
cublasStatus_t cublasIcamax(cublasHandle_t handle, int n,
const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamax(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx, int *result)
This function finds the (smallest) index of the element of the maximum magnitude.
Hence, the result is the first i such that |Re(x[j])| + |Im(x[j])| is maximum for
i = 1, ..., n and j = 1 + (i - 1) * incx. Notice that the last equation reflects 1-based indexing used
for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.2. cublasI<t>amin()
cublasStatus_t cublasIsamin(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
cublasStatus_t cublasIdamin(cublasHandle_t handle, int n,
const double *x, int incx, int *result)
cublasStatus_t cublasIcamin(cublasHandle_t handle, int n,
const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamin(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx, int *result)
This function finds the (smallest) index of the element of the minimum magnitude.
Hence, the result is the first i such that |Re(x[j])| + |Im(x[j])| is minimum for
i = 1, ..., n and j = 1 + (i - 1) * incx. Notice that the last equation reflects 1-based indexing used
for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.3. cublas<t>asum()
cublasStatus_t cublasSasum(cublasHandle_t handle, int n,
const float *x, int incx, float *result)
cublasStatus_t cublasDasum(cublasHandle_t handle, int n,
const double *x, int incx, double *result)
cublasStatus_t cublasScasum(cublasHandle_t handle, int n,
const cuComplex *x, int incx, float *result)
cublasStatus_t cublasDzasum(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx, double *result)
This function computes the sum of the absolute values of the elements of vector x. Hence, the
result is the sum over i = 1, ..., n of |Re(x[j])| + |Im(x[j])|, where j = 1 + (i - 1) * incx. Notice
that the last equation reflects 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
result host or device output the resulting sum, which is 0.0 if n,incx<=0.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.4. cublas<t>axpy()
cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n,
const float *alpha,
const float *x, int incx,
float *y, int incy)
cublasStatus_t cublasDaxpy(cublasHandle_t handle, int n,
const double *alpha,
const double *x, int incx,
double *y, int incy)
cublasStatus_t cublasCaxpy(cublasHandle_t handle, int n,
const cuComplex *alpha,
const cuComplex *x, int incx,
cuComplex *y, int incy)
cublasStatus_t cublasZaxpy(cublasHandle_t handle, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *y, int incy)
This function multiplies the vector x by the scalar alpha and adds it to the vector y, overwriting
the latter with the result. Hence, the performed operation is y[j] = alpha * x[k] + y[j] for
i = 1, ..., n, k = 1 + (i - 1) * incx and j = 1 + (i - 1) * incy. Notice that the last two equations
reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.5. cublas<t>copy()
cublasStatus_t cublasScopy(cublasHandle_t handle, int n,
const float *x, int incx,
float *y, int incy)
cublasStatus_t cublasDcopy(cublasHandle_t handle, int n,
const double *x, int incx,
double *y, int incy)
cublasStatus_t cublasCcopy(cublasHandle_t handle, int n,
const cuComplex *x, int incx,
cuComplex *y, int incy)
cublasStatus_t cublasZcopy(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *y, int incy)
This function copies the vector x into the vector y. Hence, the performed operation is
y[k] = x[j] for i = 1, …, n, j = 1 + (i − 1) * incx and k = 1 + (i − 1) * incy. Notice that the
last two equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.6. cublas<t>dot()
cublasStatus_t cublasSdot (cublasHandle_t handle, int n,
const float *x, int incx,
const float *y, int incy,
float *result)
cublasStatus_t cublasDdot (cublasHandle_t handle, int n,
const double *x, int incx,
const double *y, int incy,
double *result)
cublasStatus_t cublasCdotu(cublasHandle_t handle, int n,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *result)
cublasStatus_t cublasCdotc(cublasHandle_t handle, int n,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *result)
cublasStatus_t cublasZdotu(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *result)
cublasStatus_t cublasZdotc(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *result)
This function computes the dot product of vectors x and y. Hence, the result is
Σ_{i=1..n} (x[j] × y[k]) where j = 1 + (i − 1) * incx and k = 1 + (i − 1) * incy. Notice that in
the first equation the conjugate of the element of vector x should be used if the function name
ends in character ‘c’ and that the last two equations reflect 1-based indexing used for
compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
result host or device output the resulting dot product, which is 0.0 if n<=0.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.7. cublas<t>nrm2()
cublasStatus_t cublasSnrm2(cublasHandle_t handle, int n,
const float *x, int incx, float *result)
cublasStatus_t cublasDnrm2(cublasHandle_t handle, int n,
const double *x, int incx, double *result)
cublasStatus_t cublasScnrm2(cublasHandle_t handle, int n,
const cuComplex *x, int incx, float *result)
cublasStatus_t cublasDznrm2(cublasHandle_t handle, int n,
const cuDoubleComplex *x, int incx, double *result)
This function computes the Euclidean norm of the vector x. The code uses a multiphase model
of accumulation to avoid intermediate underflow and overflow, with the result being equivalent
to √(Σ_{i=1..n} x[j] × x[j]) where j = 1 + (i − 1) * incx in exact arithmetic. Notice that the
last equation reflects 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
result host or device output the resulting norm, which is 0.0 if n <= 0 or incx <= 0.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.8. cublas<t>rot()
cublasStatus_t cublasSrot(cublasHandle_t handle, int n,
                          float *x, int incx,
                          float *y, int incy,
                          const float *c, const float *s)
This function applies the Givens rotation matrix G = [ c s; −s c ] (i.e., rotation in the x,y
plane counter-clockwise by the angle alpha defined by cos(alpha)=c, sin(alpha)=s)
to vectors x and y.
Hence, the result is x[j] = c × x[j] + s × y[k] and y[k] = −s × x[j] + c × y[k] (using the
pre-rotation value of x[j]) where j = 1 + (i − 1) * incx and k = 1 + (i − 1) * incy. Notice that
the last two equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.9. cublas<t>rotg()
cublasStatus_t cublasSrotg(cublasHandle_t handle,
float *a, float *b,
float *c, float *s)
cublasStatus_t cublasDrotg(cublasHandle_t handle,
double *a, double *b,
double *c, double *s)
cublasStatus_t cublasCrotg(cublasHandle_t handle,
cuComplex *a, cuComplex *b,
float *c, cuComplex *s)
cublasStatus_t cublasZrotg(cublasHandle_t handle,
cuDoubleComplex *a, cuDoubleComplex *b,
double *c, cuDoubleComplex *s)
This function constructs the Givens rotation matrix G = [ c s; −s c ] that zeros out the second
entry of the 2×1 vector (a, b)^T, i.e., G × (a, b)^T = (r, 0)^T. On exit, a is overwritten by r
and b is overwritten by the value z from which c and s can be recovered.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.10. cublas<t>rotm()
cublasStatus_t cublasSrotm(cublasHandle_t handle, int n, float *x, int incx,
float *y, int incy, const float* param)
cublasStatus_t cublasDrotm(cublasHandle_t handle, int n, double *x, int incx,
double *y, int incy, const double* param)
This function applies the modified Givens transformation

H = [ h11 h12; h21 h22 ]

to vectors x and y.
Hence, the result is x[j] = h11 × x[j] + h12 × y[k] and y[k] = h21 × x[j] + h22 × y[k] where
j = 1 + (i − 1) * incx and k = 1 + (i − 1) * incy. Notice that the last two equations reflect 1-based
indexing used for compatibility with Fortran.
The elements h11, h21, h12 and h22 of the matrix H are stored in param[1], param[2], param[3] and param[4],
respectively. The flag=param[0] defines the following predefined values for the matrix
entries

flag=-1.0: H = [ h11 h12; h21  h22 ]
flag= 0.0: H = [ 1.0 h12; h21  1.0 ]
flag= 1.0: H = [ h11 1.0; -1.0 h22 ]
flag=-2.0: H = [ 1.0 0.0; 0.0  1.0 ]

Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
param host or device input <type> vector of 5 elements, where param[0] contains the
flag and param[1-4] contain the matrix H.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.11. cublas<t>rotmg()
cublasStatus_t cublasSrotmg(cublasHandle_t handle, float *d1, float *d2,
float *x1, const float *y1, float *param)
cublasStatus_t cublasDrotmg(cublasHandle_t handle, double *d1, double *d2,
double *x1, const double *y1, double *param)
This function constructs the modified Givens transformation H = [ h11 h12; h21 h22 ] that zeros
out the second entry of the 2×1 vector (sqrt(d1) × x1, sqrt(d2) × y1)^T. The flag=param[0]
defines the predefined values for the matrix entries as described for the rotm() functions.
Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
param host or device output <type> vector of 5 elements, where param[0] contains the
flag and param[1-4] contain the matrix H.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.12. cublas<t>scal()
cublasStatus_t cublasSscal(cublasHandle_t handle, int n,
const float *alpha,
float *x, int incx)
cublasStatus_t cublasDscal(cublasHandle_t handle, int n,
const double *alpha,
double *x, int incx)
cublasStatus_t cublasCscal(cublasHandle_t handle, int n,
const cuComplex *alpha,
cuComplex *x, int incx)
cublasStatus_t cublasCsscal(cublasHandle_t handle, int n,
const float *alpha,
cuComplex *x, int incx)
cublasStatus_t cublasZscal(cublasHandle_t handle, int n,
const cuDoubleComplex *alpha,
cuDoubleComplex *x, int incx)
cublasStatus_t cublasZdscal(cublasHandle_t handle, int n,
const double *alpha,
cuDoubleComplex *x, int incx)
This function scales the vector x by the scalar α and overwrites it with the result. Hence, the
performed operation is x[j] = α × x[j] for i = 1, …, n and j = 1 + (i − 1) * incx. Notice that the
last two equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.5.13. cublas<t>swap()
cublasStatus_t cublasSswap(cublasHandle_t handle, int n, float *x,
int incx, float *y, int incy)
cublasStatus_t cublasDswap(cublasHandle_t handle, int n, double *x,
int incx, double *y, int incy)
cublasStatus_t cublasCswap(cublasHandle_t handle, int n, cuComplex *x,
int incx, cuComplex *y, int incy)
cublasStatus_t cublasZswap(cublasHandle_t handle, int n, cuDoubleComplex *x,
int incx, cuDoubleComplex *y, int incy)
This function interchanges the elements of vectors x and y. Hence, the performed operation is
y[k] <-> x[j] for i = 1, …, n, j = 1 + (i − 1) * incx and k = 1 + (i − 1) * incy. Notice that the last two
equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.1. cublas<t>gbmv()
cublasStatus_t cublasSgbmv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n, int kl, int ku,
const float *alpha,
const float *A, int lda,
const float *x, int incx,
const float *beta,
float *y, int incy)
cublasStatus_t cublasDgbmv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n, int kl, int ku,
const double *alpha,
const double *A, int lda,
const double *x, int incx,
const double *beta,
double *y, int incy)
cublasStatus_t cublasCgbmv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n, int kl, int ku,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
const cuComplex *beta,
cuComplex *y, int incy)
cublasStatus_t cublasZgbmv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n, int kl, int ku,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)
This function performs the banded matrix-vector multiplication

y = α × op(A) × x + β × y

where A is a banded matrix with kl subdiagonals and ku superdiagonals, x and y are vectors,
and α and β are scalars. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C

The banded matrix A is stored column by column, with the main diagonal stored in row ku+1
(starting in first position), the first superdiagonal stored in row ku (starting in second position),
the first subdiagonal stored in row ku+2 (starting in first position), etc. So that in general,
the element A(i,j) is stored in the memory location A(ku+1+i-j,j) for j = 1, …, n and
i in [max(1, j − ku), min(m, j + kl)]. Also, the elements in the array A that do not conceptually
correspond to the elements in the banded matrix (the top left ku × ku and bottom right kl × kl
triangles) are not referenced.
beta host or device input <type> scalar used for multiplication, if beta == 0 then y
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if incx == 0 or incy == 0 or
2.6.2. cublas<t>gemv()
cublasStatus_t cublasSgemv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const float *alpha,
const float *A, int lda,
const float *x, int incx,
const float *beta,
float *y, int incy)
cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const double *alpha,
const double *A, int lda,
const double *x, int incx,
const double *beta,
double *y, int incy)
cublasStatus_t cublasCgemv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
const cuComplex *beta,
cuComplex *y, int incy)
cublasStatus_t cublasZgemv(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)
This function performs the matrix-vector multiplication

y = α × op(A) × x + β × y

where A is an m × n matrix stored in column-major format, x and y are vectors, and α and β
are scalars. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C
beta host or device input <type> scalar used for multiplication, if beta==0 then y
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.3. cublas<t>ger()
cublasStatus_t cublasSger(cublasHandle_t handle, int m, int n,
const float *alpha,
const float *x, int incx,
const float *y, int incy,
float *A, int lda)
cublasStatus_t cublasDger(cublasHandle_t handle, int m, int n,
const double *alpha,
const double *x, int incx,
const double *y, int incy,
double *A, int lda)
cublasStatus_t cublasCgeru(cublasHandle_t handle, int m, int n,
const cuComplex *alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
cublasStatus_t cublasCgerc(cublasHandle_t handle, int m, int n,
const cuComplex *alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
cublasStatus_t cublasZgeru(cublasHandle_t handle, int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
This function performs the rank-1 update

A = α × x × y^T + A  (ger(), geru())
A = α × x × y^H + A  (gerc())

where A is an m × n matrix stored in column-major format, x and y are vectors, and α is a
scalar.
A device in/out <type> array of dimension lda x n with lda >= max(1,m).
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.4. cublas<t>sbmv()
cublasStatus_t cublasSsbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const float *alpha,
const float *A, int lda,
const float *x, int incx,
const float *beta, float *y, int incy)
cublasStatus_t cublasDsbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const double *alpha,
const double *A, int lda,
const double *x, int incx,
const double *beta, double *y, int incy)
This function performs the symmetric banded matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n symmetric banded matrix with k subdiagonals and superdiagonals, x and y
are vectors, and α and β are scalars.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
A device input <type> array of dimension lda x n with lda >= k+1.
beta host or device input <type> scalar used for multiplication, if beta==0 then y does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.5. cublas<t>spmv()
cublasStatus_t cublasSspmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha, const float *AP,
const float *x, int incx, const float *beta,
float *y, int incy)
cublasStatus_t cublasDspmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha, const double *AP,
const double *x, int incx, const double *beta,
double *y, int incy)
This function performs the symmetric packed matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n symmetric matrix stored in packed format, x and y are vectors, and α and β
are scalars.
If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the
symmetric matrix A are packed together column by column without gaps, so that the element
A(i,j) is stored in the memory location AP[i+((2*n-j+1)*j)/2] for j = 1, …, n and i >= j.
Consequently, the packed format requires only n(n+1)/2 elements for storage.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then y does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.6. cublas<t>spr()
cublasStatus_t cublasSspr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha,
const float *x, int incx, float *AP)
cublasStatus_t cublasDspr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const double *x, int incx, double *AP)
This function performs the packed symmetric rank-1 update

A = α × x × x^T + A

where A is an n × n symmetric matrix stored in packed format, x is a vector, and α is a scalar.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.7. cublas<t>spr2()
cublasStatus_t cublasSspr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha,
const float *x, int incx,
const float *y, int incy, float *AP)
cublasStatus_t cublasDspr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const double *x, int incx,
const double *y, int incy, double *AP)
This function performs the packed symmetric rank-2 update

A = α × (x × y^T + y × x^T) + A

where A is an n × n symmetric matrix stored in packed format, x and y are vectors, and α is a
scalar.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.8. cublas<t>symv()
cublasStatus_t cublasSsymv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const float *alpha,
                           const float *A, int lda,
                           const float *x, int incx, const float *beta,
                           float *y, int incy)
This function performs the symmetric matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n symmetric matrix stored in lower or upper mode, x and y are vectors, and
α and β are scalars.
This function has an alternate faster implementation using atomics that can be enabled with
cublasSetAtomicsMode().
Please see the section on the function cublasSetAtomicsMode() for more details about the
usage of atomics.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then y does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.9. cublas<t>syr()
cublasStatus_t cublasSsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const float *alpha,
                          const float *x, int incx, float *A, int lda)
cublasStatus_t cublasDsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const double *alpha,
                          const double *x, int incx, double *A, int lda)
cublasStatus_t cublasCsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const cuComplex *alpha,
                          const cuComplex *x, int incx, cuComplex *A, int lda)
cublasStatus_t cublasZsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const cuDoubleComplex *alpha,
                          const cuDoubleComplex *x, int incx, cuDoubleComplex *A, int lda)
This function performs the symmetric rank-1 update

A = α × x × x^T + A

where A is an n × n symmetric matrix stored in column-major format, x is a vector, and α is a
scalar.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.10. cublas<t>syr2()
cublasStatus_t cublasSsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const float *alpha, const float *x, int incx,
                           const float *y, int incy, float *A, int lda)
cublasStatus_t cublasDsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const double *alpha, const double *x, int incx,
                           const double *y, int incy, double *A, int lda)
cublasStatus_t cublasCsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const cuComplex *alpha, const cuComplex *x, int incx,
                           const cuComplex *y, int incy, cuComplex *A, int lda)
cublasStatus_t cublasZsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)
This function performs the symmetric rank-2 update

A = α × (x × y^T + y × x^T) + A

where A is an n × n symmetric matrix stored in column-major format, x and y are vectors, and α
is a scalar.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.11. cublas<t>tbmv()
cublasStatus_t cublasStbmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const float *A, int lda,
float *x, int incx)
cublasStatus_t cublasDtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const double *A, int lda,
double *x, int incx)
cublasStatus_t cublasCtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const cuComplex *A, int lda,
cuComplex *x, int incx)
cublasStatus_t cublasZtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
This function performs the triangular banded matrix-vector multiplication

x = op(A) × x

where A is a triangular banded matrix, and x is a vector. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C

uplo input indicates if matrix A lower or upper part is stored, the other part
is not referenced and is inferred from the stored elements.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.12. cublas<t>tbsv()
cublasStatus_t cublasStbsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const float *A, int lda,
float *x, int incx)
cublasStatus_t cublasDtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const double *A, int lda,
double *x, int incx)
cublasStatus_t cublasCtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const cuComplex *A, int lda,
cuComplex *x, int incx)
cublasStatus_t cublasZtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, int k, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
This function solves the triangular banded linear system with a single right-hand-side

op(A) × x = b

where A is a triangular banded matrix, and x and b are vectors. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C

Also, the elements in the array A that do not conceptually correspond to the elements in the
banded matrix (the top left or bottom right k × k triangle, depending on uplo) are not
referenced.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates if matrix A lower or upper part is stored, the other part
is not referenced and is inferred from the stored elements.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
A device input <type> array of dimension lda x n, with lda >= k+1.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.6.13. cublas<t>tpmv()
cublasStatus_t cublasStpmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float *AP,
float *x, int incx)
cublasStatus_t cublasDtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double *AP,
double *x, int incx)
cublasStatus_t cublasCtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex *AP,
cuComplex *x, int incx)
cublasStatus_t cublasZtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const cuDoubleComplex *AP,
                           cuDoubleComplex *x, int incx)
This function performs the triangular packed matrix-vector multiplication

x = op(A) × x

where A is a triangular matrix stored in packed format, and x is a vector. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the
triangular matrix A are packed together column by column without gaps, so that the element
A(i,j) is stored in the memory location AP[i+((2*n-j+1)*j)/2] for j = 1, …, n and i >= j.
Consequently, the packed format requires only n(n+1)/2 elements for storage.
uplo input indicates if matrix A lower or upper part is stored, the other part
is not referenced and is inferred from the stored elements.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ if n < 0 or
‣ if incx == 0 or
‣ if diag != CUBLAS_DIAG_UNIT,
CUBLAS_DIAG_NON_UNIT
2.6.14. cublas<t>tpsv()
cublasStatus_t cublasStpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float *AP,
float *x, int incx)
cublasStatus_t cublasDtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double *AP,
double *x, int incx)
cublasStatus_t cublasCtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex *AP,
cuComplex *x, int incx)
cublasStatus_t cublasZtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuDoubleComplex *AP,
cuDoubleComplex *x, int incx)
This function solves the packed triangular linear system with a single right-hand-side

op(A) × x = b

where A is a triangular matrix stored in packed format, and x and b are vectors. Also, for
matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C
uplo input indicates if matrix A lower or upper part is stored, the other part
is not referenced and is inferred from the stored elements.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ if n < 0 or
‣ if incx == 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if diag != CUBLAS_DIAG_UNIT,
CUBLAS_DIAG_NON_UNIT
2.6.15. cublas<t>trmv()
cublasStatus_t cublasStrmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const float *A, int lda,
                           float *x, int incx)
This function performs the triangular matrix-vector multiplication

x = op(A) × x

where A is a triangular matrix stored in lower or upper mode with or without the main
diagonal, and x is a vector. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C
uplo input indicates if matrix A lower or upper part is stored, the other part
is not referenced and is inferred from the stored elements.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ if n < 0 or
‣ if incx == 0 or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if diag != CUBLAS_DIAG_UNIT,
CUBLAS_DIAG_NON_UNIT
2.6.16. cublas<t>trsv()
cublasStatus_t cublasStrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float *A, int lda,
float *x, int incx)
cublasStatus_t cublasDtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double *A, int lda,
double *x, int incx)
cublasStatus_t cublasCtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex *A, int lda,
cuComplex *x, int incx)
cublasStatus_t cublasZtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
This function solves the triangular linear system with a single right-hand-side

op(A) × x = b

where A is a triangular matrix stored in lower or upper mode with or without the main
diagonal, and x and b are vectors. Also, for matrix A

op(A) = A    if trans == CUBLAS_OP_N
op(A) = A^T  if trans == CUBLAS_OP_T
op(A) = A^H  if trans == CUBLAS_OP_C
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ if n < 0 or
‣ if incx == 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if diag != CUBLAS_DIAG_UNIT,
CUBLAS_DIAG_NON_UNIT
2.6.17. cublas<t>hemv()
cublasStatus_t cublasChemv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
const cuComplex *beta,
cuComplex *y, int incy)
This function performs the Hermitian matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n Hermitian matrix stored in lower or upper mode, x and y are vectors, and α
and β are scalars.
This function has an alternate faster implementation using atomics that can be enabled with
cublasSetAtomicsMode().
Please see the section on the function cublasSetAtomicsMode() for more details about the
usage of atomics.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then y does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx = 0 or incy = 0 or
‣ lda < n
2.6.18. cublas<t>hbmv()
cublasStatus_t cublasChbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
const cuComplex *beta,
cuComplex *y, int incy)
cublasStatus_t cublasZhbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)
This function performs the Hermitian banded matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n Hermitian banded matrix with k subdiagonals and superdiagonals, x and y
are vectors, and α and β are scalars.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then y does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if lda < (k + 1) or
‣ alpha == NULL or beta == NULL
2.6.19. cublas<t>hpmv()
cublasStatus_t cublasChpmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex *alpha,
const cuComplex *AP,
const cuComplex *x, int incx,
const cuComplex *beta,
cuComplex *y, int incy)
This function performs the Hermitian packed matrix-vector multiplication

y = α × A × x + β × y

where A is an n × n Hermitian matrix stored in packed format, x and y are vectors, and α and β
are scalars.
If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the
Hermitian matrix A are packed together column by column without gaps, so that the element
A(i,j) is stored in the memory location AP[i+((2*n-j+1)*j)/2] for j = 1, …, n and i >= j.
Consequently, the packed format requires only n(n+1)/2 elements for storage.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
AP device input <type> array with A stored in packed format. The imaginary
parts of the diagonal elements are assumed to be zero.
beta host or device input <type> scalar used for multiplication, if beta==0 then y
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx == 0 or incy == 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
2.6.20. cublas<t>her()
cublasStatus_t cublasCher(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha,
const cuComplex *x, int incx,
cuComplex *A, int lda)
cublasStatus_t cublasZher(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *A, int lda)
This function performs the Hermitian rank-1 update

A = α × x × x^H + A

where A is an n × n Hermitian matrix stored in column-major format, x is a vector, and α is a
(real) scalar.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx == 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
2.6.21. cublas<t>her2()
cublasStatus_t cublasCher2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex *alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
cublasStatus_t cublasZher2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
This function performs the Hermitian rank-2 update A = α x yᴴ + conj(α) y xᴴ + A,
where A is an n × n Hermitian matrix stored in column-major format, x and y are vectors,
and α is a scalar.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates whether the lower or upper part of matrix A is
stored; the other Hermitian part is not referenced and is
inferred from the stored elements.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx == 0 or incy == 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
2.6.22. cublas<t>hpr()
cublasStatus_t cublasChpr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha,
const cuComplex *x, int incx,
cuComplex *AP)
cublasStatus_t cublasZhpr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *AP)
uplo input indicates whether the lower or upper part of matrix A is
stored; the other Hermitian part is not referenced and is
inferred from the stored elements.
AP device in/out <type> array with A stored in packed format. The imaginary
parts of the diagonal elements are assumed and set to zero.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx == 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
‣ alpha == NULL
2.6.23. cublas<t>hpr2()
cublasStatus_t cublasChpr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex *alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *AP)
cublasStatus_t cublasZhpr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *AP)
This function performs the packed Hermitian rank-2 update A = α x yᴴ + conj(α) y xᴴ + A,
where A is an n × n Hermitian matrix stored in packed format, x and y are vectors, and α is a
scalar.
If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the
Hermitian matrix A are packed together column by column without gaps, so that the element
A(i,j) is stored in the memory location AP[i+((2*n-j+1)*j)/2] for j = 1, ..., n and i >= j.
Consequently, the packed format requires only n(n+1)/2 elements for storage.
uplo input indicates whether the lower or upper part of matrix A is
stored; the other Hermitian part is not referenced and is
inferred from the stored elements.
AP device in/out <type> array with A stored in packed format. The imaginary
parts of the diagonal elements are assumed and set to zero.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if incx == 0 or incy == 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
‣ alpha == NULL
2.6.24. cublas<t>gemvBatched()
cublasStatus_t cublasSgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const float *alpha,
const float *Aarray[], int lda,
const float *xarray[], int incx,
const float *beta,
float *yarray[], int incy,
int batchCount)
cublasStatus_t cublasDgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const double *alpha,
const double *Aarray[], int lda,
const double *xarray[], int incx,
const double *beta,
double *yarray[], int incy,
int batchCount)
cublasStatus_t cublasCgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const cuComplex *alpha,
const cuComplex *Aarray[], int lda,
const cuComplex *xarray[], int incx,
const cuComplex *beta,
cuComplex *yarray[], int incy,
int batchCount)
cublasStatus_t cublasZgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *Aarray[], int lda,
const cuDoubleComplex *xarray[], int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *yarray[], int incy,
int batchCount)
cublasStatus_t cublasHSHgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
int m, int n,
const float *alpha,
const __half *Aarray[], int lda,
const __half *xarray[], int incx,
This function performs the matrix-vector multiplication of a batch of matrices and vectors.
The batch is considered to be "uniform", i.e. all instances have the same dimensions (m, n),
leading dimension (lda), increments (incx, incy) and transposition (trans) for their respective
A matrix, x and y vectors. The address of the input matrix and vector, and the output vector
of each instance of the batch are read from arrays of pointers passed to the function by the
caller.
y[i] = α op(A[i]) x[i] + β y[i], for i ∈ [0, batchCount − 1], where α and β are scalars, and A
is an array of pointers to matrices A[i] stored in column-major format with dimension m × n,
and x and y are arrays of pointers to vectors. Also, for matrix A[i]: op(A[i]) = A[i] if
trans == CUBLAS_OP_N, op(A[i]) = A[i]^T if trans == CUBLAS_OP_T, and op(A[i]) = A[i]^H if
trans == CUBLAS_OP_C.
Note: the y[i] vectors must not overlap, i.e. the individual gemv operations must be
computable independently; otherwise, undefined behavior is expected.
On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemv
in different CUDA streams, rather than use this API.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
Aarray device input array of pointers to <type> array, with each array of dim.
lda x n with lda>=max(1,m).
All pointers must meet certain alignment criteria. Please
see below for details.
xarray device input array of pointers to <type> array, with each dimension n if
trans==CUBLAS_OP_N and m otherwise.
All pointers must meet certain alignment criteria. Please
see below for details.
beta host or device input <type> scalar used for multiplication. If beta == 0, y does
not have to be a valid input.
If math mode enables fast math modes when using cublasSgemvBatched(), pointers (not
the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned
memory access errors. Ideally all pointers are aligned to at least 16 Bytes. Otherwise it is
recommended that they meet the following rule:
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
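The uniform-batch semantics above (one independent gemv per instance, addresses read from pointer arrays) can be sketched on the CPU for the trans == CUBLAS_OP_N, unit-increment case. gemv_batched_ref is our illustrative name; the real routine runs the instances on the GPU through a cuBLAS handle:

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of uniform batched gemv, trans == CUBLAS_OP_N:
 * for every instance b, y[b] = alpha * A[b] * x[b] + beta * y[b].
 * Each A[b] is m x n, column-major with leading dimension lda;
 * increments are fixed at 1 for brevity. */
static void gemv_batched_ref(size_t m, size_t n, float alpha,
                             const float *const Aarray[], size_t lda,
                             const float *const xarray[],
                             float beta, float *const yarray[],
                             size_t batchCount)
{
    for (size_t b = 0; b < batchCount; ++b)      /* independent gemvs */
        for (size_t r = 0; r < m; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < n; ++c)
                acc += Aarray[b][r + c * lda] * xarray[b][c];
            yarray[b][r] = alpha * acc + beta * yarray[b][r];
        }
}
```

Because every instance is independent, the batch can be computed in any order, which is exactly why overlapping y vectors are disallowed.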
2.6.25. cublas<t>gemvStridedBatched()
cublasStatus_t cublasSgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
const float *alpha,
const float *A, int lda,
long long int strideA,
const float *x, int incx,
long long int stridex,
const float *beta,
float *y, int incy,
long long int stridey,
int batchCount)
cublasStatus_t cublasDgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
const double *alpha,
const double *A, int lda,
long long int strideA,
const double *x, int incx,
long long int stridex,
const double *beta,
double *yarray[], int incy,
long long int stridey,
int batchCount)
cublasStatus_t cublasCgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
long long int strideA,
const cuComplex *x, int incx,
long long int stridex,
const cuComplex *beta,
cuComplex *y, int incy,
long long int stridey,
int batchCount)
cublasStatus_t cublasZgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
long long int strideA,
const cuDoubleComplex *x, int incx,
long long int stridex,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy,
long long int stridey,
int batchCount)
cublasStatus_t cublasHSHgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
const float *alpha,
const __half *A, int lda,
long long int strideA,
const __half *x, int incx,
long long int stridex,
const float *beta,
__half *y, int incy,
long long int stridey,
int batchCount)
cublasStatus_t cublasHSSgemvStridedBatched(cublasHandle_t handle,
cublasOperation_t trans,
int m, int n,
This function performs the matrix-vector multiplication of a batch of matrices and vectors.
The batch is considered to be "uniform", i.e. all instances have the same dimensions (m, n),
leading dimension (lda), increments (incx, incy) and transposition (trans) for their respective
A matrix, x and y vectors. Input matrix A and vector x, and output vector y for each instance
of the batch are located at fixed offsets in number of elements from their locations in the
previous instance. Pointers to A matrix, x and y vectors for the first instance are passed
to the function by the user along with offsets in number of elements - strideA, stridex and
stridey that determine the locations of input matrices and vectors, and output vectors in future
instances.
y[i] = α op(A[i]) x[i] + β y[i], for i ∈ [0, batchCount − 1], where α and β are scalars, A is a
pointer to the first matrix A[0] stored in column-major format with dimension m × n, and x
and y are pointers to the first vectors x[0] and y[0]. Also, for matrix A[i]: op(A[i]) = A[i] if
trans == CUBLAS_OP_N, op(A[i]) = A[i]^T if trans == CUBLAS_OP_T, and op(A[i]) = A[i]^H if
trans == CUBLAS_OP_C.
Note: the y[i] vectors must not overlap, i.e. the individual gemv operations must be
computable independently; otherwise, undefined behavior is expected.
On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemv
in different CUDA streams, rather than use this API.
Note: In the table below, we use A[i], x[i], y[i] as notation for A matrix, and x and y
vectors in the ith instance of the batch, implicitly assuming they are respectively offsets in
number of elements strideA, stridex, stridey away from A[i-1], x[i-1], y[i-1].
The unit for the offsets is number of elements, and the offsets must not be zero.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
strideA input Value of type long long int that gives the offset in number
of elements between A[i] and A[i+1]
stridex input Value of type long long int that gives the offset in number
of elements between x[i] and x[i+1]
beta host or device input <type> scalar used for multiplication. If beta == 0, y does
not have to be a valid input.
stridey input Value of type long long int that gives the offset in number
of elements between y[i] and y[i+1]
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
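The strided variant replaces the pointer arrays with fixed element offsets: instance i lives at A + i*strideA, x + i*stridex, y + i*stridey. A CPU sketch for the trans == CUBLAS_OP_N, unit-increment case (gemv_strided_batched_ref is our own name; choosing strideA = lda*n packs the A matrices back to back):

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of gemvStridedBatched, trans == CUBLAS_OP_N, incx == incy == 1:
 * the b-th A, x and y live at fixed element offsets strideA, stridex,
 * stridey from the previous instance, so no pointer arrays are needed. */
static void gemv_strided_batched_ref(size_t m, size_t n, float alpha,
                                     const float *A, size_t lda, size_t strideA,
                                     const float *x, size_t stridex,
                                     float beta, float *y, size_t stridey,
                                     size_t batchCount)
{
    for (size_t b = 0; b < batchCount; ++b) {
        const float *Ab = A + b * strideA;   /* A[b] */
        const float *xb = x + b * stridex;   /* x[b] */
        float       *yb = y + b * stridey;   /* y[b] */
        for (size_t r = 0; r < m; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < n; ++c)
                acc += Ab[r + c * lda] * xb[c];
            yb[r] = alpha * acc + beta * yb[r];
        }
    }
}
```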
2.7.1. cublas<t>gemm()
cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc)
cublasStatus_t cublasDgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const double *alpha,
const double *A, int lda,
const double *B, int ldb,
const double *beta,
double *C, int ldc)
cublasStatus_t cublasCgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
cublasStatus_t cublasHgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const __half *alpha,
const __half *A, int lda,
const __half *B, int ldb,
const __half *beta,
__half *C, int ldc)
This function performs the matrix-matrix multiplication C = α op(A) op(B) + β C, where α and
β are scalars, and A, B and C are matrices stored in column-major format with dimensions
op(A) m × k, op(B) k × n and C m × n, respectively. Also, for matrix A: op(A) = A if
transa == CUBLAS_OP_N, op(A) = A^T if transa == CUBLAS_OP_T, and op(A) = A^H if
transa == CUBLAS_OP_C; and similarly for op(B).
beta host or device input <type> scalar used for multiplication. If beta==0, C does not
have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If m, n, k < 0 or
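The operation C = α op(A) op(B) + β C can be written as a CPU reference for the real, non-conjugated case; sgemm_ref is our own illustrative name (the actual cublasSgemm runs on device memory through a handle), with transa/transb reduced to 'N'/'T' characters for brevity:

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of C = alpha * op(A) * op(B) + beta * C for the real case.
 * transa/transb select op(X) = X ('N') or X^T ('T'); all matrices are
 * column-major, with op(A) m x k, op(B) k x n and C m x n. */
static void sgemm_ref(char transa, char transb, size_t m, size_t n, size_t k,
                      float alpha, const float *A, size_t lda,
                      const float *B, size_t ldb,
                      float beta, float *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; ++l) {
                float a = (transa == 'N') ? A[i + l * lda] : A[l + i * lda];
                float b = (transb == 'N') ? B[l + j * ldb] : B[j + l * ldb];
                acc += a * b;
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}
```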
2.7.2. cublas<t>gemm3m()
cublasStatus_t cublasCgemm3m(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZgemm3m(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
This function performs the complex matrix-matrix multiplication C = α op(A) op(B) + β C
using the Gauss complexity-reduction algorithm, which can lead to a performance increase of
up to 25%, where α and β are scalars, and A, B and C are matrices stored in column-major
format with dimensions op(A) m × k, op(B) k × n and C m × n, respectively. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, op(A) = A^T if transa == CUBLAS_OP_T, and
op(A) = A^H if transa == CUBLAS_OP_C; and similarly for op(B).
Note: These two routines are only supported on GPUs with compute capability 5.0 or
greater.
beta host or device input <type> scalar used for multiplication. If beta==0, C does not
have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If m, n, k < 0 or
‣ if transa, transb != CUBLAS_OP_N,
CUBLAS_OP_C, CUBLAS_OP_T or
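The "3M" in the name refers to Gauss's reduction: a complex product normally costs four real multiplications, but three suffice, and in a matrix setting each of the three products becomes one real matrix multiplication. The scalar sketch below (our own, not the cuBLAS kernel) shows the identity:

```c
#include <assert.h>

/* Gauss 3-multiplication trick for (a + bi) * (c + di):
 * three real products instead of four. */
static void gauss_3m(float a, float b, float c, float d,
                     float *re, float *im)
{
    float t1 = c * (a + b);   /* multiplication 1 */
    float t2 = a * (d - c);   /* multiplication 2 */
    float t3 = b * (c + d);   /* multiplication 3 */
    *re = t1 - t3;            /* = a*c - b*d */
    *im = t1 + t2;            /* = a*d + b*c */
}
```

The extra additions are cheap relative to the saved multiplication, which is where the up-to-25% speedup comes from.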
2.7.3. cublas<t>gemmBatched()
cublasStatus_t cublasHgemmBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const __half *alpha,
const __half *Aarray[], int lda,
const __half *Barray[], int ldb,
const __half *beta,
__half *Carray[], int ldc,
int batchCount)
cublasStatus_t cublasSgemmBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *Aarray[], int lda,
const float *Barray[], int ldb,
const float *beta,
float *Carray[], int ldc,
int batchCount)
cublasStatus_t cublasDgemmBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const double *alpha,
const double *Aarray[], int lda,
const double *Barray[], int ldb,
const double *beta,
double *Carray[], int ldc,
int batchCount)
cublasStatus_t cublasCgemmBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const cuComplex *alpha,
const cuComplex *Aarray[], int lda,
const cuComplex *Barray[], int ldb,
This function performs the matrix-matrix multiplication of a batch of matrices. The batch is
considered to be "uniform", i.e. all instances have the same dimensions (m, n, k), leading
dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C
matrices. The address of the input matrices and the output matrix of each instance of the
batch are read from arrays of pointers passed to the function by the caller.
C[i] = α op(A[i]) op(B[i]) + β C[i], for i ∈ [0, batchCount − 1], where α and β are scalars, and
A, B and C are arrays of pointers to matrices stored in column-major format with dimensions
op(A[i]) m × k, op(B[i]) k × n and C[i] m × n, respectively. Also, for matrix A[i]:
op(A[i]) = A[i] if transa == CUBLAS_OP_N, op(A[i]) = A[i]^T if transa == CUBLAS_OP_T, and
op(A[i]) = A[i]^H if transa == CUBLAS_OP_C; and similarly for op(B[i]).
Aarray device input array of pointers to <type> array, with each array of dim.
lda x k with lda>=max(1,m) if transa==CUBLAS_OP_N
and lda x m with lda>=max(1,k) otherwise.
Barray device input array of pointers to <type> array, with each array of dim.
ldb x n with ldb>=max(1,k) if transb==CUBLAS_OP_N
and ldb x k with ldb>=max(1,n) otherwise.
All pointers must meet certain alignment criteria. Please
see below for details.
beta host or device input <type> scalar used for multiplication. If beta == 0, C does
not have to be a valid input.
Carray device in/out array of pointers to <type> array. It has dimensions ldc x
n with ldc>=max(1,m). Matrices C[i] should not overlap;
otherwise, undefined behavior is expected.
All pointers must meet certain alignment criteria. Please
see below for details.
If math mode enables fast math modes when using cublasSgemmBatched(), pointers (not
the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned
memory access errors. Ideally all pointers are aligned to at least 16 Bytes. Otherwise it is
recommended that they meet the following rule:
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.7.4. cublas<t>gemmStridedBatched()
cublasStatus_t cublasHgemmStridedBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const __half *alpha,
const __half *A, int lda,
long long int strideA,
const __half *B, int ldb,
long long int strideB,
const __half *beta,
__half *C, int ldc,
long long int strideC,
int batchCount)
cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *A, int lda,
long long int strideA,
const float *B, int ldb,
long long int strideB,
const float *beta,
float *C, int ldc,
long long int strideC,
int batchCount)
cublasStatus_t cublasDgemmStridedBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const double *alpha,
const double *A, int lda,
long long int strideA,
const double *B, int ldb,
long long int strideB,
const double *beta,
double *C, int ldc,
long long int strideC,
int batchCount)
cublasStatus_t cublasCgemmStridedBatched(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
long long int strideA,
const cuComplex *B, int ldb,
long long int strideB,
This function performs the matrix-matrix multiplication of a batch of matrices. The batch is
considered to be "uniform", i.e. all instances have the same dimensions (m, n, k), leading
dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C
matrices. Input matrices A, B and output matrix C for each instance of the batch are located
at fixed offsets in number of elements from their locations in the previous instance. Pointers
to A, B and C matrices for the first instance are passed to the function by the user along with
offsets in number of elements - strideA, strideB and strideC that determine the locations of
input and output matrices in future instances.
C[i] = α op(A[i]) op(B[i]) + β C[i], for i ∈ [0, batchCount − 1], where α and β are scalars, and
A, B and C are base pointers to matrices stored in column-major format with dimensions
op(A[i]) m × k, op(B[i]) k × n and C[i] m × n, respectively. Also, for matrix A[i]:
op(A[i]) = A[i] if transa == CUBLAS_OP_N, op(A[i]) = A[i]^T if transa == CUBLAS_OP_T, and
op(A[i]) = A[i]^H if transa == CUBLAS_OP_C; and similarly for op(B[i]).
Note: In the table below, we use A[i], B[i], C[i] as notation for the A, B and C matrices in
the ith instance of the batch, implicitly assuming they are respectively offsets in number of
elements strideA, strideB, strideC away from A[i-1], B[i-1], C[i-1]. The unit for
the offset is number of elements and must not be zero.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
strideA input Value of type long long int that gives the offset in number
of elements between A[i] and A[i+1]
strideB input Value of type long long int that gives the offset in number
of elements between B[i] and B[i+1]
beta host or device input <type> scalar used for multiplication. If beta == 0, C does
not have to be a valid input.
strideC input Value of type long long int that gives the offset in number
of elements between C[i] and C[i+1]
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
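The strided-batch layout can be sketched on the CPU for the no-transpose real case; sgemm_strided_batched_ref is our own name. With strideA = lda*k, strideB = ldb*n and strideC = ldc*n the instances are simply stored back to back in memory:

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of gemmStridedBatched for the no-transpose real case:
 * for every instance b, C_b = alpha * A_b * B_b + beta * C_b, where
 * A_b = A + b*strideA, B_b = B + b*strideB, C_b = C + b*strideC. */
static void sgemm_strided_batched_ref(size_t m, size_t n, size_t k, float alpha,
                                      const float *A, size_t lda, size_t strideA,
                                      const float *B, size_t ldb, size_t strideB,
                                      float beta, float *C, size_t ldc, size_t strideC,
                                      size_t batchCount)
{
    for (size_t bi = 0; bi < batchCount; ++bi) {
        const float *Ab = A + bi * strideA;
        const float *Bb = B + bi * strideB;
        float       *Cb = C + bi * strideC;
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (size_t l = 0; l < k; ++l)
                    acc += Ab[i + l * lda] * Bb[l + j * ldb];
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
    }
}
```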
2.7.5. cublas<t>symm()
cublasStatus_t cublasSsymm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc)
cublasStatus_t cublasDsymm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const double *alpha,
const double *A, int lda,
const double *B, int ldb,
const double *beta,
double *C, int ldc)
cublasStatus_t cublasCsymm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZsymm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
This function performs the symmetric matrix-matrix multiplication C = α A B + β C if
side == CUBLAS_SIDE_LEFT, or C = α B A + β C if side == CUBLAS_SIDE_RIGHT, where A is a
symmetric matrix stored in lower or upper mode, B and C are m × n matrices, and α and β
are scalars.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates whether the lower or upper part of matrix A is
stored; the other symmetric part is not referenced and is
inferred from the stored elements.
beta host or device input <type> scalar used for multiplication, if beta == 0 then C
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If m, n < 0 or
‣ if side != CUBLAS_SIDE_LEFT,
CUBLAS_SIDE_RIGHT or
2.7.6. cublas<t>syrk()
cublasStatus_t cublasSsyrk(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const float *alpha,
const float *A, int lda,
const float *beta,
float *C, int ldc)
cublasStatus_t cublasDsyrk(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const double *alpha,
const double *A, int lda,
const double *beta,
double *C, int ldc)
cublasStatus_t cublasCsyrk(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZsyrk(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
This function performs the symmetric rank-k update C = α op(A) op(A)^T + β C, where α and
β are scalars, C is an n × n symmetric matrix stored in lower or upper mode, and A is a
matrix with dimensions op(A) n × k. Also, for matrix A: op(A) = A if trans == CUBLAS_OP_N,
and op(A) = A^T if trans == CUBLAS_OP_T.
uplo input indicates whether the lower or upper part of matrix C is
stored; the other symmetric part is not referenced and is
inferred from the stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n, k < 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
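Since C is symmetric, only the triangle selected by uplo is read and written. A CPU sketch of the trans == CUBLAS_OP_N, lower-mode update (ssyrk_ref_lower is our own name):

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of C = alpha * A * A^T + beta * C for trans == CUBLAS_OP_N
 * and uplo == CUBLAS_FILL_MODE_LOWER: only the lower triangle of the
 * n x n matrix C is touched; the upper part is left as-is, mirroring
 * the cuBLAS storage convention. */
static void ssyrk_ref_lower(size_t n, size_t k, float alpha,
                            const float *A, size_t lda,
                            float beta, float *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j; i < n; ++i) {     /* lower triangle only */
            float acc = 0.0f;
            for (size_t l = 0; l < k; ++l)
                acc += A[i + l * lda] * A[j + l * lda];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}
```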
2.7.7. cublas<t>syr2k()
cublasStatus_t cublasSsyr2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc)
cublasStatus_t cublasDsyr2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const double *alpha,
const double *A, int lda,
const double *B, int ldb,
const double *beta,
double *C, int ldc)
cublasStatus_t cublasCsyr2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZsyr2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
This function performs the symmetric rank-2k update
C = α (op(A) op(B)^T + op(B) op(A)^T) + β C, where α and β are scalars, C is an n × n
symmetric matrix stored in lower or upper mode, and A and B are matrices with dimensions
op(A) n × k and op(B) n × k, respectively. Also, for matrices A and B: op(A) = A and
op(B) = B if trans == CUBLAS_OP_N, and op(A) = A^T and op(B) = B^T if
trans == CUBLAS_OP_T.
beta host or device input <type> scalar used for multiplication, if beta==0, then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n, k < 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.7.8. cublas<t>syrkx()
This function performs a variation of the symmetric rank-k update
C = α op(A) op(B)^T + β C, where α and β are scalars, C is an n × n symmetric matrix stored
in lower or upper mode, and A and B are matrices with dimensions op(A) n × k and
op(B) n × k, respectively. Also, for matrices A and B: op(A) = A and op(B) = B if
trans == CUBLAS_OP_N, and op(A) = A^T and op(B) = B^T if trans == CUBLAS_OP_T.
This routine can be used when the matrix B is such that the result is guaranteed to be
symmetric. A typical example is when the matrix B is a scaled form of the matrix A: this is
equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient
computation of the product of a regular matrix with a diagonal matrix, refer to the routine
cublas<t>dgmm.
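The symmetry claim is easy to verify: if B = A D with D diagonal, then A Bᵀ = A D Aᵀ, which equals its own transpose. The small helper below (a_d_at, our own name for illustration) forms M = A diag(d) Aᵀ directly:

```c
#include <assert.h>
#include <stddef.h>

/* Forms M = A * diag(d) * A^T for a column-major n x k matrix A,
 * illustrating why syrkx applies when B is a scaled form of A. */
static void a_d_at(size_t n, size_t k, const float *A, size_t lda,
                   const float *d, float *M, size_t ldm)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; ++l)
                acc += A[i + l * lda] * d[l] * A[j + l * lda];
            M[i + j * ldm] = acc;      /* M(i,j) == M(j,i) by construction */
        }
}
```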
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates whether the lower or upper part of matrix C is
stored; the other symmetric part is not referenced and is
inferred from the stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0, then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n, k < 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.7.9. cublas<t>trmm()
cublasStatus_t cublasStrmm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
float *C, int ldc)
cublasStatus_t cublasDtrmm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const double *alpha,
const double *A, int lda,
const double *B, int ldb,
double *C, int ldc)
cublasStatus_t cublasCtrmm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
cuComplex *C, int ldc)
cublasStatus_t cublasZtrmm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
cuDoubleComplex *C, int ldc)
This function performs the triangular matrix-matrix multiplication C = α op(A) B if
side == CUBLAS_SIDE_LEFT, or C = α B op(A) if side == CUBLAS_SIDE_RIGHT, where A is a
triangular matrix stored in lower or upper mode with or without the main diagonal, B and C
are m × n matrices, and α is a scalar. Also, for matrix A: op(A) = A if
trans == CUBLAS_OP_N, op(A) = A^T if trans == CUBLAS_OP_T, and op(A) = A^H if
trans == CUBLAS_OP_C.
Notice that in order to achieve better parallelism cuBLAS differs from the BLAS API only for
this routine. The BLAS API assumes an in-place implementation (with results written back
to B), while the cuBLAS API assumes an out-of-place implementation (with results written
into C). The application can obtain the in-place functionality of BLAS in the cuBLAS API by
passing the address of the matrix B in place of the matrix C. No other overlapping in the input
parameters is supported.
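The out-of-place versus in-place behavior described above can be sketched on the CPU for the left-sided, lower, non-transposed, non-unit case. strmm_ref_llnn is our own name; it buffers each output column before storing it, which is exactly what makes passing B's address as C (the in-place BLAS behavior) safe:

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of C = alpha * A * B for side == CUBLAS_SIDE_LEFT,
 * uplo == CUBLAS_FILL_MODE_LOWER, trans == CUBLAS_OP_N,
 * diag == CUBLAS_DIAG_NON_UNIT. Each output column is accumulated in
 * a buffer before being stored, so calling it with C == B reproduces
 * the in-place behavior of the classic BLAS strmm.
 * Assumes m <= 64 for this sketch. */
static void strmm_ref_llnn(size_t m, size_t n, float alpha,
                           const float *A, size_t lda,
                           const float *B, size_t ldb,
                           float *C, size_t ldc)
{
    float tmp[64];
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t l = 0; l <= i; ++l)  /* A is lower triangular */
                acc += A[i + l * lda] * B[l + j * ldb];
            tmp[i] = alpha * acc;
        }
        for (size_t i = 0; i < m; ++i)
            C[i + j * ldc] = tmp[i];
    }
}
```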
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
alpha host or device input <type> scalar used for multiplication, if alpha==0 then A is
not referenced and B does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
CUBLAS_STATUS_INVALID_VALUE ‣ If m, n < 0 or
‣ if trans != CUBLAS_OP_N, CUBLAS_OP_C,
CUBLAS_OP_T or
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if side != CUBLAS_SIDE_LEFT,
CUBLAS_SIDE_RIGHT or
2.7.10. cublas<t>trsm()
cublasStatus_t cublasStrsm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const float *alpha,
const float *A, int lda,
float *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const double *alpha,
const double *A, int lda,
double *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
cuComplex *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
cuDoubleComplex *B, int ldb)
This function solves the triangular linear system with multiple right-hand sides
op(A) X = α B if side == CUBLAS_SIDE_LEFT, or X op(A) = α B if
side == CUBLAS_SIDE_RIGHT, where A is a triangular matrix stored in lower or upper mode
with or without the main diagonal, X and B are m × n matrices, and α is a scalar. Also, for
matrix A: op(A) = A if trans == CUBLAS_OP_N, op(A) = A^T if trans == CUBLAS_OP_T, and
op(A) = A^H if trans == CUBLAS_OP_C. The solution X overwrites the right-hand-side matrix
B on exit.
alpha host or device input <type> scalar used for multiplication, if alpha==0 then A is
not referenced and B does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if side != CUBLAS_SIDE_LEFT,
CUBLAS_SIDE_RIGHT or
‣ if diag != CUBLAS_DIAG_NON_UNIT,
CUBLAS_DIAG_UNIT or
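For the left-sided, lower, non-transposed, non-unit case the solve is plain forward substitution, one independent triangular solve per right-hand side. strsm_ref_llnn is our own CPU sketch, overwriting B with the solution as cuBLAS does:

```c
#include <assert.h>
#include <stddef.h>

/* CPU sketch of solving op(A) * X = alpha * B for side == CUBLAS_SIDE_LEFT,
 * uplo == CUBLAS_FILL_MODE_LOWER, trans == CUBLAS_OP_N,
 * diag == CUBLAS_DIAG_NON_UNIT by forward substitution.
 * The solution X overwrites the right-hand sides in B. */
static void strsm_ref_llnn(size_t m, size_t n, float alpha,
                           const float *A, size_t lda,
                           float *B, size_t ldb)
{
    for (size_t j = 0; j < n; ++j)           /* one solve per RHS column */
        for (size_t i = 0; i < m; ++i) {
            float acc = alpha * B[i + j * ldb];
            for (size_t l = 0; l < i; ++l)   /* subtract known terms */
                acc -= A[i + l * lda] * B[l + j * ldb];
            B[i + j * ldb] = acc / A[i + i * lda];
        }
}
```

With diag == CUBLAS_DIAG_UNIT the division by the diagonal would be skipped, since the diagonal is assumed to be all ones.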
2.7.11. cublas<t>trsmBatched()
cublasStatus_t cublasStrsmBatched( cublasHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int m,
int n,
const float *alpha,
const float *const A[],
int lda,
float *const B[],
int ldb,
int batchCount);
cublasStatus_t cublasDtrsmBatched( cublasHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int m,
int n,
const double *alpha,
const double *const A[],
int lda,
double *const B[],
int ldb,
int batchCount);
cublasStatus_t cublasCtrsmBatched( cublasHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int m,
int n,
const cuComplex *alpha,
const cuComplex *const A[],
int lda,
cuComplex *const B[],
int ldb,
int batchCount);
cublasStatus_t cublasZtrsmBatched( cublasHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int m,
int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *const A[],
int lda,
cuDoubleComplex *const B[],
int ldb,
int batchCount);
This function solves an array of triangular linear systems with multiple right-hand sides
op(A[i]) X[i] = α B[i] if side == CUBLAS_SIDE_LEFT, or X[i] op(A[i]) = α B[i] if
side == CUBLAS_SIDE_RIGHT, where A[i] is a triangular matrix stored in lower or upper mode
with or without the main diagonal, X[i] and B[i] are m × n matrices, and α is a scalar. Also,
for matrix A[i]: op(A[i]) = A[i] if trans == CUBLAS_OP_N, op(A[i]) = A[i]^T if
trans == CUBLAS_OP_T, and op(A[i]) = A[i]^H if trans == CUBLAS_OP_C.
uplo input indicates whether the lower or upper part of matrix A[i] is
stored; the other part is not referenced and is inferred from
the stored elements.
alpha host or device input <type> scalar used for multiplication, if alpha==0 then A[i]
is not referenced and B[i] does not have to be a valid input.
A device input array of pointers to <type> array, with each array of dim. lda
x m with lda>=max(1,m) if side == CUBLAS_SIDE_LEFT
and lda x n with lda>=max(1,n) otherwise.
B device in/out array of pointers to <type> array, with each array of dim. ldb
x n with ldb>=max(1,m). Matrices B[i] should not overlap;
otherwise, undefined behavior is expected.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
‣ if side != CUBLAS_SIDE_LEFT,
CUBLAS_SIDE_RIGHT or
‣ if diag != CUBLAS_DIAG_NON_UNIT,
CUBLAS_DIAG_UNIT or
2.7.12. cublas<t>hemm()
cublasStatus_t cublasChemm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const cuComplex *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZhemm(cublasHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)
This function performs the Hermitian matrix-matrix multiplication
C = alpha * A * B + beta * C if side == CUBLAS_SIDE_LEFT, or C = alpha * B * A + beta * C if side == CUBLAS_SIDE_RIGHT,
where A is a Hermitian matrix stored in lower or upper mode, B and C are m x n matrices,
and alpha and beta are scalars.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.7.13. cublas<t>herk()
cublasStatus_t cublasCherk(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const float *alpha,
const cuComplex *A, int lda,
const float *beta,
cuComplex *C, int ldc)
This function performs the Hermitian rank-k update
C = alpha * op(A) * op(A)^H + beta * C,
where alpha and beta are real scalars, C is an n x n Hermitian matrix stored in lower or
upper mode, and A is a matrix with dimensions op(A) n x k. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, or A^H if trans == CUBLAS_OP_C.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.7.14. cublas<t>her2k()
cublasStatus_t cublasCher2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const float *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZher2k(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const double *beta,
cuDoubleComplex *C, int ldc)
This function performs the Hermitian rank-2k update
C = alpha * op(A) * op(B)^H + conj(alpha) * op(B) * op(A)^H + beta * C,
where alpha is a scalar, beta is a real scalar, C is an n x n Hermitian matrix stored in
lower or upper mode, and A and B are matrices with dimensions op(A) n x k and op(B) n x k,
respectively. Also, for matrices A and B:
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^H and op(B) = B^H if trans == CUBLAS_OP_C.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.7.15. cublas<t>herkx()
cublasStatus_t cublasCherkx(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb,
const float *beta,
cuComplex *C, int ldc)
cublasStatus_t cublasZherkx(cublasHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const double *beta,
cuDoubleComplex *C, int ldc)
This function performs a variation of the Hermitian rank-2k update
C = alpha * op(A) * op(B)^H + beta * C,
where alpha is a scalar, beta is a real scalar, C is an n x n Hermitian matrix stored in
lower or upper mode, and A and B are matrices with dimensions op(A) n x k and op(B) n x k,
respectively. Also, for matrices A and B:
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^H and op(B) = B^H if trans == CUBLAS_OP_C.
This routine can be used when matrix B is such that the result is guaranteed to be
Hermitian. A usual example is when matrix B is a scaled form of matrix A: this is
equivalent to B being the product of matrix A and a diagonal matrix. For an efficient
computation of the product of a regular matrix with a diagonal matrix, refer to the routine
cublas<t>dgmm.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
beta host or device input real scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ if uplo != CUBLAS_FILL_MODE_LOWER,
CUBLAS_FILL_MODE_UPPER or
2.8.1. cublas<t>geam()
cublasStatus_t cublasSgeam(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n,
const float *alpha,
const float *A, int lda,
const float *beta,
const float *B, int ldb,
float *C, int ldc)
cublasStatus_t cublasDgeam(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n,
const double *alpha,
const double *A, int lda,
const double *beta,
const double *B, int ldb,
double *C, int ldc)
cublasStatus_t cublasCgeam(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n,
const cuComplex *alpha,
const cuComplex *A, int lda,
const cuComplex *beta ,
const cuComplex *B, int ldb,
cuComplex *C, int ldc)
cublasStatus_t cublasZgeam(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *beta,
const cuDoubleComplex *B, int ldb,
cuDoubleComplex *C, int ldc)
This function performs the matrix-matrix addition/transposition
C = alpha * op(A) + beta * op(B),
where alpha and beta are scalars, and A, B and C are matrices stored in column-major
format with dimensions op(A) m x n, op(B) m x n and C m x n, respectively. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, A^T if transa == CUBLAS_OP_T, or A^H if transa == CUBLAS_OP_C (and similarly for op(B) with transb).
alpha host or device input <type> scalar used for multiplication. If *alpha == 0, A does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.2. cublas<t>dgmm()
cublasStatus_t cublasSdgmm(cublasHandle_t handle, cublasSideMode_t mode,
int m, int n,
const float *A, int lda,
const float *x, int incx,
float *C, int ldc)
cublasStatus_t cublasDdgmm(cublasHandle_t handle, cublasSideMode_t mode,
int m, int n,
const double *A, int lda,
const double *x, int incx,
double *C, int ldc)
cublasStatus_t cublasCdgmm(cublasHandle_t handle, cublasSideMode_t mode,
int m, int n,
const cuComplex *A, int lda,
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.3. cublas<t>getrfBatched()
cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle,
int n,
float *const Aarray[],
int lda,
int *PivotArray,
int *infoArray,
int batchSize);
This function performs the LU factorization of each n x n matrix Aarray[i],
P * A = L * U,
where P is a permutation matrix which represents partial pivoting with row interchanges. L is a
lower triangular matrix with unit diagonal and U is an upper triangular matrix.
Formally P is written as a product of permutation matrices Pj, for j = 1,2,...,n, say P =
P1 * P2 * P3 * .... * Pn. Pj is a permutation matrix which interchanges two rows of
vector x when performing Pj*x; Pj can be constructed from the j-th element of PivotArray[i].
L and U are written back to the original matrix A, and the diagonal elements of L are discarded:
the strictly lower part of A holds L and the upper part, including the diagonal, holds U.
If matrix A(=Aarray[i]) is singular, getrf still works and the value of info(=infoArray[i])
reports the first row index at which the LU factorization cannot proceed: if info is k, U(k,k) is
zero. The equation P*A=L*U still holds, but reconstructing L and U from the output must then
account for the zero diagonal entry of U.
This function is intended to be used for matrices of small sizes where the launch overhead is a
significant factor.
cublas<t>getrfBatched supports non-pivot LU factorization if PivotArray is nil.
cublas<t>getrfBatched supports arbitrary dimension.
cublas<t>getrfBatched only supports compute capability 2.0 or above.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
Aarray device input/ array of pointers to <type> array, with each array of dim. n
output x n with lda>=max(1,n). Matrices Aarray[i] should not
overlap; otherwise, undefined behavior is expected.
PivotArray device output array of size n x batchSize that contains the pivoting
sequence of each factorization of Aarray[i] stored in a
linear fashion. If PivotArray is nil, pivoting is disabled.
infoArray device output array of size batchSize that info(=infoArray[i]) contains
the information of factorization of Aarray[i].
If info=0, the execution is successful.
If info = -j, the j-th parameter had an illegal value.
If info = k, U(k,k) is 0. The factorization has been
completed, but U is exactly singular.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.4. cublas<t>getrsBatched()
cublasStatus_t cublasSgetrsBatched(cublasHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const float *const Aarray[],
int lda,
const int *devIpiv,
float *const Barray[],
int ldb,
int *info,
int batchSize);
int n,
int nrhs,
const double *const Aarray[],
int lda,
const int *devIpiv,
double *const Barray[],
int ldb,
int *info,
int batchSize);
This function solves an array of systems of linear equations of the form
op(A[i]) * X[i] = B[i],
where each A[i] is an n x n matrix which has been LU factorized with pivoting by
cublas<t>getrfBatched, and X[i] and B[i] are n x nrhs matrices. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, A^T if trans == CUBLAS_OP_T, or A^H if trans == CUBLAS_OP_C.
This function is intended to be used for matrices of small sizes where the launch overhead is a
significant factor.
cublas<t>getrsBatched supports non-pivot LU factorization if devIpiv is nil.
cublas<t>getrsBatched supports arbitrary dimension.
cublas<t>getrsBatched only supports compute capability 2.0 or above.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
devIpiv device input array of size n x batchSize that contains the pivoting
sequence of each factorization of Aarray[i] stored in a
linear fashion. If devIpiv is nil, pivoting for all Aarray[i]
is ignored.
Barray device input/ array of pointers to <type> array, with each array of dim. n
output x nrhs with ldb>=max(1,n). Matrices Barray[i] should
not overlap; otherwise, undefined behavior is expected.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.5. cublas<t>getriBatched()
cublasStatus_t cublasSgetriBatched(cublasHandle_t handle,
int n,
const float *const Aarray[],
int lda,
int *PivotArray,
float *const Carray[],
int ldc,
int *infoArray,
int batchSize);
Aarray and Carray are arrays of pointers to matrices stored in column-major format with
dimensions n*n and leading dimension lda and ldc respectively.
This function performs the inversion of matrices A[i] for i = 0, ..., batchSize-1.
Prior to calling cublas<t>getriBatched, the matrix A[i] must be factorized first using the
routine cublas<t>getrfBatched. After the call of cublas<t>getrfBatched, the matrix pointing
by Aarray[i] will contain the LU factors of the matrix A[i] and the vector pointing by
(PivotArray+i) will contain the pivoting sequence.
Following the LU factorization, cublas<t>getriBatched uses forward and backward triangular
solvers to complete the inversion of matrices A[i] for i = 0, ..., batchSize-1. The inversion
is out-of-place, so the memory space of Carray[i] cannot overlap the memory space of Aarray[i].
Typically all parameters in cublas<t>getrfBatched would be passed into
cublas<t>getriBatched; in particular, the factorized Aarray and the PivotArray produced by the
factorization are used here unchanged.
Aarray device input array of pointers to <type> array, with each array of
dimension n*n with lda>=max(1,n).
PivotArray device output array of size n*batchSize that contains the pivoting
sequence of each factorization of Aarray[i] stored in a
linear fashion. If PivotArray is nil, pivoting is disabled.
Carray device output array of pointers to <type> array, with each array of
dimension n*n with ldc>=max(1,n). Matrices Carray[i]
should not overlap; otherwise, undefined behavior is
expected.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.6. cublas<t>matinvBatched()
cublasStatus_t cublasSmatinvBatched(cublasHandle_t handle,
int n,
const float *const A[],
int lda,
float *const Ainv[],
int lda_inv,
int *info,
int batchSize);
A and Ainv are arrays of pointers to matrices stored in column-major format with dimensions
n*n and leading dimension lda and lda_inv respectively.
This function performs the inversion of matrices A[i] for i = 0, ..., batchSize-1.
This function is a shortcut for calling cublas<t>getrfBatched followed by
cublas<t>getriBatched. However, it only works if n is less than or equal to 32; for larger n,
the user has to go through cublas<t>getrfBatched and cublas<t>getriBatched instead.
If the matrix A[i] is singular, then info[i] reports singularity, the same as
cublas<t>getrfBatched.
Ainv device output array of pointers to <type> array, with each array of
dimension n*n with lda_inv>=max(1,n). Matrices
Ainv[i] should not overlap; otherwise, undefined
behavior is expected.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.7. cublas<t>geqrfBatched()
cublasStatus_t cublasSgeqrfBatched( cublasHandle_t handle,
int m,
int n,
float *const Aarray[],
int lda,
float *const TauArray[],
int *info,
int batchSize);
int n,
double *const Aarray[],
int lda,
double *const TauArray[],
int *info,
int batchSize);
This function is intended to be used for matrices of small sizes where the launch overhead is a
significant factor.
cublas<t>geqrfBatched supports arbitrary dimension.
cublas<t>geqrfBatched only supports compute capability 2.0 or above.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
Aarray device input array of pointers to <type> array, with each array of dim. m
x n with lda>=max(1,m).
TauArray device output array of pointers to <type> vector, with each vector of dim.
max(1,min(m,n)).
info host output If info=0, the parameters passed to the function are valid.
If info<0, the parameter in position -info is invalid.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.8. cublas<t>gelsBatched()
cublasStatus_t cublasSgelsBatched( cublasHandle_t handle,
cublasOperation_t trans,
int m,
int n,
int nrhs,
float *const Aarray[],
int lda,
float *const Carray[],
int ldc,
int *info,
int *devInfoArray,
int batchSize );
On exit, each Aarray[i] is overwritten with its QR factorization and each Carray[i] is
overwritten with the least squares solution.
cublas<t>gelsBatched supports only the non-transpose operation and only solves over-
determined systems (m >= n).
cublas<t>gelsBatched only supports compute capability 2.0 or above.
This function is intended to be used for matrices of small sizes where the launch overhead is a
significant factor.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
Carray device input/ array of pointers to <type> array, with each array of dim. n
output x nrhs with ldc>=max(1,m). Matrices Carray[i] should
not overlap; otherwise, undefined behavior is expected.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.9. cublas<t>tpttr()
cublasStatus_t cublasStpttr ( cublasHandle_t handle,
cublasFillMode_t uplo,
int n,
const float *AP,
float *A,
int lda );
This function performs the conversion from the triangular packed format to the triangular
format
If uplo == CUBLAS_FILL_MODE_LOWER then the elements of AP are copied into the lower
triangular part of the triangular matrix A and the upper part of A is left untouched. If uplo ==
CUBLAS_FILL_MODE_UPPER then the elements of AP are copied into the upper triangular part
of the triangular matrix A and the lower part of A is left untouched.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
2.8.10. cublas<t>trttp()
cublasStatus_t cublasStrttp ( cublasHandle_t handle,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
float *AP );
This function performs the conversion from the triangular format to the triangular packed
format
If uplo == CUBLAS_FILL_MODE_LOWER then the lower triangular part of the triangular matrix
A is copied into the array AP. If uplo == CUBLAS_FILL_MODE_UPPER then the upper
triangular part of the triangular matrix A is copied into the array AP.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE ‣ If n < 0 or
‣ if uplo != CUBLAS_FILL_MODE_UPPER,
CUBLAS_FILL_MODE_LOWER or
2.8.11. cublas<t>gemmEx()
cublasStatus_t cublasSgemmEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float *alpha,
const void *A,
cudaDataType_t Atype,
int lda,
const void *B,
cudaDataType_t Btype,
int ldb,
const float *beta,
void *C,
cudaDataType_t Ctype,
int ldc)
cublasStatus_t cublasCgemmEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const cuComplex *alpha,
const void *A,
cudaDataType_t Atype,
int lda,
const void *B,
cudaDataType_t Btype,
int ldb,
const cuComplex *beta,
void *C,
cudaDataType_t Ctype,
int ldc)
This function is an extension of cublas<t>gemm. In this function the input matrices and
output matrices can have a lower precision but the computation is still done in the type
<t>. For example, in the type float for cublasSgemmEx and in the type cuComplex for
cublasCgemmEx.
This function performs the matrix-matrix multiplication
C = alpha * op(A) * op(B) + beta * C,
where alpha and beta are scalars, and A, B and C are matrices stored in column-major format
with dimensions op(A) m x k, op(B) k x n and C m x n, respectively. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, A^T if transa == CUBLAS_OP_T, or A^H if transa == CUBLAS_OP_C (and similarly for op(B) with transb).
beta host or device input <type> scalar used for multiplication. If beta==0, C does not
have to be a valid input.
The matrix type combinations supported for cublasSgemmEx are listed below:
C           A/B
CUDA_R_16BF CUDA_R_16BF
CUDA_R_16F  CUDA_R_16F
CUDA_R_32F  CUDA_R_8I
            CUDA_R_16BF
            CUDA_R_16F
            CUDA_R_32F
The matrix type combinations supported for cublasCgemmEx are listed below:
C           A/B
CUDA_C_32F  CUDA_C_8I
            CUDA_C_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.12. cublasGemmEx()
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *A,
cudaDataType_t Atype,
int lda,
const void *B,
cudaDataType_t Btype,
int ldb,
const void *beta,
void *C,
cudaDataType_t Ctype,
int ldc,
cublasComputeType_t computeType,
cublasGemmAlgo_t algo)
#if defined(__cplusplus)
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *A,
cudaDataType Atype,
int lda,
const void *B,
cudaDataType Btype,
int ldb,
const void *beta,
void *C,
cudaDataType Ctype,
int ldc,
cudaDataType computeType,
cublasGemmAlgo_t algo)
#endif
This function is an extension of cublas<t>gemm that allows the user to individually specify the
data types for each of the A, B and C matrices, the precision of computation and the GEMM
algorithm to be run. Supported combinations of arguments are listed further down in this
section.
Note: The second variant of cublasGemmEx function is provided for backward compatibility
with C++ applications code, where the computeType parameter is of cudaDataType instead
of cublasComputeType_t. C applications would still compile with the updated function
signature.
This function is only supported on devices with compute capability 5.0 or later.
This function performs the matrix-matrix multiplication
C = alpha * op(A) * op(B) + beta * C,
where alpha and beta are scalars, and A, B and C are matrices stored in column-major format
with dimensions op(A) m x k, op(B) k x n and C m x n, respectively. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, A^T if transa == CUBLAS_OP_T, or A^H if transa == CUBLAS_OP_C (and similarly for op(B) with transb).
alpha host or device input scaling factor for A*B of the type that corresponds to the
computeType and Ctype, see the table below for details.
beta host or device input scaling factor for C of the type that corresponds to the
computeType and Ctype, see the table below for details. If
beta==0, C does not have to be a valid input.
cublasGemmEx supports the following Compute Type, Scale Type, Atype/Btype, and Ctype
combinations:
Compute Type                   Scale Type  Atype/Btype  Ctype
CUBLAS_COMPUTE_32F             CUDA_R_32F  CUDA_R_16BF  CUDA_R_16BF
or                                         CUDA_R_16F   CUDA_R_16F
CUBLAS_COMPUTE_32F_PEDANTIC                CUDA_R_8I    CUDA_R_32F
                                           CUDA_R_16BF  CUDA_R_32F
                                           CUDA_R_16F   CUDA_R_32F
                                           CUDA_R_32F   CUDA_R_32F
                               CUDA_C_32F  CUDA_C_8I    CUDA_C_32F
                                           CUDA_C_32F   CUDA_C_32F
CUBLAS_COMPUTE_32F_FAST_16F    CUDA_R_32F  CUDA_R_32F   CUDA_R_32F
or                             CUDA_C_32F  CUDA_C_32F   CUDA_C_32F
CUBLAS_COMPUTE_32F_FAST_16BF
or
CUBLAS_COMPUTE_32F_FAST_TF32
The cublasGemmEx routine can be run with the algorithms in the following table. Note: for NVIDIA
Ampere Architecture GPUs and beyond, i.e. SM version >= 80, the algorithms below are
equivalent to CUBLAS_GEMM_DEFAULT or CUBLAS_GEMM_DEFAULT_TENSOR_OP respectively.
Specifying algorithm >= 99 for a single precision operation is equivalent to using
CUBLAS_COMPUTE_32F_FAST_16F compute type, even if math mode or compute type are
specified to be CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_FAST_TF32.
CublasGemmAlgo_t Meaning
CUBLAS_GEMM_DEFAULT Apply Heuristics to select the GEMM algorithm
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
Starting with release 11.2, using the typed functions instead of the extension functions
(cublas**Ex()) helps in reducing the binary size when linking to static cuBLAS Library.
2.8.13. cublasGemmBatchedEx()
cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *Aarray[],
cudaDataType_t Atype,
int lda,
const void *Barray[],
cudaDataType_t Btype,
int ldb,
const void *beta,
void *Carray[],
cudaDataType_t Ctype,
int ldc,
int batchCount,
cublasComputeType_t computeType,
cublasGemmAlgo_t algo)
#if defined(__cplusplus)
cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *Aarray[],
cudaDataType Atype,
int lda,
const void *Barray[],
cudaDataType Btype,
int ldb,
const void *beta,
void *Carray[],
cudaDataType Ctype,
int ldc,
int batchCount,
cudaDataType computeType,
cublasGemmAlgo_t algo)
#endif
This function performs the matrix-matrix multiplication of a batch of matrices:
C[i] = alpha * op(A[i]) * op(B[i]) + beta * C[i], for i = 0, ..., batchCount-1,
where alpha and beta are scalars, and Aarray, Barray and Carray are arrays of pointers to
matrices stored in column-major format with dimensions op(A[i]) m x k, op(B[i]) k x n and
C[i] m x n, respectively. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, A^T if transa == CUBLAS_OP_T, or A^H if transa == CUBLAS_OP_C (and similarly for op(B) with transb).
alpha host or device input scaling factor for A*B of the type that corresponds to the
computeType and Ctype, see the table below for details.
Aarray device input array of pointers to <Atype> array, with each array of dim.
lda x k with lda>=max(1,m) if transa == CUBLAS_OP_N
and lda x m with lda>=max(1,k) otherwise.
All pointers must meet certain alignment criteria. Please
see below for details.
Barray device input array of pointers to <Btype> array, with each array of dim.
ldb x n with ldb>=max(1,k) if transb == CUBLAS_OP_N
and ldb x k with ldb>=max(1,n) otherwise.
All pointers must meet certain alignment criteria. Please
see below for details.
beta host or device input scaling factor for C of the type that corresponds to the
computeType and Ctype, see the table below for details. If
beta==0, C[i] does not have to be a valid input.
Carray device in/out array of pointers to <Ctype> array. It has dimensions ldc x
n with ldc>=max(1,m). Matrices C[i] should not overlap;
otherwise, undefined behavior is expected.
All pointers must meet certain alignment criteria. Please
see below for details.
cublasGemmBatchedEx supports the following Compute Type, Scale Type (alpha and beta),
Atype/Btype, and Ctype combinations:
Compute Type                   Scale Type  Atype/Btype  Ctype
CUBLAS_COMPUTE_16F             CUDA_R_16F  CUDA_R_16F   CUDA_R_16F
or
CUBLAS_COMPUTE_16F_PEDANTIC
CUBLAS_COMPUTE_32F             CUDA_R_32F  CUDA_R_16BF  CUDA_R_16BF
or                                         CUDA_R_16F   CUDA_R_16F
CUBLAS_COMPUTE_32F_PEDANTIC                CUDA_R_8I    CUDA_R_32F
                                           CUDA_R_16BF  CUDA_R_32F
                                           CUDA_R_16F   CUDA_R_32F
                                           CUDA_R_32F   CUDA_R_32F
CUBLAS_COMPUTE_32F_FAST_16F    CUDA_R_32F  CUDA_R_32F   CUDA_R_32F
or                             CUDA_C_32F  CUDA_C_32F   CUDA_C_32F
CUBLAS_COMPUTE_32F_FAST_16BF
or
CUBLAS_COMPUTE_32F_FAST_TF32
The possible error values returned by this function and their meanings are listed below.
2.8.14. cublasGemmStridedBatchedEx()
cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *A,
cudaDataType_t Atype,
int lda,
long long int strideA,
const void *B,
cudaDataType_t Btype,
int ldb,
long long int strideB,
const void *beta,
void *C,
cudaDataType_t Ctype,
int ldc,
long long int strideC,
int batchCount,
cublasComputeType_t computeType,
cublasGemmAlgo_t algo)
#if defined(__cplusplus)
cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const void *alpha,
const void *A,
cudaDataType Atype,
int lda,
long long int strideA,
const void *B,
cudaDataType Btype,
int ldb,
long long int strideB,
const void *beta,
void *C,
cudaDataType Ctype,
int ldc,
long long int strideC,
int batchCount,
cudaDataType computeType,
cublasGemmAlgo_t algo)
#endif
This function performs the matrix-matrix multiplication of a batch of matrices:
C[i] = alpha * op(A[i]) * op(B[i]) + beta * C[i], for i = 0, ..., batchCount-1,
where alpha and beta are scalars, and A, B and C are base pointers to batches of matrices
stored in column-major format with dimensions op(A[i]) m x k, op(B[i]) k x n and C[i] m x n,
respectively; matrix A[i] is located at address A + i*strideA, and similarly for B[i] and
C[i]. Also, for matrix A:
op(A) = A if transa == CUBLAS_OP_N, A^T if transa == CUBLAS_OP_T, or A^H if transa == CUBLAS_OP_C (and similarly for op(B) with transb).
alpha host or device input scaling factor for A*B of the type that corresponds to the
computeType and Ctype, see the table below for details.
strideA input value of type long long int that gives the offset in number of
elements between A[i] and A[i+1].
strideB input value of type long long int that gives the offset in number of
elements between B[i] and B[i+1].
beta host or device input scaling factor for C of the type that corresponds to the
computeType and Ctype, see the table below for details. If
beta==0, C[i] does not have to be a valid input.
strideC input value of type long long int that gives the offset in number of
elements between C[i] and C[i+1].
cublasGemmStridedBatchedEx supports the following Compute Type, Scale Type, Atype/Btype,
and Ctype combinations:
Compute Type                   Scale Type  Atype/Btype  Ctype
CUBLAS_COMPUTE_32F             CUDA_R_32F  CUDA_R_16BF  CUDA_R_16BF
or                                         CUDA_R_16F   CUDA_R_16F
CUBLAS_COMPUTE_32F_PEDANTIC                CUDA_R_8I    CUDA_R_32F
                                           CUDA_R_16BF  CUDA_R_32F
                                           CUDA_R_16F   CUDA_R_32F
                                           CUDA_R_32F   CUDA_R_32F
                               CUDA_C_32F  CUDA_C_8I    CUDA_C_32F
                                           CUDA_C_32F   CUDA_C_32F
CUBLAS_COMPUTE_32F_FAST_16F    CUDA_R_32F  CUDA_R_32F   CUDA_R_32F
or                             CUDA_C_32F  CUDA_C_32F   CUDA_C_32F
CUBLAS_COMPUTE_32F_FAST_16BF
or
CUBLAS_COMPUTE_32F_FAST_TF32
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.15. cublasCsyrkEx()
cublasStatus_t cublasCsyrkEx(cublasHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int n,
int k,
const float *alpha,
const void *A,
cudaDataType Atype,
int lda,
const float *beta,
cuComplex *C,
cudaDataType Ctype,
int ldc)
This function is an extension of cublasCsyrk where the input matrix and output matrix can
have a lower precision but the computation is still done in the type cuComplex.
This function performs the symmetric rank-k update
C = alpha * op(A) * op(A)^T + beta * C,
where alpha and beta are scalars, C is an n x n symmetric matrix stored in lower or upper
mode, and A is a matrix with dimensions op(A) n x k. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, or A^T if trans == CUBLAS_OP_T.
Note: This routine is only supported on GPUs with architecture capabilities equal to or greater
than 5.0.
uplo input indicates if matrix C lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The matrix types combinations supported for cublasCsyrkEx are listed below :
A C
CUDA_C_8I CUDA_C_32F
CUDA_C_32F CUDA_C_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.16. cublasCsyrk3mEx()
cublasStatus_t cublasCsyrk3mEx(cublasHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int n,
int k,
const float *alpha,
const void *A,
cudaDataType Atype,
int lda,
const float *beta,
cuComplex *C,
cudaDataType Ctype,
int ldc)
This function is an extension of cublasCsyrk where the input matrix and output matrix can
have a lower precision but the computation is still done in the type cuComplex. This routine is
implemented using the Gauss complexity reduction algorithm, which can lead to an increase in
performance of up to 25%.
This function performs the symmetric rank-k update
C = alpha * op(A) * op(A)^T + beta * C,
where alpha and beta are scalars, C is an n x n symmetric matrix stored in lower or upper
mode, and A is a matrix with dimensions op(A) n x k. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, or A^T if trans == CUBLAS_OP_T.
Note: This routine is only supported on GPUs with architecture capabilities equal to or greater
than 5.0.
uplo input indicates if matrix C lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host or device input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The matrix types combinations supported for cublasCsyrk3mEx are listed below :
A C
CUDA_C_8I CUDA_C_32F
CUDA_C_32F CUDA_C_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.17. cublasCherkEx()
cublasStatus_t cublasCherkEx(cublasHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int n,
int k,
const float *alpha,
const void *A,
cudaDataType Atype,
int lda,
const float *beta,
cuComplex *C,
cudaDataType Ctype,
int ldc)
This function is an extension of cublasCherk where the input matrix and output matrix can
have a lower precision but the computation is still done in the type cuComplex.
This function performs the Hermitian rank-k update
C = alpha * op(A) * op(A)^H + beta * C,
where alpha and beta are real scalars, C is an n x n Hermitian matrix stored in lower or
upper mode, and A is a matrix with dimensions op(A) n x k. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, or A^H if trans == CUBLAS_OP_C.
Note: This routine is only supported on GPUs with architecture capabilities equal to or greater
than 5.0.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The matrix types combinations supported for cublasCherkEx are listed below :
A C
CUDA_C_8I CUDA_C_32F
CUDA_C_32F CUDA_C_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.18. cublasCherk3mEx()
cublasStatus_t cublasCherk3mEx(cublasHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int n,
int k,
const float *alpha,
const void *A,
cudaDataType Atype,
int lda,
const float *beta,
cuComplex *C,
cudaDataType Ctype,
int ldc)
This function is an extension of cublasCherk where the input matrix and output matrix can
have a lower precision but the computation is still done in the type cuComplex. This routine is
implemented using the Gauss complexity reduction algorithm, which can lead to an increase in
performance of up to 25%.
This function performs the Hermitian rank-k update
C = alpha * op(A) * op(A)^H + beta * C,
where alpha and beta are real scalars, C is an n x n Hermitian matrix stored in lower or
upper mode, and A is a matrix with dimensions op(A) n x k. Also, for matrix A:
op(A) = A if trans == CUBLAS_OP_N, or A^H if trans == CUBLAS_OP_C.
Note: This routine is only supported on GPUs with architecture capabilities equal to or greater
than 5.0.
beta input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The matrix types combinations supported for cublasCherk3mEx are listed below :
A C
CUDA_C_8I CUDA_C_32F
CUDA_C_32F CUDA_C_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.19. cublasNrm2Ex()
cublasStatus_t cublasNrm2Ex( cublasHandle_t handle,
int n,
const void *x,
cudaDataType xType,
int incx,
void *result,
cudaDataType resultType,
cudaDataType executionType)
This function is an API generalization of the routine cublas<t>nrm2 where input data, output
data and compute type can be specified independently.
This function computes the Euclidean norm of the vector x. The code uses a multiphase model
of accumulation to avoid intermediate underflow and overflow, with the result being equivalent
to sqrt(sum_{i=1}^{n} (x[j])^2) where j = 1 + (i - 1) * incx in exact arithmetic. Notice that the last equation
reflects 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The datatype combinations currently supported for cublasNrm2Ex are listed below:
x result execution
CUDA_R_16F CUDA_R_16F CUDA_R_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
CUBLAS_STATUS_INVALID_VALUE result == NULL
2.8.20. cublasAxpyEx()
cublasStatus_t cublasAxpyEx (cublasHandle_t handle,
int n,
const void *alpha,
cudaDataType alphaType,
const void *x,
cudaDataType xType,
int incx,
void *y,
cudaDataType yType,
int incy,
cudaDataType executiontype);
This function is an API generalization of the routine cublas<t>axpy where input data, output
data and compute type can be specified independently.
This function multiplies the vector x by the scalar alpha and adds it to the vector y, overwriting
the latter with the result. Hence, the performed operation is y[j] = alpha * x[k] + y[j] for
i = 1, ..., n, k = 1 + (i - 1) * incx and j = 1 + (i - 1) * incy. Notice that the last two equations
reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The datatype combinations currently supported for cublasAxpyEx are listed below:
x y execution
CUDA_R_16F CUDA_R_16F CUDA_R_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.21. cublasDotEx()
cublasStatus_t cublasDotEx (cublasHandle_t handle,
int n,
const void *x,
cudaDataType xType,
int incx,
const void *y,
cudaDataType yType,
int incy,
void *result,
cudaDataType resultType,
cudaDataType executionType);
These functions are an API generalization of the routines cublas<t>dot and cublas<t>dotc
where input data, output data and compute type can be specified independently. Note:
cublas<t>dotc is dot product conjugated, cublas<t>dotu is dot product unconjugated.
This function computes the dot product of vectors x and y. Hence, the result is
sum_{i=1}^{n} (x[k] * y[j]) where k = 1 + (i - 1) * incx and j = 1 + (i - 1) * incy. Notice that in the first equation the
conjugate of the element of vector x should be used if the function name ends in character ‘c’
and that the last two equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
result host or device output the resulting dot product, which is 0.0 if n<=0.
The datatype combinations currently supported for cublasDotEx and cublasDotcEx are
listed below:
x y result execution
CUDA_R_16F CUDA_R_16F CUDA_R_16F CUDA_R_32F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.22. cublasRotEx()
This function is an extension to the routine cublas<t>rot where input data, output data,
cosine/sine type, and compute type can be specified independently.
This function applies the Givens rotation matrix (i.e., rotation in the x,y plane counter-clockwise by
the angle defined by cos(alpha) = c, sin(alpha) = s):
G = |  c  s |
    | -s  c |
to vectors x and y.
Hence, the result is x[k] = c * x[k] + s * y[j] and y[j] = -s * x[k] + c * y[j] where
k = 1 + (i - 1) * incx and j = 1 + (i - 1) * incy. Notice that the last two equations reflect 1-based
indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The datatype combinations currently supported for cublasRotEx are listed below:
executionType xType / yType csType
CUDA_R_32F CUDA_R_16BF CUDA_R_16BF
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
2.8.23. cublasScalEx()
cublasStatus_t cublasScalEx(cublasHandle_t handle,
int n,
const void *alpha,
cudaDataType alphaType,
void *x,
cudaDataType xType,
int incx,
cudaDataType executionType);
This function scales the vector x by the scalar alpha and overwrites it with the result. Hence, the
performed operation is x[j] = alpha * x[j] for i = 1, ..., n and j = 1 + (i - 1) * incx. Notice that the
last two equations reflect 1-based indexing used for compatibility with Fortran.
Param. Memory In/out Meaning
handle input handle to the cuBLAS library context.
The datatype combinations currently supported for cublasScalEx are listed below:
x execution
CUDA_R_16F CUDA_R_32F
CUDA_R_32F CUDA_R_32F
CUDA_R_64F CUDA_R_64F
CUDA_C_32F CUDA_C_32F
CUDA_C_64F CUDA_C_64F
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
‣ "3" - Hints - hints that can potentially improve the application's performance
‣ "4" - Info - provides general information about the library execution, may contain details
about heuristic status
‣ "5" - API Trace - API calls will log their parameter and important information
CUBLASLT_LOG_MASK=<mask> - while mask is a combination of the following masks:
‣ "0" - Off
‣ "1" - Error
‣ "2" - Trace
‣ "4" - Hints
‣ "8" - Info
‣ "16" - API Trace
CUBLASLT_LOG_FILE=<file_name> - where file name is a path to a logging file. The file name may
contain %i, which will be replaced with the process id, e.g. "<file_name>_%i.log".
If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
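A minimal environment setup might look like this ("my_app" below is a placeholder for any application linked against cuBLASLt):

```shell
# Enable hint-level logging and write per-process log files.
export CUBLASLT_LOG_LEVEL=3                   # 3 = Hints
export CUBLASLT_LOG_FILE="cublaslt_%i.log"    # %i expands to the process id
# ./my_app                                    # placeholder application
```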
Another option is to use the experimental cublasLt logging API. See:
cublasLtLoggerSetCallback(), cublasLtLoggerSetFile(), cublasLtLoggerOpenFile(),
cublasLtLoggerSetLevel(), cublasLtLoggerSetMask(), cublasLtLoggerForceDisable()
3.3.2. cublasLtEpilogue_t
The cublasLtEpilogue_t is an enum type to set the postprocessing options for the epilogue.
Value Description
CUBLASLT_EPILOGUE_DEFAULT = 1: No special postprocessing, just scale and quantize the results if necessary.
CUBLASLT_EPILOGUE_RELU = 2: Apply ReLU point-wise transform to the results (x := max(x, 0)).
CUBLASLT_EPILOGUE_RELU_AUX = (CUBLASLT_EPILOGUE_RELU | 128): Apply ReLU point-wise transform to the results (x := max(x, 0)). This epilogue mode produces an extra output, see CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BIAS = 4: Apply (broadcast) bias from the bias vector. The bias vector length must match the number of rows of matrix D, and it must be packed (i.e., the stride between vector elements is 1). The bias vector is broadcast to all columns and added before applying the final postprocessing.
CUBLASLT_EPILOGUE_RELU_BIAS = (CUBLASLT_EPILOGUE_RELU | CUBLASLT_EPILOGUE_BIAS): Apply bias and then the ReLU transform.
CUBLASLT_EPILOGUE_RELU_AUX_BIAS = (CUBLASLT_EPILOGUE_RELU_AUX | CUBLASLT_EPILOGUE_BIAS): Apply bias and then the ReLU transform. This epilogue mode produces an extra output, see CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DRELU_BGRAD = 8 | 16 | 128: Apply ReLU and bias gradients independently to the matmul output. Store the ReLU gradient in the output matrix, and the bias gradient in the bias buffer (see CUBLASLT_MATMUL_DESC_BIAS_POINTER). This epilogue mode requires an extra input, see CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_GELU = 32: Apply GELU point-wise transform to the results (x := GELU(x)).
CUBLASLT_EPILOGUE_GELU_AUX = (CUBLASLT_EPILOGUE_GELU | 128): Apply GELU point-wise transform to the results (x := GELU(x)). This epilogue mode outputs the GELU input as a separate matrix (useful for training). See CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_GELU_BIAS = (CUBLASLT_EPILOGUE_GELU | CUBLASLT_EPILOGUE_BIAS): Apply bias and then the GELU transform (see note 1).
CUBLASLT_EPILOGUE_GELU_AUX_BIAS = (CUBLASLT_EPILOGUE_GELU_AUX | CUBLASLT_EPILOGUE_BIAS): Apply bias and then the GELU transform (see note 1). This epilogue mode outputs the GELU input as a separate matrix (useful for training). See CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DGELU_BGRAD = 16 | 64 | 128: Apply GELU and bias gradients independently to the matmul output. Store the GELU gradient in the output matrix, and the bias gradient in the bias buffer (see CUBLASLT_MATMUL_DESC_BIAS_POINTER). This epilogue mode requires an extra input, see CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BGRADA = 256: Apply bias gradient to the input matrix A. The bias size corresponds to the number of rows of matrix D. The reduction happens over the GEMM's "k" dimension. Store the bias gradient in the bias buffer, see CUBLASLT_MATMUL_DESC_BIAS_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BGRADB = 512: Apply bias gradient to the input matrix B. The bias size corresponds to the number of columns of matrix D. The reduction happens over the GEMM's "k" dimension. Store the bias gradient in the bias buffer, see CUBLASLT_MATMUL_DESC_BIAS_POINTER of cublasLtMatmulDescAttributes_t.
NOTES:
1. GELU (Gaussian Error Linear Unit) is approximated by: GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
3.3.3. cublasLtHandle_t
The cublasLtHandle_t type is a pointer type to an opaque structure holding the cuBLASLt
library context. Use the below functions to manipulate this library context:
cublasLtCreate():
To initialize the cuBLASLt library context and return a handle to an opaque structure
holding the cuBLASLt library context.
cublasLtDestroy():
To destroy a previously created cuBLASLt library context descriptor and release the
resources.
3.3.4. cublasLtLoggerCallback_t
cublasLtLoggerCallback_t is a callback function pointer type.
Parameters:
cublasLtLoggerSetCallback()
3.3.5. cublasLtMatmulAlgo_t
cublasLtMatmulAlgo_t is an opaque structure holding the description of the matrix
multiplication algorithm. This structure can be trivially serialized and later restored for use
with the same version of cuBLAS library to save on selecting the right configuration again.
3.3.6. cublasLtMatmulAlgoCapAttributes_t
cublasLtMatmulAlgoCapAttributes_t enumerates matrix multiplication algorithm
capability attributes that can be retrieved from an initialized cublasLtMatmulAlgo_t descriptor.
CUBLASLT_ALGO_CAP_SPLITK_SUPPORT (int32_t): Support for split-K. Boolean (0 or 1) expressing whether a split-K implementation is supported. 0 means no support; any other value means supported. See CUBLASLT_ALGO_CONFIG_SPLITK_NUM of cublasLtMatmulAlgoConfigAttributes_t.
CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK (uint32_t): Mask expressing the types of reduction schemes supported, see cublasLtReductionScheme_t. If a reduction scheme is not masked out, then it is supported. For example:
int isReductionSchemeComputeTypeSupported = (reductionSchemeMask & CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE) == CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE ? 1 : 0;
CUBLASLT_ALGO_CAP_CTA_SWIZZLING_SUPPORT (uint32_t): Support for CTA-swizzling. Boolean (0 or 1) expressing whether a CTA-swizzling implementation is supported. 0 means no support, 1 means supported; other values are reserved. See also CUBLASLT_ALGO_CONFIG_CTA_SWIZZLING of cublasLtMatmulAlgoConfigAttributes_t.
CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT (int32_t): Support for strided batch. 0 means no support; any other value means supported.
CUBLASLT_ALGO_CAP_OUT_OF_PLACE_RESULT_SUPPORT (int32_t): Support for out-of-place results (D != C in D = alpha*A*B + beta*C). 0 means no support; any other value means supported.
3.3.7. cublasLtMatmulAlgoConfigAttributes_t
cublasLtMatmulAlgoConfigAttributes_t is an enumerated type that contains the
configuration attributes for the matrix multiply algorithms. These configuration attributes are
algorithm-specific, and can be set. The attributes configuration of a given algorithm should lie
within the boundaries expressed by its capability attributes.
CUBLASLT_ALGO_CONFIG_ID (int32_t): Read-only attribute. Algorithm index. See cublasLtMatmulAlgoGetIds(). Set by cublasLtMatmulAlgoInit().
CUBLASLT_ALGO_CONFIG_TILE_ID (uint32_t): Tile id. See cublasLtMatmulTile_t. Default: CUBLASLT_MATMUL_TILE_UNDEFINED.
CUBLASLT_ALGO_CONFIG_STAGES_ID (uint32_t): Stages id. See cublasLtMatmulStages_t. Default: CUBLASLT_MATMUL_STAGES_UNDEFINED.
3.3.8. cublasLtMatmulDesc_t
The cublasLtMatmulDesc_t is a pointer to an opaque structure holding the description of the
matrix multiplication operation cublasLtMatmul(). Use the below functions to manipulate
this descriptor:
cublasLtMatmulDescCreate():
To create one instance of the descriptor.
cublasLtMatmulDescDestroy():
To destroy a previously created descriptor and release the resources.
3.3.9. cublasLtMatmulDescAttributes_t
cublasLtMatmulDescAttributes_t is a descriptor structure containing the attributes that
define the specifics of the matrix multiply operation.
CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_BATCH_STRIDE
Batch stride for epilogue auxiliary buffer. int64_t
3.3.10. cublasLtMatmulHeuristicResult_t
cublasLtMatmulHeuristicResult_t is a descriptor that holds the configured matrix
multiplication algorithm descriptor and its runtime properties.
Member Description
cublasLtMatmulAlgo_t algo Must be initialized with
cublasLtMatmulAlgoInit() if the preference
CUBLASLT_MATMUL_PERF_SEARCH_MODE
is set to
CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID. See
cublasLtMatmulSearch_t.
size_t workspaceSize; Actual size of workspace memory required.
cublasStatus_t state; Result status. Other fields
are valid only if, after call to
cublasLtMatmulAlgoGetHeuristic(), this
member is set to CUBLAS_STATUS_SUCCESS.
float wavesCount; Waves count is a device utilization metric. A
wavesCount value of 1.0f suggests that when the
kernel is launched it will fully occupy the GPU.
int reserved[4]; Reserved.
3.3.11. cublasLtMatmulPreference_t
The cublasLtMatmulPreference_t is a pointer to an opaque structure holding the
description of the preferences for cublasLtMatmulAlgoGetHeuristic() configuration. Use
the below functions to manipulate this descriptor:
cublasLtMatmulPreferenceCreate():
To create one instance of the descriptor.
cublasLtMatmulPreferenceDestroy():
To destroy a previously created descriptor and release the resources.
3.3.12. cublasLtMatmulPreferenceAttributes_t
cublasLtMatmulPreferenceAttributes_t is an enumerated type used to apply algorithm
search preferences while fine-tuning the heuristic function.
CUBLASLT_MATMUL_PREF_SEARCH_MODE (uint32_t): Search mode. See cublasLtMatmulSearch_t. Default is CUBLASLT_SEARCH_BEST_FIT.
CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES (uint64_t): Maximum allowed workspace memory. Default is 0 (no workspace memory allowed).
CUBLASLT_MATMUL_PREF_MATH_MODE_MASK (uint32_t): Math mode mask. See cublasMath_t. Only algorithms with CUBLASLT_ALGO_CAP_MATHMODE_IMPL that is not masked out by this attribute are allowed. Default is 1 (allows both default and tensor op math). DEPRECATED, will be removed in a future release; see cublasLtNumericalImplFlags_t for a replacement.
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK (uint32_t): Reduction scheme mask. See cublasLtReductionScheme_t. Only algorithm configurations specifying CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME that is not masked out by this attribute are allowed. For example, a mask value of 0x03 will allow only INPLACE and COMPUTE_TYPE reduction schemes. Default is CUBLASLT_REDUCTION_SCHEME_MASK (i.e., allows all reduction schemes).
CUBLASLT_MATMUL_PREF_GAUSSIAN_MODE_MASK (uint32_t): Gaussian mode mask. See cublasLt3mMode_t. Only algorithms with CUBLASLT_ALGO_CAP_GAUSSIAN_IMPL that is not masked out by this attribute are allowed. Default is CUBLASLT_3M_MODE_ALLOWED (i.e., allows both Gaussian and regular math). DEPRECATED, will be removed in a future release; see cublasLtNumericalImplFlags_t for a replacement.
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES (uint32_t): Minimum buffer alignment for matrix A (in bytes). Selecting a smaller value will
3.3.13. cublasLtMatmulSearch_t
cublasLtMatmulSearch_t is an enumerated type that contains the attributes for heuristics
search type.
Value Description
CUBLASLT_SEARCH_BEST_FIT: Request heuristics for the best algorithm for the given use case.
CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID: Request heuristics only for the pre-configured algo id.
3.3.14. cublasLtMatmulTile_t
cublasLtMatmulTile_t is an enumerated type used to set the tile size in rows x columns.
See also CUTLASS: Fast Linear Algebra in CUDA C++.
Value Description
CUBLASLT_MATMUL_TILE_UNDEFINED Tile size is undefined.
CUBLASLT_MATMUL_TILE_8x8 Tile size is 8 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x16 Tile size is 8 rows x 16 columns.
CUBLASLT_MATMUL_TILE_16x8 Tile size is 16 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x32 Tile size is 8 rows x 32 columns.
CUBLASLT_MATMUL_TILE_16x16 Tile size is 16 rows x 16 columns.
CUBLASLT_MATMUL_TILE_32x8 Tile size is 32 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x64 Tile size is 8 rows x 64 columns.
CUBLASLT_MATMUL_TILE_16x32 Tile size is 16 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x16 Tile size is 32 rows x 16 columns.
CUBLASLT_MATMUL_TILE_64x8 Tile size is 64 rows x 8 columns.
CUBLASLT_MATMUL_TILE_32x32 Tile size is 32 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x64 Tile size is 32 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x32 Tile size is 64 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x128 Tile size is 32 rows x 128 columns.
CUBLASLT_MATMUL_TILE_64x64 Tile size is 64 rows x 64 columns.
CUBLASLT_MATMUL_TILE_128x32 Tile size is 128 rows x 32 columns.
CUBLASLT_MATMUL_TILE_64x128 Tile size is 64 rows x 128 columns.
CUBLASLT_MATMUL_TILE_128x64 Tile size is 128 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x256 Tile size is 64 rows x 256 columns.
CUBLASLT_MATMUL_TILE_128x128 Tile size is 128 rows x 128 columns.
CUBLASLT_MATMUL_TILE_256x64 Tile size is 256 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x512 Tile size is 64 rows x 512 columns.
CUBLASLT_MATMUL_TILE_128x256 Tile size is 128 rows x 256 columns.
CUBLASLT_MATMUL_TILE_256x128 Tile size is 256 rows x 128 columns.
CUBLASLT_MATMUL_TILE_512x64 Tile size is 512 rows x 64 columns.
3.3.15. cublasLtMatmulStages_t
cublasLtMatmulStages_t is an enumerated type used to configure the size and number of
shared memory buffers where input elements are staged. The number of staging buffers defines
the kernel's pipeline depth.
Value Description
CUBLASLT_MATMUL_STAGES_UNDEFINED Stage size is undefined.
CUBLASLT_MATMUL_STAGES_16x1 Stage size is 16, number of stages is 1.
CUBLASLT_MATMUL_STAGES_16x2 Stage size is 16, number of stages is 2.
CUBLASLT_MATMUL_STAGES_16x3 Stage size is 16, number of stages is 3.
CUBLASLT_MATMUL_STAGES_16x4 Stage size is 16, number of stages is 4.
CUBLASLT_MATMUL_STAGES_16x5 Stage size is 16, number of stages is 5.
CUBLASLT_MATMUL_STAGES_16x6 Stage size is 16, number of stages is 6.
CUBLASLT_MATMUL_STAGES_32x1 Stage size is 32, number of stages is 1.
CUBLASLT_MATMUL_STAGES_32x2 Stage size is 32, number of stages is 2.
CUBLASLT_MATMUL_STAGES_32x3 Stage size is 32, number of stages is 3.
CUBLASLT_MATMUL_STAGES_32x4 Stage size is 32, number of stages is 4.
CUBLASLT_MATMUL_STAGES_32x5 Stage size is 32, number of stages is 5.
CUBLASLT_MATMUL_STAGES_32x6 Stage size is 32, number of stages is 6.
CUBLASLT_MATMUL_STAGES_64x1 Stage size is 64, number of stages is 1.
CUBLASLT_MATMUL_STAGES_64x2 Stage size is 64, number of stages is 2.
CUBLASLT_MATMUL_STAGES_64x3 Stage size is 64, number of stages is 3.
CUBLASLT_MATMUL_STAGES_64x4 Stage size is 64, number of stages is 4.
CUBLASLT_MATMUL_STAGES_64x5 Stage size is 64, number of stages is 5.
CUBLASLT_MATMUL_STAGES_64x6 Stage size is 64, number of stages is 6.
CUBLASLT_MATMUL_STAGES_128x1 Stage size is 128, number of stages is 1.
CUBLASLT_MATMUL_STAGES_128x2 Stage size is 128, number of stages is 2.
CUBLASLT_MATMUL_STAGES_128x3 Stage size is 128, number of stages is 3.
CUBLASLT_MATMUL_STAGES_128x4 Stage size is 128, number of stages is 4.
CUBLASLT_MATMUL_STAGES_128x5 Stage size is 128, number of stages is 5.
CUBLASLT_MATMUL_STAGES_128x6 Stage size is 128, number of stages is 6.
CUBLASLT_MATMUL_STAGES_32x10 Stage size is 32, number of stages is 10.
CUBLASLT_MATMUL_STAGES_8x4 Stage size is 8, number of stages is 4.
CUBLASLT_MATMUL_STAGES_16x10 Stage size is 16, number of stages is 10.
3.3.16. cublasLtNumericalImplFlags_t
cublasLtNumericalImplFlags_t is a set of bit flags that can be specified to select
implementation details that may affect the numerical behavior of algorithms.
The flags below can be combined using the bitwise OR operator "|".
CUBLASLT_NUMERICAL_IMPL_FLAGS_FMA: Specifies that the implementation is based on the [H,F,D]FMA (fused multiply-add) family of instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_HMMA: Specifies that the implementation is based on the HMMA (tensor operation) family of instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_IMMA: Specifies that the implementation is based on the IMMA (integer tensor operation) family of instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_DMMA: Specifies that the implementation is based on the DMMA (double precision tensor operation) family of instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_TENSOR_OP_MASK: Mask to filter implementations using any of the above kinds of tensor operations.
CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_TYPE_MASK: Mask to filter implementation details about the multiply-accumulate instructions used.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_16F: Specifies that the implementation's inner dot product uses a half precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32F: Specifies that the implementation's inner dot product uses a single precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_64F: Specifies that the implementation's inner dot product uses a double precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32I: Specifies that the implementation's inner dot product uses a 32-bit signed integer accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_TYPE_MASK: Mask to filter implementation details about the accumulator used.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16F: Specifies that the implementation's inner dot product multiply-accumulate instruction uses half-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16BF: Specifies that the implementation's inner dot product multiply-accumulate instruction uses bfloat16 inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_TF32: Specifies that the implementation's inner dot product multiply-accumulate instruction uses TF32 inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_32F: Specifies that the implementation's inner dot product multiply-accumulate instruction uses single-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_64F: Specifies that the implementation's inner dot product multiply-accumulate instruction uses double-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_8I: Specifies that the implementation's inner dot product multiply-accumulate instruction uses 8-bit integer inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_INPUT_TYPE_MASK: Mask to filter implementation details about the accumulator input used.
CUBLASLT_NUMERICAL_IMPL_FLAGS_GAUSSIAN: Specifies that the implementation applies the Gauss complexity reduction algorithm to reduce the arithmetic complexity of the complex matrix multiplication problem.
3.3.17. cublasLtMatrixLayout_t
The cublasLtMatrixLayout_t is a pointer to an opaque structure holding the description of
a matrix layout. Use the below functions to manipulate this descriptor:
cublasLtMatrixLayoutCreate():
To create one instance of the descriptor.
cublasLtMatrixLayoutDestroy():
To destroy a previously created descriptor and release the resources.
3.3.18. cublasLtMatrixLayoutAttribute_t
cublasLtMatrixLayoutAttribute_t is a descriptor structure containing the attributes that
define the details of the matrix operation.
CUBLASLT_MATRIX_LAYOUT_TYPE (uint32_t): Specifies the data precision type. See cudaDataType_t.
CUBLASLT_MATRIX_LAYOUT_ORDER (int32_t): Specifies the memory order of the data of the matrix. Default value is CUBLASLT_ORDER_COL. See cublasLtOrder_t.
CUBLASLT_MATRIX_LAYOUT_ROWS (uint64_t): Describes the number of rows in the matrix. Normally only values that can be expressed as int32_t are supported.
CUBLASLT_MATRIX_LAYOUT_COLS (uint64_t): Describes the number of columns in the matrix. Normally only values that can be expressed as int32_t are supported.
CUBLASLT_MATRIX_LAYOUT_LD (int64_t): The leading dimension of the matrix. For CUBLASLT_ORDER_COL this is the stride (in elements) of the matrix column. See also cublasLtOrder_t.
CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT (int32_t): Number of matmul operations to perform in the batch. Default value is 1. See also CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT in cublasLtMatmulAlgoCapAttributes_t.
CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET (int64_t): Stride (in elements) to the next matrix for the strided batch operation. Default value is 0. When the matrix type is planar-complex (CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET != 0), the batch stride is interpreted by cublasLtMatmul() in number of real-valued sub-elements. E.g., for data of type CUDA_C_16F, an offset of 1024B is encoded as a stride of value 512 (since each element of the real and imaginary matrices is a 2B (16-bit) floating point type). NOTE: A bug in cublasLtMatrixTransform() causes it to interpret the batch stride for a planar-complex matrix as if it was specified in number of complex elements. Therefore an offset of 1024B must be encoded as a stride value of 256 when calling cublasLtMatrixTransform() (each complex element is 4B, with real and imaginary values 2B each). This behavior is expected to be corrected in the next major cuBLAS version.
CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET (int64_t): Stride (in bytes) to the imaginary plane for planar complex layout. Default value is 0, indicating that the layout is regular (real and imaginary parts of complex numbers are interleaved in memory for each element).
3.3.19. cublasLtMatrixTransformDesc_t
The cublasLtMatrixTransformDesc_t is a pointer to an opaque structure holding the
description of a matrix transformation operation. Use the below functions to manipulate this
descriptor:
cublasLtMatrixTransformDescCreate():
To create one instance of the descriptor.
cublasLtMatrixTransformDescDestroy():
To destroy a previously created descriptor and release the resources.
3.3.20. cublasLtMatrixTransformDescAttributes_t
cublasLtMatrixTransformDescAttributes_t is a descriptor structure containing the
attributes that define the specifics of the matrix transform operation.
CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE (int32_t): Scale type. Inputs are converted to the scale type for scaling and summation, and results are then converted to the output type to store in memory. For the supported data types see cudaDataType_t.
CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE (int32_t): Specifies whether the scalars alpha and beta are passed by reference on the host or on the device. Default value is CUBLASLT_POINTER_MODE_HOST (i.e., on the host). See cublasLtPointerMode_t.
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA (int32_t): Specifies the type of operation that should be performed on matrix A. Default value is CUBLAS_OP_N (i.e., non-transpose operation). See cublasOperation_t.
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB (int32_t): Specifies the type of operation that should be performed on matrix B. Default value is CUBLAS_OP_N (i.e., non-transpose operation). See cublasOperation_t.
3.3.21. cublasLtOrder_t
cublasLtOrder_t is an enumerated type used to indicate the data ordering of the matrix.
3.3.22. cublasLtPointerMode_t
cublasLtPointerMode_t is an enumerated type used to set the pointer mode for the scaling
factors alpha and beta.
Value Description
CUBLASLT_POINTER_MODE_HOST = CUBLAS_POINTER_MODE_HOST: Matches CUBLAS_POINTER_MODE_HOST; the pointer targets a single value in host memory.
CUBLASLT_POINTER_MODE_DEVICE = CUBLAS_POINTER_MODE_DEVICE: Matches CUBLAS_POINTER_MODE_DEVICE; the pointer targets a single value in device memory.
CUBLASLT_POINTER_MODE_DEVICE_VECTOR = 2: Pointers target device memory vectors of length equal to the number of rows of matrix D.
CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO = 3: The alpha pointer targets a device memory vector of length equal to the number of rows of matrix D, and beta is zero.
CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST = 4: The alpha pointer targets a device memory vector of length equal to the number of rows of matrix D, and beta is a single value in host memory.
3.3.23. cublasLtPointerModeMask_t
cublasLtPointerModeMask_t is an enumerated type used to define and query the pointer
mode capability.
Value Description
CUBLASLT_POINTER_MODE_MASK_HOST = 1: See CUBLASLT_POINTER_MODE_HOST in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_DEVICE = 2: See CUBLASLT_POINTER_MODE_DEVICE in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_DEVICE_VECTOR = 4: See CUBLASLT_POINTER_MODE_DEVICE_VECTOR in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_ZERO = 8: See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_HOST = 16: See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST in cublasLtPointerMode_t.
3.3.24. cublasLtReductionScheme_t
cublasLtReductionScheme_t is an enumerated type used to specify a reduction scheme for
the portions of the dot-product calculated in parallel (i.e., "split - K").
Value Description
CUBLASLT_REDUCTION_SCHEME_NONE Do not apply reduction. The dot-product will be
performed in one sequence.
CUBLASLT_REDUCTION_SCHEME_INPLACE Reduction is performed "in place" using the
output buffer, parts are added up in the output
data type. Workspace is only used for counters
that guarantee sequentiality.
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE Reduction done out of place in a user-provided
workspace. The intermediate results are stored
in the compute type in the workspace and
reduced in a separate step.
CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE Reduction done out of place in a user-provided
workspace. The intermediate results are stored
in the output type in the workspace and reduced
in a separate step.
CUBLASLT_REDUCTION_SCHEME_MASK Allows all reduction schemes.
3.4.1. cublasLtCreate()
cublasStatus_t cublasLtCreate(cublasLtHandle_t *lightHandle)
This function initializes the cuBLASLt library and creates a handle to an opaque structure
holding the cuBLASLt library context. It allocates light hardware resources on the host and
device, and must be called prior to making any other cuBLASLt library calls.
The cuBLASLt library context is tied to the current CUDA device. To use the library on multiple
devices, one cuBLASLt handle should be created for each device.
Parameters:
Parameter Memory Input / Output Description
lightHandle Output Pointer to the allocated
cuBLASLt handle for the created
cuBLASLt context.
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS The allocation completed successfully.
CUBLAS_STATUS_NOT_INITIALIZED The cuBLASLt library was not initialized. This
usually happens:
- when cublasLtCreate() is not called first,
- when there is an error in the CUDA Runtime API called by the
cuBLASLt routine, or
- when there is an error in the hardware setup.
CUBLAS_STATUS_ALLOC_FAILED Resource allocation failed inside the cuBLASLt
library. This is usually caused by a cudaMalloc()
failure.
To correct: prior to the function call, deallocate
the previously allocated memory as much as
possible.
3.4.2. cublasLtDestroy()
cublasStatus_t
cublasLtDestroy(cublasLtHandle_t lightHandle)
This function releases hardware resources used by the cuBLASLt library. This function
is usually the last call with a particular handle to the cuBLASLt library. Because
cublasLtCreate() allocates some internal resources and the release of those resources
by calling cublasLtDestroy() implicitly calls cudaDeviceSynchronize(), it is
recommended to minimize the number of cublasLtCreate()/cublasLtDestroy()
occurrences.
Parameters:
Returns:
3.4.3. cublasLtGetCudartVersion()
size_t cublasLtGetCudartVersion(void);
This function returns the version number of the CUDA Runtime library.
Parameters: None.
Returns: size_t - The version number of the CUDA Runtime library.
3.4.4. cublasLtGetProperty()
cublasStatus_t cublasLtGetProperty(libraryPropertyType type, int *value);
This function returns the value of the requested property by writing it to the memory location
pointed to by the value parameter.
Parameters:
Returns:
Return Value Meaning
CUBLAS_STATUS_SUCCESS The requested libraryPropertyType
information is successfully written at the provided
address.
CUBLAS_STATUS_INVALID_VALUE If an invalid value of the type input argument was supplied, or value is NULL.
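As a sketch, the version components of the linked library can be queried with the standard libraryPropertyType enumerants (MAJOR_VERSION, MINOR_VERSION, and PATCH_LEVEL from library_types.h):

```c
#include <cublasLt.h>
#include <library_types.h>
#include <stdio.h>

int main(void) {
    int major, minor, patch;

    /* Each call writes one component of the version to the given address. */
    if (cublasLtGetProperty(MAJOR_VERSION, &major) == CUBLAS_STATUS_SUCCESS &&
        cublasLtGetProperty(MINOR_VERSION, &minor) == CUBLAS_STATUS_SUCCESS &&
        cublasLtGetProperty(PATCH_LEVEL, &patch) == CUBLAS_STATUS_SUCCESS) {
        printf("cuBLASLt version %d.%d.%d\n", major, minor, patch);
    }
    return 0;
}
```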
3.4.5. cublasLtGetStatusName()
const char* cublasLtGetStatusName(cublasStatus_t status);
Returns the string representation of a given status.
3.4.6. cublasLtGetStatusString()
const char* cublasLtGetStatusString(cublasStatus_t status);
Returns the description string for a given status.
3.4.7. cublasLtGetVersion()
size_t cublasLtGetVersion(void);
This function returns the version number of the cuBLASLt library.
3.4.8. cublasLtLoggerSetCallback()
cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback);
Experimental: This function sets the logging callback function.
Returns:
3.4.9. cublasLtLoggerSetFile()
cublasStatus_t cublasLtLoggerSetFile(FILE* file);
Experimental: This function sets the logging output file. Note: once registered using this
function call, the provided file handle must not be closed unless the function is called again to
switch to a different file handle.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If logging file was successfully set.
3.4.10. cublasLtLoggerOpenFile()
cublasStatus_t cublasLtLoggerOpenFile(const char* logFile);
Experimental: This function opens a logging output file in the given path.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If the logging file was successfully opened.
3.4.11. cublasLtLoggerSetLevel()
cublasStatus_t cublasLtLoggerSetLevel(int level);
Experimental: This function sets the logging level.
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If the value was not a valid logging level. See
cuBLASLt Logging.
CUBLAS_STATUS_SUCCESS If the logging level was successfully set.
3.4.12. cublasLtLoggerSetMask()
cublasStatus_t cublasLtLoggerSetMask(int mask);
Experimental: This function sets the logging mask.
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If the logging mask was successfully set.
3.4.13. cublasLtLoggerForceDisable()
cublasStatus_t cublasLtLoggerForceDisable();
Experimental: This function disables logging for the entire run.
3.4.14. cublasLtMatmul()
cublasStatus_t cublasLtMatmul(
cublasLtHandle_t lightHandle,
cublasLtMatmulDesc_t computeDesc,
const void *alpha,
const void *A,
cublasLtMatrixLayout_t Adesc,
const void *B,
cublasLtMatrixLayout_t Bdesc,
const void *beta,
const void *C,
cublasLtMatrixLayout_t Cdesc,
void *D,
cublasLtMatrixLayout_t Ddesc,
const cublasLtMatmulAlgo_t *algo,
void *workspace,
size_t workspaceSizeInBytes,
cudaStream_t stream);
This function computes the matrix multiplication of matrices A and B to produce the output
matrix D, according to the following operation:
D = alpha*(A*B) + beta*(C),
where A, B, and C are input matrices, and alpha and beta are input scalars.
Note: This function supports both in-place matrix multiplication (C == D and Cdesc == Ddesc)
and out-of-place matrix multiplication (C != D, both matrices must have the same data type,
number of rows, number of columns, batch size, and memory order). In the out-of-place case,
the leading dimension of C can be different from the leading dimension of D. Specifically the
leading dimension of C can be 0 to achieve row or column broadcast. If Cdesc is omitted, this
function assumes it to be equal to Ddesc.
Datatypes Supported:
cublasLtMatmul supports the following computeType, scaleType, Atype/Btype, and Ctype:
For computeType CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC with scaleType CUDA_R_32F:
‣ Atype/Btype CUDA_R_8I, Ctype CUDA_R_8I [1] (non-default epilogue not supported)
‣ Atype/Btype CUDA_R_16BF, Ctype CUDA_R_16BF, Bias Type CUDA_R_16BF [4]
‣ Atype/Btype CUDA_R_16F, Ctype CUDA_R_16F, Bias Type CUDA_R_16F [4]
‣ Atype/Btype CUDA_R_8I, Ctype CUDA_R_32F (non-default epilogue not supported)
‣ Atype/Btype CUDA_R_16BF, Ctype CUDA_R_32F, Bias Type CUDA_R_32F [4]
‣ Atype/Btype CUDA_R_16F, Ctype CUDA_R_32F, Bias Type CUDA_R_32F [4]
‣ Atype/Btype CUDA_R_32F, Ctype CUDA_R_32F, Bias Type CUDA_R_32F [4]
With scaleType CUDA_C_32F [5]:
‣ Atype/Btype CUDA_C_8I, Ctype CUDA_C_32F (non-default epilogue not supported)
‣ Atype/Btype CUDA_C_32F, Ctype CUDA_C_32F (non-default epilogue not supported)
For computeType CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 with scaleType CUDA_R_32F:
‣ Atype/Btype CUDA_R_32F, Ctype CUDA_R_32F, Bias Type CUDA_R_32F [4]
With scaleType CUDA_C_32F [5]:
‣ Atype/Btype CUDA_C_32F, Ctype CUDA_C_32F (non-default epilogue not supported)
To use IMMA kernels, one of the following sets of requirements, with the first being the
preferred one, must be met:
1. Using a regular data ordering:
‣ All matrix pointers must be 4-byte aligned. For even better performance, this condition
should hold with 16 instead of 4.
‣ If scaleType CUDA_R_32I is used, the only supported values for alpha and beta are 0
or 1.
See the table below when using IMMA kernels. Note that IMMA does not work with the
CUBLASLT_POINTER_MODE_MASK_DEVICE pointer mode.
Finally, see the table below when A, B, C, and D are planar complex matrices (see
CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET) to make use of mixed-precision tensor core
acceleration.
NOTES:
1. When using regular memory order and when compute type is 32I, input type is R_8I and output
type is R_8I, only "TN" format is supported - "A" must be transposed and "B" non-transposed.
2. IMMA kernel with computeType=32I and Ctype=CUDA_R_8I supports
per row scaling (see CUBLASLT_POINTER_MODE_DEVICE_VECTOR and
CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO in cublasLtPointerMode_t)
as well as ReLU and Bias epilogue modes (see CUBLASLT_MATMUL_DESC_EPILOGUE in
cublasLtMatmulDescAttributes_t).
3. These can only be used with planar layout (CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET != 0).
4. ReLU, dReLu, GELU, dGELU and Bias epilogue modes (see CUBLASLT_MATMUL_DESC_EPILOGUE in
cublasLtMatmulDescAttributes_t) are not supported when D matrix memory order is defined as
CUBLASLT_ORDER_ROW. For best performance when using the bias vector, specify beta == 0 and
CUBLASLT_POINTER_MODE_HOST.
Parameters:
Parameter Memory Input / Output Description
lightHandle Input Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
computeDesc Input Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
alpha, beta Device or host Input Pointers to the scalars used in the multiplication.
A, B, and C Device Input Pointers to the GPU memory associated with the corresponding descriptors Adesc, Bdesc and Cdesc.
Adesc, Bdesc and Cdesc Input Handles to the previously created descriptors of the type cublasLtMatrixLayout_t.
D Device Output Pointer to the GPU memory associated with the descriptor Ddesc.
Ddesc Input Handle to the previously created descriptor of the type cublasLtMatrixLayout_t.
algo Input Handle for the matrix multiplication algorithm to be used. See cublasLtMatmulAlgo_t. When NULL, an implicit heuristics query with default search preferences will be performed to determine the actual algorithm to use.
workspace Device Pointer to the workspace buffer allocated in the GPU memory. The pointer must be 16B aligned (i.e., a multiple of 16).
Returns:
Return Value Description
CUBLAS_STATUS_NOT_INITIALIZED If cuBLASLt handle has not been initialized.
CUBLAS_STATUS_INVALID_VALUE If the parameters are unexpectedly NULL, in
conflict or in an impossible configuration. For
example, when workspaceSizeInBytes is less
than workspace required by the configured algo.
CUBLAS_STATUS_NOT_SUPPORTED If the current implementation on the selected
device doesn't support the configured operation.
CUBLAS_STATUS_ARCH_MISMATCH If the configured operation cannot be run using
the selected device.
CUBLAS_STATUS_EXECUTION_FAILED If CUDA reported an execution error from the
device.
CUBLAS_STATUS_SUCCESS If the operation completed successfully.
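The sequence below sketches a single-precision GEMM through cublasLtMatmul(), performed in place (C is used for both the C input and the D output, with Cdesc serving as Ddesc) and with the algorithm left to the implicit heuristics query (algo == NULL). The helper name and its error handling are illustrative, not part of the API:

```c
#include <cublasLt.h>
#include <cuda_runtime.h>

/* Column-major C = alpha*A*B + beta*C, with A (m x k), B (k x n),
 * C (m x n) already resident in device memory. */
cublasStatus_t lt_sgemm(cublasLtHandle_t ltHandle,
                        int m, int n, int k,
                        const float *alpha,
                        const float *A, const float *B,
                        const float *beta, float *C,
                        void *workspace, size_t workspaceSize,
                        cudaStream_t stream) {
    cublasLtMatmulDesc_t opDesc = NULL;
    cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL, Cdesc = NULL;
    cublasStatus_t st;

    /* FP32 compute, FP32 scalars; default ops (no transpose). */
    st = cublasLtMatmulDescCreate(&opDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_32F, m, k, m);
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_32F, k, n, k);
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_32F, m, n, m);
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;

    /* algo == NULL: implicit heuristics query with default preferences. */
    st = cublasLtMatmul(ltHandle, opDesc, alpha,
                        A, Adesc, B, Bdesc, beta,
                        C, Cdesc, C, Cdesc,
                        NULL, workspace, workspaceSize, stream);
cleanup:
    if (Cdesc) cublasLtMatrixLayoutDestroy(Cdesc);
    if (Bdesc) cublasLtMatrixLayoutDestroy(Bdesc);
    if (Adesc) cublasLtMatrixLayoutDestroy(Adesc);
    if (opDesc) cublasLtMatmulDescDestroy(opDesc);
    return st;
}
```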
3.4.15. cublasLtMatmulAlgoCapGetAttribute()
cublasStatus_t cublasLtMatmulAlgoCapGetAttribute(
const cublasLtMatmulAlgo_t *algo,
cublasLtMatmulAlgoCapAttributes_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried capability attribute for an initialized
cublasLtMatmulAlgo_t descriptor structure. The capability attribute value is retrieved from the
enumerated type cublasLtMatmulAlgoCapAttributes_t.
For example, to get the list of supported Tile IDs:
cublasLtMatmulTile_t tiles[CUBLASLT_MATMUL_TILE_END];
size_t num_tiles, size_written;
if (cublasLtMatmulAlgoCapGetAttribute(algo, CUBLASLT_ALGO_CAP_TILE_IDS,
        tiles, sizeof(tiles), &size_written) == CUBLAS_STATUS_SUCCESS) {
    num_tiles = size_written / sizeof(tiles[0]);
}
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.16. cublasLtMatmulAlgoCheck()
cublasStatus_t cublasLtMatmulAlgoCheck(
cublasLtHandle_t lightHandle,
cublasLtMatmulDesc_t operationDesc,
cublasLtMatrixLayout_t Adesc,
cublasLtMatrixLayout_t Bdesc,
cublasLtMatrixLayout_t Cdesc,
cublasLtMatrixLayout_t Ddesc,
const cublasLtMatmulAlgo_t *algo,
cublasLtMatmulHeuristicResult_t *result);
This function performs the correctness check on the matrix multiply algorithm descriptor for
the matrix multiply operation cublasLtMatmul() function with the given input matrices A, B
and C, and the output matrix D. It checks whether the descriptor is supported on the current
device, and returns the result containing the required workspace and the calculated wave
count.
Note: CUBLAS_STATUS_SUCCESS doesn't fully guarantee that the algo will run. The algo will
fail if, for example, the buffers are not correctly aligned. However, if cublasLtMatmulAlgoCheck
fails, the algo will not run.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If matrix layout descriptors or the operation
descriptor do not match the algo descriptor.
CUBLAS_STATUS_NOT_SUPPORTED If the algo configuration or data type combination
is not currently supported on the given device.
CUBLAS_STATUS_ARCH_MISMATCH If the algo configuration cannot be run using the
selected device.
CUBLAS_STATUS_SUCCESS If the check was successful.
3.4.17. cublasLtMatmulAlgoConfigGetAttribute()
cublasStatus_t cublasLtMatmulAlgoConfigGetAttribute(
const cublasLtMatmulAlgo_t *algo,
cublasLtMatmulAlgoConfigAttributes_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried configuration attribute for an initialized
cublasLtMatmulAlgo_t descriptor. The configuration attribute value is retrieved from the
enumerated type cublasLtMatmulAlgoConfigAttributes_t.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.18. cublasLtMatmulAlgoConfigSetAttribute()
cublasStatus_t cublasLtMatmulAlgoConfigSetAttribute(
cublasLtMatmulAlgo_t *algo,
cublasLtMatmulAlgoConfigAttributes_t attr,
const void *buf,
size_t sizeInBytes);
This function sets the value of the specified configuration attribute for an initialized
cublasLtMatmulAlgo_t descriptor. The configuration attribute is an enumerant of the type
cublasLtMatmulAlgoConfigAttributes_t.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If buf is NULL or sizeInBytes doesn't match
the size of the internal storage for the selected
attribute.
CUBLAS_STATUS_SUCCESS If the attribute was set successfully.
3.4.19. cublasLtMatmulAlgoGetHeuristic()
cublasStatus_t cublasLtMatmulAlgoGetHeuristic(
cublasLtHandle_t lightHandle,
cublasLtMatmulDesc_t operationDesc,
cublasLtMatrixLayout_t Adesc,
cublasLtMatrixLayout_t Bdesc,
cublasLtMatrixLayout_t Cdesc,
cublasLtMatrixLayout_t Ddesc,
cublasLtMatmulPreference_t preference,
int requestedAlgoCount,
cublasLtMatmulHeuristicResult_t heuristicResultsArray[],
int *returnAlgoCount);
This function retrieves the possible algorithms for the matrix multiply operation
cublasLtMatmul() function with the given input matrices A, B and C, and the output matrix
D. The output is placed in heuristicResultsArray[] in the order of increasing estimated
compute time.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If requestedAlgoCount is less than or equal to zero.
CUBLAS_STATUS_NOT_SUPPORTED If no heuristic function is available for the current
configuration.
CUBLAS_STATUS_SUCCESS If query was successful. Inspect
heuristicResultsArray[0 to
(returnAlgoCount -1)].state for the status
of the results.
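A sketch of a typical query, constraining the search by a maximum workspace size through a preferences descriptor (the helper name is illustrative; the descriptors are assumed to have been created as for cublasLtMatmul()):

```c
#include <cublasLt.h>

/* Pick the heuristic's best algorithm for a configured matmul,
 * limited to algorithms that fit in maxWorkspace bytes. */
cublasStatus_t pick_algo(cublasLtHandle_t ltHandle,
                         cublasLtMatmulDesc_t opDesc,
                         cublasLtMatrixLayout_t Adesc,
                         cublasLtMatrixLayout_t Bdesc,
                         cublasLtMatrixLayout_t Cdesc,
                         size_t maxWorkspace,
                         cublasLtMatmulHeuristicResult_t *best) {
    cublasLtMatmulPreference_t pref = NULL;
    int returned = 0;
    cublasStatus_t st = cublasLtMatmulPreferenceCreate(&pref);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    st = cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &maxWorkspace, sizeof(maxWorkspace));
    if (st == CUBLAS_STATUS_SUCCESS) {
        /* Results come back ordered by increasing estimated compute time,
         * so asking for one result yields the best candidate. */
        st = cublasLtMatmulAlgoGetHeuristic(ltHandle, opDesc,
                                            Adesc, Bdesc, Cdesc, Cdesc,
                                            pref, 1, best, &returned);
        if (st == CUBLAS_STATUS_SUCCESS && returned == 0)
            st = CUBLAS_STATUS_NOT_SUPPORTED;
    }
    cublasLtMatmulPreferenceDestroy(pref);
    return st;
}
```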
3.4.20. cublasLtMatmulAlgoGetIds()
cublasStatus_t cublasLtMatmulAlgoGetIds(
cublasLtHandle_t lightHandle,
cublasComputeType_t computeType,
cudaDataType_t scaleType,
cudaDataType_t Atype,
cudaDataType_t Btype,
cudaDataType_t Ctype,
cudaDataType_t Dtype,
int requestedAlgoCount,
int algoIdsArray[],
int *returnAlgoCount);
This function retrieves the IDs of all the matrix multiply algorithms that are valid, and can
potentially be run by the cublasLtMatmul() function, for given types of the input matrices A, B
and C, and of the output matrix D.
Note: the IDs are returned in no particular order. To make sure that the best possible algo is
contained in the list, make requestedAlgoCount large enough to receive the full list. The list
is guaranteed to be complete if returnAlgoCount < requestedAlgoCount.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If requestedAlgoCount is less than or equal to zero.
CUBLAS_STATUS_SUCCESS If query was successful. Inspect
returnAlgoCount to get actual number of IDs
available.
3.4.21. cublasLtMatmulAlgoInit()
cublasStatus_t cublasLtMatmulAlgoInit(
cublasLtHandle_t lightHandle,
cublasComputeType_t computeType,
cudaDataType_t scaleType,
cudaDataType_t Atype,
cudaDataType_t Btype,
cudaDataType_t Ctype,
cudaDataType_t Dtype,
int algoId,
cublasLtMatmulAlgo_t *algo);
This function initializes the matrix multiply algorithm structure for cublasLtMatmul(), for a
specified matrix multiply algorithm and the input matrices A, B and C, and the output matrix D.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If algo is NULL or algoId is outside the
recognized range.
CUBLAS_STATUS_NOT_SUPPORTED If algoId is not supported for given combination
of data types.
CUBLAS_STATUS_SUCCESS If the structure was successfully initialized.
3.4.22. cublasLtMatmulDescCreate()
cublasStatus_t cublasLtMatmulDescCreate( cublasLtMatmulDesc_t *matmulDesc,
cublasComputeType_t computeType,
cudaDataType_t scaleType);
This function creates a matrix multiply descriptor by allocating the memory needed to hold its
opaque structure.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.23. cublasLtMatmulDescInit()
cublasStatus_t cublasLtMatmulDescInit( cublasLtMatmulDesc_t matmulDesc,
cublasComputeType_t computeType,
cudaDataType_t scaleType);
This function initializes a matrix multiply descriptor in a previously allocated buffer.
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.24. cublasLtMatmulDescDestroy()
cublasStatus_t cublasLtMatmulDescDestroy(
cublasLtMatmulDesc_t matmulDesc);
This function destroys a previously created matrix multiply descriptor object.
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If operation was successful.
3.4.25. cublasLtMatmulDescGetAttribute()
cublasStatus_t cublasLtMatmulDescGetAttribute(
cublasLtMatmulDesc_t matmulDesc,
cublasLtMatmulDescAttributes_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried attribute belonging to a previously created
matrix multiply descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.26. cublasLtMatmulDescSetAttribute()
cublasStatus_t cublasLtMatmulDescSetAttribute(
cublasLtMatmulDesc_t matmulDesc,
cublasLtMatmulDescAttributes_t attr,
const void *buf,
size_t sizeInBytes);
This function sets the value of the specified attribute belonging to a previously created matrix
multiply descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If buf is NULL or sizeInBytes doesn't match
the size of the internal storage for the selected
attribute.
CUBLAS_STATUS_SUCCESS If the attribute was set successfully.
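For example, a descriptor requesting op(A) = A^T can be sketched as follows (the helper name is illustrative):

```c
#include <cublasLt.h>

/* Create an FP32 matmul descriptor with A transposed; returns NULL on failure. */
cublasLtMatmulDesc_t make_desc_with_transa(void) {
    cublasLtMatmulDesc_t desc = NULL;
    cublasOperation_t transa = CUBLAS_OP_T;

    if (cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F)
            != CUBLAS_STATUS_SUCCESS)
        return NULL;

    /* sizeInBytes must match the internal storage of the attribute. */
    if (cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_TRANSA,
                                       &transa, sizeof(transa))
            != CUBLAS_STATUS_SUCCESS) {
        cublasLtMatmulDescDestroy(desc);
        return NULL;
    }
    return desc;
}
```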
3.4.27. cublasLtMatmulPreferenceCreate()
cublasStatus_t cublasLtMatmulPreferenceCreate(
cublasLtMatmulPreference_t *pref);
This function creates a matrix multiply heuristic search preferences descriptor by allocating
the memory needed to hold its opaque structure.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.28. cublasLtMatmulPreferenceInit()
cublasStatus_t cublasLtMatmulPreferenceInit(
cublasLtMatmulPreference_t pref);
This function initializes a matrix multiply heuristic search preferences descriptor in a previously allocated buffer.
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
3.4.29. cublasLtMatmulPreferenceDestroy()
cublasStatus_t cublasLtMatmulPreferenceDestroy(
cublasLtMatmulPreference_t pref);
This function destroys a previously created matrix multiply preferences descriptor object.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If the operation was successful.
3.4.30. cublasLtMatmulPreferenceGetAttribute()
cublasStatus_t cublasLtMatmulPreferenceGetAttribute(
cublasLtMatmulPreference_t pref,
cublasLtMatmulPreferenceAttributes_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried attribute belonging to a previously created
matrix multiply heuristic search preferences descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.31. cublasLtMatmulPreferenceSetAttribute()
cublasStatus_t cublasLtMatmulPreferenceSetAttribute(
cublasLtMatmulPreference_t pref,
cublasLtMatmulPreferenceAttributes_t attr,
const void *buf,
size_t sizeInBytes);
This function sets the value of the specified attribute belonging to a previously created matrix
multiply preferences descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If buf is NULL or sizeInBytes doesn't match
the size of the internal storage for the selected
attribute.
CUBLAS_STATUS_SUCCESS If the attribute was set successfully.
3.4.32. cublasLtMatrixLayoutCreate()
cublasStatus_t cublasLtMatrixLayoutCreate( cublasLtMatrixLayout_t *matLayout,
cudaDataType type,
uint64_t rows,
uint64_t cols,
int64_t ld);
This function creates a matrix layout descriptor by allocating the memory needed to hold its
opaque structure.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If the memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.33. cublasLtMatrixLayoutInit()
cublasStatus_t cublasLtMatrixLayoutInit( cublasLtMatrixLayout_t matLayout,
cudaDataType type,
uint64_t rows,
uint64_t cols,
int64_t ld);
This function initializes a matrix layout descriptor in a previously allocated buffer.
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If the memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.34. cublasLtMatrixLayoutDestroy()
cublasStatus_t cublasLtMatrixLayoutDestroy(
cublasLtMatrixLayout_t matLayout);
This function destroys a previously created matrix layout descriptor object.
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If the operation was successful.
3.4.35. cublasLtMatrixLayoutGetAttribute()
cublasStatus_t cublasLtMatrixLayoutGetAttribute(
cublasLtMatrixLayout_t matLayout,
cublasLtMatrixLayoutAttribute_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried attribute belonging to the specified matrix layout
descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.36. cublasLtMatrixLayoutSetAttribute()
cublasStatus_t cublasLtMatrixLayoutSetAttribute(
cublasLtMatrixLayout_t matLayout,
cublasLtMatrixLayoutAttribute_t attr,
const void *buf,
size_t sizeInBytes);
This function sets the value of the specified attribute belonging to a previously created matrix
layout descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If buf is NULL or sizeInBytes doesn't match
size of internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS If attribute was set successfully.
3.4.37. cublasLtMatrixTransform()
cublasStatus_t cublasLtMatrixTransform(
cublasLtHandle_t lightHandle,
cublasLtMatrixTransformDesc_t transformDesc,
const void *alpha,
const void *A,
cublasLtMatrixLayout_t Adesc,
const void *beta,
const void *B,
cublasLtMatrixLayout_t Bdesc,
void *C,
cublasLtMatrixLayout_t Cdesc,
cudaStream_t stream);
This function computes the matrix transformation operation on the input matrices A and B to
produce the output matrix C, according to the following operation:
C = alpha*transformation(A) + beta*transformation(B),
where A and B are input matrices, and alpha and beta are input scalars. The transformation
operation is defined by the transformDesc pointer. This function can be used to change the
memory order of data or to scale and shift the values.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_NOT_INITIALIZED If cuBLASLt handle has not been initialized.
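As an illustration, cublasLtMatrixTransform() can convert a column-major matrix to row-major storage by giving the output layout a CUBLASLT_ORDER_ROW order attribute (the helper name and layout choices are illustrative):

```c
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <stdint.h>

/* Copy an m x n float matrix from column-major A to row-major B:
 * B = 1.0 * transformation(A) + 0.0. */
cublasStatus_t to_row_major(cublasLtHandle_t ltHandle,
                            const float *A, float *B,
                            uint64_t m, uint64_t n, cudaStream_t stream) {
    cublasLtMatrixTransformDesc_t tdesc = NULL;
    cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL;
    cublasLtOrder_t rowOrder = CUBLASLT_ORDER_ROW;
    float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st;

    st = cublasLtMatrixTransformDescCreate(&tdesc, CUDA_R_32F);
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_32F, m, n, m); /* col-major */
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_32F, m, n, n); /* ld = n */
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;
    st = cublasLtMatrixLayoutSetAttribute(Bdesc, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                          &rowOrder, sizeof(rowOrder));
    if (st != CUBLAS_STATUS_SUCCESS) goto cleanup;

    /* The second input is not referenced because beta == 0. */
    st = cublasLtMatrixTransform(ltHandle, tdesc, &alpha, A, Adesc,
                                 &beta, NULL, NULL, B, Bdesc, stream);
cleanup:
    if (Bdesc) cublasLtMatrixLayoutDestroy(Bdesc);
    if (Adesc) cublasLtMatrixLayoutDestroy(Adesc);
    if (tdesc) cublasLtMatrixTransformDescDestroy(tdesc);
    return st;
}
```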
3.4.38. cublasLtMatrixTransformDescCreate()
cublasStatus_t cublasLtMatrixTransformDescCreate(
cublasLtMatrixTransformDesc_t *transformDesc,
cudaDataType scaleType);
This function creates a matrix transform descriptor by allocating the memory needed to hold
its opaque structure.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.39. cublasLtMatrixTransformDescInit()
cublasStatus_t cublasLtMatrixTransformDescInit(
cublasLtMatrixTransformDesc_t transformDesc,
cudaDataType scaleType);
This function initializes a matrix transform descriptor in a previously allocated buffer.
Returns:
Return Value Description
CUBLAS_STATUS_ALLOC_FAILED If memory could not be allocated.
CUBLAS_STATUS_SUCCESS If the descriptor was created successfully.
3.4.40. cublasLtMatrixTransformDescDestroy()
cublasStatus_t cublasLtMatrixTransformDescDestroy(
cublasLtMatrixTransformDesc_t transformDesc);
This function destroys a previously created matrix transform descriptor object.
Returns:
Return Value Description
CUBLAS_STATUS_SUCCESS If the operation was successful.
3.4.41. cublasLtMatrixTransformDescGetAttribute()
cublasStatus_t cublasLtMatrixTransformDescGetAttribute(
cublasLtMatrixTransformDesc_t transformDesc,
cublasLtMatrixTransformDescAttributes_t attr,
void *buf,
size_t sizeInBytes,
size_t *sizeWritten);
This function returns the value of the queried attribute belonging to a previously created
matrix transform descriptor.
Parameters:
Returns:
Return Value Description
CUBLAS_STATUS_INVALID_VALUE If sizeInBytes is 0 and sizeWritten is NULL, or if sizeInBytes is non-zero and buf is NULL, or sizeInBytes doesn't match the size of the internal storage for the selected attribute.
3.4.42. cublasLtMatrixTransformDescSetAttribute()
cublasStatus_t cublasLtMatrixTransformDescSetAttribute(
cublasLtMatrixTransformDesc_t transformDesc,
cublasLtMatrixTransformDescAttributes_t attr,
const void *buf,
size_t sizeInBytes);
This function sets the value of the specified attribute belonging to a previously created matrix
transform descriptor.
Parameters:
Returns:
When the tile dimension is not an exact multiple of the dimensions of C, some tiles are
partially filled on the right and/or bottom border. The current implementation does not pad
the incomplete tiles; it simply keeps track of them and performs the appropriately reduced
cuBLAS operations, so no extra computation is done. However, this can still lead to some load
imbalance when the GPUs do not all have the same number of incomplete tiles to work on.
When one or more matrices are located on GPU devices, the same tiling approach and
workload sharing is applied, and the memory transfers are in this case done between devices.
However, when the computation of a tile and some of its data are located on the same GPU
device, the memory transfer to/from the local data is bypassed and the GPU operates directly
on the local data. This can lead to a significant performance increase, especially when only
one GPU is used for the computation.
The matrices can be located on any GPU device and do not have to be located on the same
one. Furthermore, a matrix can even be located on a GPU device that does not participate in
the computation.
In contrast to the cuBLAS API, even if all matrices are located on the same device, the
cuBLASXt API is still a blocking API from the host's point of view: the resulting data, wherever
located, will be valid when the call returns, and no device synchronization is required.
‣ all GPUs participating in the computation have the same compute capabilities and the same
number of SMs.
4.2.2. cublasXtOpType_t
The cublasXtOpType_t type enumerates the four possible precision types supported by BLAS
routines. This enum is used as a parameter of the routines cublasXtSetCpuRoutine() and
cublasXtSetCpuRatio() to set up the hybrid configuration.
Value Meaning
CUBLASXT_FLOAT float or single precision type
CUBLASXT_DOUBLE double precision type
4.2.3. cublasXtBlasOp_t
The cublasXtBlasOp_t type enumerates the BLAS3 and BLAS-like routines supported by the
cuBLASXt API. This enum is used as a parameter of the routines cublasXtSetCpuRoutine() and
cublasXtSetCpuRatio() to set up the hybrid configuration.
Value Meaning
CUBLASXT_GEMM GEMM routine
4.2.4. cublasXtPinningMemMode_t
This type is used to enable or disable the Pinning Memory mode through the routine
cublasXtSetPinningMemMode().
Value Meaning
CUBLASXT_PINNING_DISABLED the Pinning Memory mode is disabled
4.3.1. cublasXtCreate()
cublasStatus_t
cublasXtCreate(cublasXtHandle_t *handle)
This function initializes the cuBLASXt API and creates a handle to an opaque structure holding
the cuBLASXt API context. It allocates hardware resources on the host and device and must be
called prior to making any other cuBLASXt API calls.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the initialization succeeded
4.3.2. cublasXtDestroy()
cublasStatus_t
cublasXtDestroy(cublasXtHandle_t handle)
This function releases hardware resources used by the cuBLASXt API context. The release of
GPU resources may be deferred until the application exits. This function is usually the last call
with a particular handle to the cuBLASXt API.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the shut down succeeded
4.3.3. cublasXtDeviceSelect()
cublasXtDeviceSelect(cublasXtHandle_t handle, int nbDevices, int deviceId[])
This function allows the user to provide the number of GPU devices and their respective IDs
that will participate in the subsequent cuBLASXt API math function calls. It creates a cuBLAS
context for every GPU provided in that list. Currently the device configuration is static and
cannot be changed between math function calls. In that regard, this function should be called
only once after cublasXtCreate(). To be able to run multiple configurations, multiple cuBLASXt
API contexts should be created.
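A sketch of the expected call sequence, selecting two hypothetical devices 0 and 1 right after creating the handle (the helper name is illustrative):

```c
#include <cublasXt.h>
#include <stddef.h>

/* Initialize a cuBLASXt context that will use GPUs 0 and 1;
 * returns NULL if creation or device selection fails. */
cublasXtHandle_t xt_init_two_gpus(void) {
    cublasXtHandle_t handle = NULL;
    int devices[2] = {0, 1};

    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS)
        return NULL;

    /* Call once, immediately after cublasXtCreate: the device list is static. */
    if (cublasXtDeviceSelect(handle, 2, devices) != CUBLAS_STATUS_SUCCESS) {
        cublasXtDestroy(handle);
        return NULL;
    }
    return handle;
}
```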
4.3.4. cublasXtSetBlockDim()
cublasXtSetBlockDim(cublasXtHandle_t handle, int blockDim)
This function allows the user to set the block dimension used for the tiling of the matrices for
the subsequent math function calls. Matrices are split into square tiles of blockDim x blockDim
dimension. This function can be called at any time and will take effect for the following math
function calls. The block dimension should be chosen so as to optimize the math operation
and to make sure that the PCI transfers are well overlapped with the computation.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the call has been successful
4.3.5. cublasXtGetBlockDim()
cublasXtGetBlockDim(cublasXtHandle_t handle, int *blockDim)
This function allows the user to query the block dimension used for the tiling of the matrices.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the call has been successful
4.3.6. cublasXtSetCpuRoutine()
cublasXtSetCpuRoutine(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp,
cublasXtOpType_t type, void *blasFunctor)
This function allows the user to provide a CPU implementation of the corresponding BLAS
routine. It can be used with the function cublasXtSetCpuRatio() to define a hybrid computation
between the CPU and the GPUs. Currently the hybrid feature is only supported for the xGEMM
routines.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the call has been successful
4.3.7. cublasXtSetCpuRatio()
cublasXtSetCpuRatio(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp,
cublasXtOpType_t type, float ratio )
This function allows the user to define the percentage of workload that should be done on
a CPU in the context of a hybrid computation. It can be used with the function
cublasXtSetCpuRoutine() to define a hybrid computation between the CPU and the GPUs.
Currently the hybrid feature is only supported for the xGEMM routines.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the call has been successful
4.3.8. cublasXtSetPinningMemMode()
cublasXtSetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t mode)
This function allows the user to enable or disable the Pinning Memory mode. When enabled,
the matrices passed in subsequent cuBLASXt API calls will be pinned and unpinned using
the CUDA Runtime routines cudaHostRegister() and cudaHostUnregister(), respectively, if the
matrices are not already pinned. If a matrix happens to be only partially pinned, it will not
be pinned either. Pinning the memory improves PCI transfer performance and allows PCI
memory transfers to be overlapped with computation. However, pinning and unpinning the
memory takes some time, which might not be amortized. It is advised that users pin the
memory on their own using cudaMallocHost() or cudaHostRegister() and unpin it when the
computation sequence is completed. By default, the Pinning Memory mode is disabled.
Note: The Pinning Memory mode should not be enabled when matrices used for different calls
to the cuBLASXt API overlap. cuBLASXt determines whether a matrix is pinned by checking its
first address with cudaHostGetFlags(), and thus cannot know whether the matrix is already
partially pinned or not. This is especially true in multi-threaded applications where memory
could be partially or totally pinned or unpinned while another thread is accessing that memory.
4.3.9. cublasXtGetPinningMemMode()
cublasXtGetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t *mode)
This function allows the user to query the Pinning Memory mode. By default, the Pinning
Memory mode is disabled.
Return Value Meaning
CUBLAS_STATUS_SUCCESS the call has been successful
The abbreviations Re(.) and Im(.) will stand for the real and imaginary part of a number,
respectively. Since the imaginary part of a real number does not exist, we will consider it to be
zero and can usually simply discard it from the equation where it is being used. Also, ᾱ will
denote the complex conjugate of α.
In general, throughout the documentation, the lower-case Greek symbols α and β will denote
scalars, lower-case English letters in bold type x and y will denote vectors, and capital English
letters A, B and C will denote matrices.
4.4.1. cublasXt<t>gemm()
cublasStatus_t cublasXtSgemm(cublasXtHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
size_t m, size_t n, size_t k,
const float *alpha,
const float *A, size_t lda,
const float *B, size_t ldb,
const float *beta,
float *C, size_t ldc)
cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
size_t m, size_t n, size_t k,
const double *alpha,
const double *A, size_t lda,
const double *B, size_t ldb,
const double *beta,
double *C, size_t ldc)
This function performs the matrix-matrix multiplication
C = α op(A) op(B) + β C,
where α and β are scalars, and A, B and C are matrices stored in column-major format with
dimensions op(A) m × k, op(B) k × n and C m × n, respectively. Also, for matrix A,
op(A) = A if transa == CUBLAS_OP_N, Aᵀ if transa == CUBLAS_OP_T, or Aᴴ if transa ==
CUBLAS_OP_C,
and op(B) is defined similarly for matrix B.
beta host input <type> scalar used for multiplication. If beta==0, C does not
have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
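A minimal sketch of a call on host-resident, column-major matrices (the wrapper name and the choice alpha = 1, beta = 0 are illustrative; the handle is assumed to have been set up with cublasXtCreate() and cublasXtDeviceSelect()):

```c
#include <cublasXt.h>
#include <stddef.h>

/* C = A*B on host-resident column-major matrices: A is m x k, B is k x n,
 * C is m x n. cuBLASXt tiles the operands and dispatches the tiles to the
 * selected GPUs; the call blocks until C is valid. */
cublasStatus_t xt_sgemm(cublasXtHandle_t handle,
                        size_t m, size_t n, size_t k,
                        const float *A, const float *B, float *C) {
    const float alpha = 1.0f, beta = 0.0f;

    /* beta == 0, so the initial contents of C need not be valid. */
    return cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         m, n, k,
                         &alpha, A, m,
                         B, k,
                         &beta, C, m);
}
```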
4.4.2. cublasXt<t>hemm()
cublasStatus_t cublasXtChemm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const cuComplex *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZhemm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the Hermitian matrix-matrix multiplication
C = α A B + β C if side == CUBLAS_SIDE_LEFT, or C = α B A + β C if side == CUBLAS_SIDE_RIGHT
where A is a Hermitian matrix stored in lower or upper mode, B and C are m × n matrices,
and α and β are scalars.
Param. Memory In/out Meaning
handle input handle to the cuBLASXt API context.
uplo input indicates if matrix A lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.3. cublasXt<t>symm()
cublasStatus_t cublasXtSsymm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const float *alpha,
const float *A, size_t lda,
const float *B, size_t ldb,
const float *beta,
float *C, size_t ldc)
cublasStatus_t cublasXtDsymm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const double *alpha,
const double *A, size_t lda,
const double *B, size_t ldb,
const double *beta,
double *C, size_t ldc)
cublasStatus_t cublasXtCsymm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const cuComplex *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsymm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
size_t m, size_t n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the symmetric matrix-matrix multiplication
C = α A B + β C if side == CUBLAS_SIDE_LEFT, or C = α B A + β C if side == CUBLAS_SIDE_RIGHT
where A is a symmetric matrix stored in lower or upper mode, B and C are m × n matrices,
and α and β are scalars.
Param. Memory In/out Meaning
handle input handle to the cuBLASXt API context.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta == 0 then C
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.4. cublasXt<t>syrk()
cublasStatus_t cublasXtSsyrk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const float *alpha,
const float *A, size_t lda,
const float *beta,
float *C, size_t ldc)
cublasStatus_t cublasXtDsyrk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const double *alpha,
const double *A, size_t lda,
const double *beta,
double *C, size_t ldc)
cublasStatus_t cublasXtCsyrk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsyrk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the symmetric rank-k update
C = α op(A) op(A)^T + β C
where α and β are scalars, C is a symmetric matrix stored in lower or upper mode, and A is a
matrix with dimensions op(A) n × k. Also, for matrix A,
op(A) = A if trans == CUBLAS_OP_N, or op(A) = A^T if trans == CUBLAS_OP_T.
uplo input indicates if matrix C lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.5. cublasXt<t>syr2k()
cublasStatus_t cublasXtSsyr2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const float *alpha,
const float *A, size_t lda,
const float *B, size_t ldb,
const float *beta,
float *C, size_t ldc)
cublasStatus_t cublasXtDsyr2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const double *alpha,
const double *A, size_t lda,
const double *B, size_t ldb,
const double *beta,
double *C, size_t ldc)
cublasStatus_t cublasXtCsyr2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const cuComplex *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsyr2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the symmetric rank-2k update
C = α (op(A) op(B)^T + op(B) op(A)^T) + β C
where α and β are scalars, C is a symmetric matrix stored in lower or upper mode, and A and B
are matrices with dimensions op(A) n × k and op(B) n × k, respectively. Also, for matrices A
and B,
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^T and op(B) = B^T if trans == CUBLAS_OP_T.
uplo input indicates if matrix C lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0, then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.6. cublasXt<t>syrkx()
cublasStatus_t cublasXtSsyrkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const float *alpha,
const float *A, size_t lda,
const float *B, size_t ldb,
const float *beta,
float *C, size_t ldc)
cublasStatus_t cublasXtDsyrkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const double *alpha,
const double *A, size_t lda,
const double *B, size_t ldb,
const double *beta,
double *C, size_t ldc)
cublasStatus_t cublasXtCsyrkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const cuComplex *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsyrkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, size_t ldc)
This function performs a variation of the symmetric rank-k update
C = α op(A) op(B)^T + β C
where α and β are scalars, C is a symmetric matrix stored in lower or upper mode, and A and B
are matrices with dimensions op(A) n × k and op(B) n × k, respectively. Also, for matrices A
and B,
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^T and op(B) = B^T if trans == CUBLAS_OP_T.
This routine can be used when the matrix B is related to A in such a way that the result is
guaranteed to be symmetric. A usual example is when the matrix B is a scaled form of the
matrix A: this is equivalent to B being the product of the matrix A and a diagonal matrix.
Param. Memory In/out Meaning
handle input handle to the cuBLASXt API context.
uplo input indicates if matrix C lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0, then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.7. cublasXt<t>herk()
cublasStatus_t cublasXtCherk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const float *alpha,
const cuComplex *A, size_t lda,
const float *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZherk(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const double *alpha,
const cuDoubleComplex *A, size_t lda,
const double *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the Hermitian rank-k update
C = α op(A) op(A)^H + β C
where α and β are real scalars, C is a Hermitian matrix stored in lower or upper mode, and A
is a matrix with dimensions op(A) n × k. Also, for matrix A,
op(A) = A if trans == CUBLAS_OP_N, or op(A) = A^H if trans == CUBLAS_OP_C.
uplo input indicates if matrix C lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.8. cublasXt<t>her2k()
cublasStatus_t cublasXtCher2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const float *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZher2k(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const double *beta,
cuDoubleComplex *C, size_t ldc)
This function performs the Hermitian rank-2k update
C = α op(A) op(B)^H + ᾱ op(B) op(A)^H + β C
where α is a scalar, β is a real scalar, C is a Hermitian matrix stored in lower or upper mode,
and A and B are matrices with dimensions op(A) n × k and op(B) n × k, respectively. Also, for
matrices A and B,
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^H and op(B) = B^H if trans == CUBLAS_OP_C.
uplo input indicates if matrix C lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta==0 then C does
not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.9. cublasXt<t>herkx()
cublasStatus_t cublasXtCherkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
const cuComplex *B, size_t ldb,
const float *beta,
cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZherkx(cublasXtHandle_t handle,
cublasFillMode_t uplo, cublasOperation_t trans,
size_t n, size_t k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
const cuDoubleComplex *B, size_t ldb,
const double *beta,
cuDoubleComplex *C, size_t ldc)
This function performs a variation of the Hermitian rank-k update
C = α op(A) op(B)^H + β C
where α is a scalar, β is a real scalar, C is a Hermitian matrix stored in lower or upper mode,
and A and B are matrices with dimensions op(A) n × k and op(B) n × k, respectively. Also, for
matrices A and B,
op(A) = A and op(B) = B if trans == CUBLAS_OP_N, or op(A) = A^H and op(B) = B^H if trans == CUBLAS_OP_C.
This routine can be used when the matrix B is related to A in such a way that the result is
guaranteed to be Hermitian. A usual example is when the matrix B is a scaled form of the
matrix A: this is equivalent to B being the product of the matrix A and a diagonal matrix. For an
efficient computation of the product of a regular matrix with a diagonal matrix, refer to the
routine cublasXt<t>dgmm.
Param. Memory In/out Meaning
handle input handle to the cuBLASXt API context.
uplo input indicates if matrix C lower or upper part is stored, the other
Hermitian part is not referenced and is inferred from the
stored elements.
beta host input real scalar used for multiplication, if beta==0 then C does not
have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.10. cublasXt<t>trsm()
cublasStatus_t cublasXtStrsm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasXtDiagType_t diag,
size_t m, size_t n,
const float *alpha,
const float *A, size_t lda,
float *B, size_t ldb)
cublasStatus_t cublasXtDtrsm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasXtDiagType_t diag,
size_t m, size_t n,
const double *alpha,
const double *A, size_t lda,
double *B, size_t ldb)
cublasStatus_t cublasXtCtrsm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasXtDiagType_t diag,
size_t m, size_t n,
const cuComplex *alpha,
const cuComplex *A, size_t lda,
cuComplex *B, size_t ldb)
cublasStatus_t cublasXtZtrsm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasXtDiagType_t diag,
size_t m, size_t n,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, size_t lda,
cuDoubleComplex *B, size_t ldb)
This function solves the triangular linear system with multiple right-hand sides
op(A) X = α B if side == CUBLAS_SIDE_LEFT, or X op(A) = α B if side == CUBLAS_SIDE_RIGHT
where A is a triangular matrix stored in lower or upper mode with or without the main
diagonal, X and B are m × n matrices, and α is a scalar. Also, for matrix A,
op(A) = A if trans == CUBLAS_OP_N, A^T if trans == CUBLAS_OP_T, or A^H if trans == CUBLAS_OP_C.
The solution X overwrites the right-hand sides B on exit.
diag input indicates if the elements on the main diagonal of matrix A are
unity and should not be accessed.
alpha host input <type> scalar used for multiplication, if alpha==0 then A is
not referenced and B does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.11. cublasXt<t>trmm()
cublasStatus_t cublasXtStrmm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
size_t m, size_t n,
const float *alpha,
const float *A, size_t lda,
const float *B, size_t ldb,
float *C, size_t ldc)
cublasStatus_t cublasXtDtrmm(cublasXtHandle_t handle,
cublasSideMode_t side, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
size_t m, size_t n,
This function performs the triangular matrix-matrix multiplication
C = α op(A) B if side == CUBLAS_SIDE_LEFT, or C = α B op(A) if side == CUBLAS_SIDE_RIGHT
where A is a triangular matrix stored in lower or upper mode with or without the main
diagonal, B and C are m × n matrices, and α is a scalar. Also, for matrix A,
op(A) = A if trans == CUBLAS_OP_N, A^T if trans == CUBLAS_OP_T, or A^H if trans == CUBLAS_OP_C.
Notice that in order to achieve better parallelism, the cuBLASXt API, like the cuBLAS
API, differs from the BLAS API for this routine. The BLAS API assumes an in-place
implementation (with results written back to B), while the cuBLASXt API assumes an out-of-
place implementation (with results written into C). The application can still obtain the in-place
functionality of BLAS in the cuBLASXt API by passing the address of the matrix B in place of
the matrix C. No other overlapping in the input parameters is supported.
Param. Memory In/out Meaning
handle input handle to the cuBLASXt API context.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
4.4.12. cublasXt<t>spmm()
cublasStatus_t cublasXtSspmm( cublasXtHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
size_t m,
size_t n,
const float *alpha,
const float *AP,
const float *B,
size_t ldb,
const float *beta,
float *C,
size_t ldc );
Note: The packed matrix AP must be located on the Host, whereas the other matrices can be
located on the Host or on any GPU device.
uplo input indicates if matrix A lower or upper part is stored, the other
symmetric part is not referenced and is inferred from the
stored elements.
beta host input <type> scalar used for multiplication, if beta == 0 then C
does not have to be a valid input.
The possible error values returned by this function and their meanings are listed below.
Error Value Meaning
CUBLAS_STATUS_SUCCESS the operation completed successfully
This appendix does not provide a full reference of each Legacy API datatype and entry point.
Instead, it describes how to use the API, especially where this is different from the regular
cuBLAS API.
Note that in this section, all references to the “cuBLAS Library” refer to the Legacy cuBLAS
API only.
WARNING: The legacy cuBLAS API is deprecated and will be removed in a future release.
This legacy type corresponds to type cublasStatus_t in the cuBLAS library API.
‣ Functions that take alpha and/or beta parameters by reference on the host or the device
as scaling factors, such as gemm.
‣ Functions that return a scalar result on the host or the device, such as amax(), amin(),
asum(), rotg(), rotmg(), dot() and nrm2().
For the functions of the first category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, the scalar parameters alpha and/or beta can be on the
stack or allocated on the heap, but should not be placed in managed memory. Underneath, the
CUDA kernels related to those functions will be launched with the value of alpha and/or
beta. Therefore if they were allocated on the heap, they can be freed just after the return
of the call even though the kernel launch is asynchronous. When the pointer mode is set to
CUBLAS_POINTER_MODE_DEVICE, alpha and/or beta must be accessible on the device and
their values should not be modified until the kernel is done. Note that since cudaFree() does
an implicit cudaDeviceSynchronize(), cudaFree() can still be called on alpha and/or beta
just after the call but it would defeat the purpose of using this pointer mode in that case.
For the functions of the second category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, these functions block the CPU, until the GPU has completed its
computation and the results have been copied back to the Host. When the pointer mode is set
to CUBLAS_POINTER_MODE_DEVICE, these functions return immediately. In this case, similar to
matrix and vector results, the scalar result is ready only when execution of the routine on the
GPU has completed. This requires proper synchronization in order to read the result from the
host.
In either case, the pointer mode CUBLAS_POINTER_MODE_DEVICE allows the library functions
to execute completely asynchronously from the Host even when alpha and/or beta are
generated by a previous kernel. For example, this situation can arise when iterative methods
for solution of linear systems and eigenvalue problems are implemented using the cuBLAS
library.
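As an illustration (a sketch, not a complete program), device pointer mode for the second category of functions can be used along the following lines; the variable names handle, n and d_x are assumed to have been set up already:

```c
/* Sketch: obtaining nrm2's scalar result on the device so the call
   does not block the host. Assumes handle is a valid cublasHandle_t
   and d_x is a device vector of n floats. */
float *d_result;
cudaMalloc((void **)&d_result, sizeof(float));
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasSnrm2(handle, n, d_x, 1, d_result);   /* returns immediately */
/* ... launch kernels that consume *d_result asynchronously ... */
/* Synchronize before copying the result back and reading it on the host. */
```

This is exactly the pattern used by iterative solvers: the norm computed by one call feeds the next kernel without a round trip to the host.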
Change the parameter characters 'N' or 'n' (non-transpose operation), 'T' or 't'
(transpose operation) and 'C' or 'c' (conjugate transpose operation) to CUBLAS_OP_N,
CUBLAS_OP_T and CUBLAS_OP_C, respectively.
Change the parameter characters 'L' or 'l' (lower part filled) and 'U' or 'u' (upper part
filled) to CUBLAS_FILL_MODE_LOWER and CUBLAS_FILL_MODE_UPPER, respectively.
Change the parameter characters 'N' or 'n' (non-unit diagonal) and 'U' or 'u' (unit
diagonal) to CUBLAS_DIAG_NON_UNIT and CUBLAS_DIAG_UNIT, respectively.
Change the parameter characters 'L' or 'l' (left side) and 'R' or 'r' (right side) to
CUBLAS_SIDE_LEFT and CUBLAS_SIDE_RIGHT, respectively.
If the legacy API function returns a scalar value, add an extra scalar parameter of the same
type passed by reference, as the last parameter to the same function.
Instead of using cublasGetError, use the return value of the function itself to check for
errors.
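Putting the rules above together, a legacy call and its cuBLAS-API equivalent might look as follows (the surrounding variables are assumed to exist; this is a sketch, not a complete program):

```c
/* Legacy API: character options, no handle, alpha and beta passed by
   value, errors checked separately via cublasGetError(). */
cublasSgemm('N', 'T', m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

/* cuBLAS API (cublas_v2.h): explicit handle, enum options,
   alpha and beta passed by reference, status returned by the call. */
cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                                    m, n, k, &alpha, A, lda, B, ldb,
                                    &beta, C, ldc);
```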
Finally, please use the function prototypes in the header files “cublas.h” and “cublas_v2.h” to
check the code for correctness.
A.9. Examples
For sample code references that use the legacy cuBLAS API please see the two examples
below. They show an application written in C using the legacy cuBLAS library API with two
indexing styles (Example A.1. "Application Using C and cuBLAS: 1-based indexing" and
Example A.2. "Application Using C and cuBLAS: 0-based Indexing"). This application is
analogous to the one using the cuBLAS library API that is shown in the Introduction chapter.
Example A.1. Application Using C and cuBLAS: 1-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
static __inline__ void modify (float *m, int ldm, int n, int p, int q, float
alpha, float beta){
cublasSscal (n-q+1, alpha, &m[IDX2F(p,q,ldm)], ldm);
cublasSscal (ldm-p+1, beta, &m[IDX2F(p,q,ldm)], 1);
}
Example A.2. Application Using C and cuBLAS: 0-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
static __inline__ void modify (float *m, int ldm, int n, int p, int q, float
alpha, float beta){
cublasSscal (n-q, alpha, &m[IDX2C(p,q,ldm)], ldm);
cublasSscal (ldm-p, beta, &m[IDX2C(p,q,ldm)], 1);
}
}
for (j = 0; j < N; j++) {
for (i = 0; i < M; i++) {
a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
}
}
cublasInit();
stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("device memory allocation failed");
cublasShutdown();
return EXIT_FAILURE;
}
stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("data download failed");
cublasFree (devPtrA);
cublasShutdown();
return EXIT_FAILURE;
}
modify (devPtrA, M, N, 1, 2, 16.0f, 12.0f);
stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("data upload failed");
cublasFree (devPtrA);
cublasShutdown();
return EXIT_FAILURE;
}
cublasFree (devPtrA);
cublasShutdown();
for (j = 0; j < N; j++) {
for (i = 0; i < M; i++) {
printf ("%7.0f", a[IDX2C(i,j,M)]);
}
printf ("\n");
}
free(a);
return EXIT_SUCCESS;
}
The cuBLAS library is implemented using the C-based CUDA toolchain. Thus, it provides a C-
style API. This makes interfacing to applications written in C and C++ trivial, but the library
can also be used by applications written in Fortran. In particular, the cuBLAS library uses 1-
based indexing and Fortran-style column-major storage for multidimensional data to simplify
interfacing to Fortran applications. Unfortunately, Fortran-to-C calling conventions are not
standardized and differ by platform and toolchain. In particular, differences may exist in the
following areas:
The thunking wrappers allow interfacing to existing Fortran applications without any changes
to the application. During each call, the wrappers allocate GPU memory, copy source data
from CPU memory space to GPU memory space, call cuBLAS, and finally copy back the
results to CPU memory space and deallocate the GPU memory. As this process causes very
significant call overhead, these wrappers are intended for light testing, not for production
code. To use the thunking wrappers, the application needs to be compiled with the file
fortran_thunking.c
The direct wrappers, intended for production code, substitute device pointers for vector and
matrix arguments in all BLAS functions. To use these interfaces, existing applications need
to be modified slightly to allocate and deallocate data structures in GPU memory space
(using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory
spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and
CUBLAS_GET_MATRIX). The sample wrappers provided in fortran.c map device pointers to the
OS-dependent type size_t, which is 32 bits wide on 32-bit platforms and 64 bits wide on 64-bit
platforms.
One approach to deal with index arithmetic on device pointers in Fortran code is to use C-
style macros, and use the C preprocessor to expand these, as shown in the example below.
On Linux and Mac OS X, one way of pre-processing is to use the option ’-E -x f77-cpp-input’
when using g77 compiler, or simply the option ’-cpp’ when using g95 or gfortran. On Windows
platforms with Microsoft Visual C/C++, using ’cl -EP’ achieves similar results.
program matrixmod
implicit none
integer M,N
parameter (M=6, N=5)
real*4 a(M,N)
integer i, j
external cublas_init
external cublas_shutdown
do j = 1, N
do i = 1, M
a(i, j) = (i-1)*M + j
enddo
enddo
call cublas_init
call modify ( a, M, N, 2, 3, 16.0, 12.0 )
call cublas_shutdown
do j = 1 , N
do i = 1 , M
write(*,"(F7.0$)") a(i,j)
enddo
write (*,*) ""
enddo
stop
end
When traditional fixed-form Fortran 77 code is ported to use the cuBLAS library, line length
often increases when the BLAS calls are exchanged for cuBLAS calls. Longer function names
and possible macro expansion are contributing factors. Inadvertently exceeding the maximum
line length can lead to run-time errors that are difficult to find, so care should be taken not to
exceed the 72-column limit if fixed form is retained.
The examples in this chapter show a small application implemented in Fortran 77 on the host
and the same application with the non-thunking wrappers after it has been ported to use the
cuBLAS library.
The second example should be compiled with ARCH_64 defined as 1 on a 64-bit OS and as
0 on a 32-bit OS. For example, for g95 or gfortran, this can be done directly on the
command line by using the option ’-cpp -DARCH_64=1’.
endif
stat = cublas_set_matrix(M,N,sizeof_real,a,M,devPtrA,M)
if (stat.NE.0) then
call cublas_free( devPtrA )
write(*,*) "data download failed"
call cublas_shutdown
stop
endif
This appendix describes important requirements and recommendations that ensure correct
use of cuBLAS with other libraries and utilities.
C.1. nvprune
nvprune enables pruning relocatable host objects and static libraries to only contain device
code for the specific target architectures. In the case of cuBLAS, particular care must be taken
when using nvprune with compute capabilities whose minor revision number is different from
0. To reduce binary size, cuBLAS may only store major-revision equivalents of CUDA binary
files for kernels reused between different minor revision versions. Therefore, to ensure that a
pruned library does not fail for arbitrary problems, the user must keep binaries for the selected
architecture and all prior minor architectures within its major architecture.
For example, the following call prunes libcublas_static.a to contain only sm_70 (Volta)
and sm_75 (Turing) cubins:
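The invocation itself did not survive formatting here; under the stated requirement (keep the selected architecture plus all prior minor architectures of its major architecture), a call of the following shape matches that description. The output file name is illustrative:

```shell
nvprune --generate-code code=sm_70 --generate-code code=sm_75 \
        libcublas_static.a -o libcublas_static_sm70_sm75.a
```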
NVIDIA would like to thank the following individuals and institutions for their contributions:
‣ Portions of the SGEMM, DGEMM, CGEMM and ZGEMM library routines were written by
Vasily Volkov of the University of California.
‣ Portions of the SGEMM, DGEMM and ZGEMM library routines were written by Davide
Barbieri of the University of Rome Tor Vergata.
‣ Portions of the DGEMM and SGEMM library routines optimized for Fermi architecture were
developed by the University of Tennessee. Subsequently, several other routines that are
optimized for the Fermi architecture have been derived from these initial DGEMM and
SGEMM implementations.
‣ The substantial optimizations of the STRSV, DTRSV, CTRSV and ZTRSV library routines
were developed by Jonathan Hogg of The Science and Technology Facilities Council
(STFC). Subsequently, some optimizations of the STRSM, DTRSM, CTRSM and ZTRSM have
been derived from these TRSV implementations.
‣ Substantial optimizations of the SYMV and HEMV library routines were developed by
Ahmad Abdelfattah, David Keyes and Hatem Ltaief of King Abdullah University of Science
and Technology (KAUST).
‣ Substantial optimizations of the TRMM and TRSM library routines were developed by
Ali Charara, David Keyes and Hatem Ltaief of King Abdullah University of Science and
Technology (KAUST).
‣ This product includes {fmt} - A modern formatting library (https://fmt.dev). Copyright (c)
2012 - present, Victor Zverovich.
‣ This product includes SIMD Library for Evaluating Elementary Functions, vectorized libm
and DFT (https://sleef.org). Boost Software License - Version 1.0 - August 17th, 2003.
‣ This product includes Frozen - a header-only, constexpr alternative to gperf for C++14
users. https://github.com/serge-sans-paille/frozen Apache License - Version 2.0, January
2004.
‣ This product includes Boost C++ Libraries - free peer-reviewed portable C++ source
libraries https://www.boost.org/ Boost Software License - Version 1.0 - August 17th, 2003.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed
in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any
customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed
either directly or indirectly by this document.
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may
be trademarks of the respective companies with which they are associated.
Copyright
© -2022 NVIDIA Corporation & affiliates. All rights reserved.