
SHARED MEMORY BANK CONFLICTS

SAMPLE

v2023.1.1 | September 2024


TABLE OF CONTENTS

Chapter 1. Introduction
Chapter 2. Application
Chapter 3. Configuration
Chapter 4. Initial version of the kernel
Chapter 5. Updated version of the kernel
Chapter 6. Resources

Chapter 1.
INTRODUCTION

This sample uses the Nsight Compute profiler to analyze a CUDA kernel that transposes an
N x N square matrix of float elements in global memory. To avoid uncoalesced global
memory accesses, the kernel stages the data through shared memory. The profiler is used
to identify the shared memory bank conflicts that make the kernel's shared memory
accesses inefficient.

Shared memory accesses on a GPU


Shared memory is located on-chip, so it has much higher bandwidth and much lower
latency than either local or global memory. Shared memory is shared across a
Cooperative Thread Array (CTA); in CUDA, CTAs are referred to as thread blocks.
CTAs that share data between threads via shared memory must use synchronization
operations (such as __syncthreads()) between the stores and loads to ensure that data
written by any one thread is visible to the other threads in the CTA.
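
As a minimal sketch of this store, __syncthreads(), load pattern (not code from the
sample; it assumes a single one-dimensional block of 256 threads operating on a
256-element array), a kernel that reverses the array might stage the data through
shared memory like this:

__global__ void reverseWithinBlock(int* data)
{
    __shared__ int s[256];   // one element per thread in the CTA
    int t = threadIdx.x;

    s[t] = data[t];          // each thread stores its own element
    __syncthreads();         // make all stores visible to the whole CTA
    data[t] = s[255 - t];    // each thread loads an element written by another thread
}
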
Shared memory has 32 banks that are organized such that successive 32-bit words map
to successive banks that can be accessed simultaneously. Any shared memory read or
write request made up of 32 addresses that fall in 32 distinct memory banks can therefore
be serviced simultaneously, yielding an overall bandwidth that is 32 times as high as
the bandwidth of a single request. However, if two addresses of a memory request fall
in the same memory bank, there is a bank conflict and the access has to be serialized.
The exception to this rule is when multiple threads read the same 32-bit word, which
results in a broadcast where the word is sent to all requesting threads in a single
transaction.
To get maximum performance, it is therefore important to understand how memory
addresses map to memory banks in order to schedule the memory requests so as to
minimize bank conflicts.
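
As an illustrative sketch of this mapping (not part of the sample), the bank serving a
given 32-bit word of shared memory follows directly from the organization described
above:

// 32 banks, 4-byte (32-bit) words: successive words map to successive banks.
// byteOffset is the byte offset of a word within shared memory.
__device__ unsigned bankIndex(unsigned byteOffset)
{
    return (byteOffset / 4) % 32;
}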

Chapter 2.
APPLICATION

The sample CUDA application transposes a matrix of floats. The input and output
matrices are at separate memory locations. For simplicity it only handles square matrices
whose dimensions are integral multiples of 32, the tile size.
The sharedBankConflicts sample is available with Nsight Compute under
<nsight-compute-install-directory>/extras/samples/sharedBankConflicts.

Chapter 3.
CONFIGURATION

The profiling results included in this document were collected on the following
configuration:
‣ Target system: Linux (x86_64) with an NVIDIA RTX A4500 (Ampere GA102) GPU
‣ Nsight Compute version: 2023.3.1
The Nsight Compute UI screenshots in this document were taken by opening the profiling
reports on a Windows 10 system.

Chapter 4.
INITIAL VERSION OF THE KERNEL

The initial version of the kernel transposeCoalesced uses shared memory to ensure that
global memory accesses for loading data from the input matrix idata and storing data
in the output matrix odata are coalesced. The matrix is sub-divided into tiles of size 32 x
32. The tile size is defined as:

#define TILE_DIM 32

For simplicity, the code only handles square matrices whose dimensions are integral
multiples of 32, the tile size. Each block transposes a tile of 32 x 32 elements. Each thread
in the block transposes TILE_DIM/BLOCK_ROWS = 4 elements, where BLOCK_ROWS
is defined as:

#define BLOCK_ROWS 8

TILE_DIM must be an integral multiple of BLOCK_ROWS.
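
For reference, a sketch of how such a kernel is typically launched (the host-side variable
names here are illustrative; width and height are assumed to be multiples of TILE_DIM):

dim3 grid(width / TILE_DIM, height / TILE_DIM);   // one block per 32 x 32 tile
dim3 block(TILE_DIM, BLOCK_ROWS);                 // 32 x 8 = 256 threads per block
transposeCoalesced<<<grid, block>>>(d_odata, d_idata, width, height);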


The way to avoid uncoalesced global memory access is to read the data into shared
memory, and have each warp access noncontiguous locations in shared memory in order
to write contiguous data to odata. The above procedure requires that each element in a
tile be accessed by different threads, so a __syncthreads() call is required to ensure
that all reads from idata to shared memory have completed before writes from shared
memory to odata commence.

__global__ void transposeCoalesced(float* odata, float* idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int indexIn = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int indexOut = xIndex + yIndex * height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    {
        tile[threadIdx.y + i][threadIdx.x] = idata[indexIn + i * width];
    }

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    {
        odata[indexOut + i * height] = tile[threadIdx.x][threadIdx.y + i];
    }
}

A depiction of the data flow of a warp in the coalesced transpose kernel is given below.
The warp reads four rows of the idata matrix tile and writes them to the 32x32 shared
memory array "tile", indicated by the yellow line segments. After a __syncthreads()
call ensures that all writes to tile have completed, the warp reads four columns of tile and
writes them to four rows of an odata matrix tile, indicated by the green line segments.

Profile the initial version of the kernel


There are multiple ways to profile kernels with Nsight Compute. For full details see the
Nsight Compute Documentation. One example workflow to follow for this sample:

‣ Refer to the README distributed with the sample for instructions on building the application
‣ Run ncu-ui on the host system
‣ Use a local connection if the GPU is on the host system. If the GPU is on a remote
system, set up a remote connection to the target system
‣ Use the Profile activity to profile the sample application
‣ Choose the full section set
‣ Use defaults for all other options
‣ Set a report name and then click on Launch

Summary page
The Summary page lists the kernels profiled and provides some key metrics for each
profiled kernel. It also lists the performance opportunities and estimated speedup for
each. In this sample we have only one kernel launch.
The duration of this initial version of the kernel is 1.42 milliseconds, and this is used as
the baseline for further optimizations.

For this kernel the Summary page shows three performance opportunities. The topmost
performance opportunity is Uncoalesced Shared Accesses, and it suggests checking the
L1 Wavefronts Shared Excessive table for the primary source locations. Click on the
Uncoalesced Shared Accesses rule link to see more context; it opens the Source Counters
section on the Details page.

Details page
The Source Counters section table for the metric L1 Wavefronts Shared Excessive,
which is an indicator of shared memory bank conflicts, lists the source lines with the
highest values.

We can also check the Memory Workload Analysis section. It shows a hint for
Shared Load Bank Conflicts and suggests looking at the Source Counters section for
uncoalesced shared loads. The Shared Memory table shows a high count of bank
conflicts.

Click on the Apply Rules button at the top to apply the rules, so that we can also see the
hints at the source line level on the Source page. In the Source Counters section table for
the metric L1 Wavefronts Shared Excessive, click on one of the source lines to view
the kernel source where the bottleneck occurs.

Source page
The CUDA source for the kernel is shown. When opening the Source page from the Source
Counters section, the Navigation metric is automatically set to match, in this case
the L1 Wavefronts Shared Excessive metric. You can see this from the bold
column header. The source line at which the bottleneck occurs is highlighted.
It shows shared memory bank conflicts at line #95:

odata[indexOut + i * height] = tile[threadIdx.x][threadIdx.y + i];

The Source page shows notifications as Source Markers in the left margin of the source
code. Hovering the mouse over a marker shows details for the specific source line in a
pop-up window.

Chapter 5.
UPDATED VERSION OF THE KERNEL

Considering the shared memory bank conflicts reported by the profiler, we analyze the
shared memory access pattern. The coalesced transpose uses a 32x32 shared memory
array of floats. For an array of this size, all data in each column is mapped to the same
shared memory bank. As a result, when writing columns from the tile in shared memory
to rows in odata, the warp experiences a 32-way bank conflict and the request is serialized.
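
To see why, consider the bank arithmetic for the unpadded tile (a sketch, not code from
the sample, using the 32-bank mapping described in the introduction; TILE_DIM is 32):

__device__ unsigned bankOfUnpadded(unsigned row, unsigned col)
{
    // Word offset of tile[row][col] is row * TILE_DIM + col,
    // so with TILE_DIM == 32 the bank is simply col, independent of row.
    return (row * TILE_DIM + col) % 32;
}

Reading one column of the tile (col fixed, row ranging over 0..31 across the warp)
therefore hits the same bank 32 times, which is the 32-way conflict reported by the
profiler.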

A simple way to avoid this conflict is to pad the shared memory array by one column:

__shared__ float tile[TILE_DIM][TILE_DIM+1];

The padding does not affect the shared memory bank access pattern when a warp writes
data to shared memory, which remains conflict free. But with the single extra column, a
warp's access to a column of data in shared memory is now also conflict free. In the
diagram below, the elements of the extra column added for padding are shown with a
grey background.
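
Repeating the bank arithmetic for the padded tile (again a sketch, not code from the
sample):

__device__ unsigned bankOfPadded(unsigned row, unsigned col)
{
    // Word offset of tile[row][col] is now row * (TILE_DIM + 1) + col,
    // so the bank is (row + col) % 32.
    return (row * (TILE_DIM + 1) + col) % 32;
}

With col fixed and row ranging over 0..31 across the warp, the 32 accesses now fall in
32 distinct banks, so reading a column of the tile no longer causes a conflict.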

Profile the updated kernel


The kernel duration has reduced from 1.42 milliseconds to 997.63 microseconds. We can
set the initial version of the kernel as a baseline and compare the profiling results.

We can confirm that there are no shared memory bank conflicts by looking at the Shared
Memory metrics table under the Memory workload analysis section.

Note that the reported bank conflicts in the shared memory metrics table under the
Memory Workload Analysis section include:

‣ (A) conflicts within the warp due to the shared memory access pattern for the active
threads of the warp; and
‣ (B) additional conflicts that are caused by multiple clients trying to access the
memory banks at the same time, as the L1 Cache and Shared Memory are both
backed by the same physical memory banks.
The Source Counters section on the Details page and the Source page only count
conflicts of type (A). So in some cases there can be a difference in bank conflict counts
between the Memory Workload Analysis section and the source counters. Also, due to
conflicts of type (B), in some cases the bank conflicts can be non-zero for the
transposeNoBankConflicts kernel in the shared memory table.

Chapter 6.
RESOURCES

‣ GPU Technology Conference 2022 talk S41723: How to Understand and Optimize
Shared Memory Accesses using Nsight Compute
‣ NVIDIA CUDA Sample transpose document - Optimizing Matrix Transpose in CUDA:
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/6_Performance/transpose/doc/MatrixTranspose.pdf
‣ NVIDIA CUDA Sample transpose source code transpose.cu
‣ Nsight Compute Documentation

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.

Copyright
© 2023-2024 NVIDIA Corporation and affiliates. All rights reserved.

This product includes software developed by the Syncro Soft SRL (http://
www.sync.ro/).
