John Owens
Department of Electrical and Computer Engineering
Institute for Data Analysis and Visualization
University of California, Davis
Goals for this Hour
• Multi-GPU computing
• Single-GPU computing
“If you were plowing a field, which
would you rather use? Two strong
oxen or 1024 chickens?”
—Seymour Cray
Recent GPU Performance Trends
[Chart: Historical Single-/Double-Precision Peak Compute Rates (GFLOPS, log scale) for AMD GPUs, NVIDIA GPUs, and Intel CPUs, annotated with memory bandwidth and price: Radeon HD 5870 at 153.6 GB/s ($390), GeForce GTX 480 at 177.4 GB/s ($450), Xeon X7560 at 34 GB/s ($3692).]
• Double precision
• Fast atomics
• Hardware cache & ECC
• (CUDA) debuggers & profilers
Intel ISCA Paper (June 2010)
Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU
ABSTRACT: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences …

From the introduction: The past decade has seen a huge increase in digital content as more documents are being created in digital form than ever before. Moreover, the web has become the medium of choice for storing and delivering information such as stock market data, personal records, and news. Soon, the amount of digital data will exceed exabytes (10^18) [31]. The massive amount of data makes storing, cataloging, processing, and retrieving information challenging. A new class of applications has emerged across different domains such as database, games, video, and finance that can process this huge amount of data to distill and deliver appropriate content to users. A distinguishing feature of these applications is that they have plenty of data level parallelism and the data can be processed independently and in any order on different processing elements for a similar set of operations such as filtering, aggregating, ranking, etc. This feature together with a processing deadline defines throughput computing applications. Going forward, as digital data continues to grow rapidly, throughput computing applications are …
Top-Level Results
[Table from the paper: peak memory bandwidth (GB/s), single-precision (SP), and double-precision (DP) floating-point rates for the Core i7-960 and the GTX280 (e.g., 311.1/933.1 SP GFLOPS and 77.8 DP GFLOPS for the GTX280).]
CUDA Successes
[Chart: double-precision Gflop/s by platform, log scale: Fermi ≈ 515, C1060 ≈ 78, Nehalem x 2 ≈ 86, Nehalem ≈ 43; annotations mark roughly 6x (Case studies 2 & 3) and roughly 3x (Case study 1) differences.]
A Modern Computer
[Diagram, built up over several slides: a CPU and a discrete GPU, each with its own memory, connected through the chipset over PCI Express; the chipset also attaches to the network. Successive builds add the operations in a multi-GPU program: a kernel call on the GPU, a memory transfer between GPU and CPU memory, and a network send/receive.]
Mellanox GPUDirect
[Diagram: two nodes whose GPUs communicate with each other over InfiniBand.]
Fast & Flexible Communication
Marshal data
Send to GPU
Receive from CPU
Call kernel
Execute kernel
Retrieve from GPU
Send to CPU
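A minimal sketch of the steps above, assuming a CUDA + MPI program that stages data through host memory; h_buf, d_buf, n, src, dst, and the process kernel are illustrative names, and marshaling and error handling are omitted:

// One pipeline iteration: receive, push to the GPU, compute, pull back, forward.
MPI_Recv(h_buf, n, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receive from CPU
cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);      // send to GPU
process<<<(n + 255) / 256, 256>>>(d_buf, n);                              // call & execute kernel
cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);      // retrieve from GPU
MPI_Send(h_buf, n, MPI_FLOAT, dst, 0, MPI_COMM_WORLD);                    // send to CPU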
Structuring Multi-GPU Programs
• Static division of work (Global Arrays: Zippy, CUDASA)
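A hedged sketch of such a static division, assuming CUDA 4.0 or later where one host thread can drive several devices; the array names, element count N, and the scale kernel are illustrative:

// Give each GPU a fixed, contiguous slice of the input.
int ndev = 0;
cudaGetDeviceCount(&ndev);
int chunk = N / ndev;                              // assume N divides evenly
for (int d = 0; d < ndev; ++d) {
    cudaSetDevice(d);                              // subsequent calls target GPU d
    float *d_slice;
    cudaMalloc((void**)&d_slice, chunk * sizeof(float));
    cudaMemcpy(d_slice, h_data + d * chunk, chunk * sizeof(float),
               cudaMemcpyHostToDevice);
    scale<<<(chunk + 255) / 256, 256>>>(d_slice, chunk);
    // ...copy results back and cudaFree(d_slice) once each GPU finishes...
}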
Programming Models
Abstractions
Mechanisms
Example
• Abstraction: GPU initiates network send
• Solution:
  • CPU allocates “mailbox” in GPU memory
  • GPU sets mailbox to initiate network send
• MPI-like interface
• Collectives
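A hedged sketch of the mailbox mechanism, using mapped (zero-copy) host memory so the GPU can raise a flag that the CPU polls before issuing the network send; the names, the single-block launch, and the stand-in work are illustrative:

// GPU side: do the work, then raise the mailbox flag.
__global__ void produce(float *out, int n, volatile int *mailbox) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = 2.0f * i;              // stand-in for real work
    __syncthreads();                    // all results written
    __threadfence_system();             // make them visible to the host
    if (threadIdx.x == 0) *mailbox = 1; // signal "ready to send"
}

// CPU side (inside main(), after d_out and n are set up):
volatile int *mb_h; int *mb_d;
cudaHostAlloc((void**)&mb_h, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&mb_d, (void*)mb_h, 0);
*mb_h = 0;
produce<<<1, 256>>>(d_out, n, mb_d);
while (*mb_h == 0) { /* spin until the GPU raises the flag */ }
// ...initiate the network send here (e.g., MPI_Send of the marshaled buffer)...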
• Process data in chunks
• Overlap communication
[Diagram: two CPU + GPU pairs, each running its own scheduler and reduce stage.]
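A hedged sketch of chunked processing that overlaps transfers with computation, assuming pinned host memory, double buffering on the device, and two CUDA streams; CHUNK, nChunks, the buffers, and the process kernel are illustrative:

// Copies queued in one stream can overlap the kernel running in the other.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
for (int c = 0; c < nChunks; ++c) {
    int k = c % 2;                       // alternate between the two streams
    float *h = h_data + c * CHUNK;       // h_data allocated with cudaHostAlloc
    float *d = d_buf  + k * CHUNK;       // double-buffered device storage
    cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[k]);
    process<<<CHUNK / 256, 256, 0, s[k]>>>(d, CHUNK);
    cudaMemcpyAsync(h, d, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
}
cudaDeviceSynchronize();                 // wait for all chunks to finish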
Why is data-parallel computing fast?
• The GPU is specialized for compute-intensive, highly parallel
computation (exactly what graphics rendering is about)
• So, more transistors can be devoted to data processing rather than data
caching and flow control
[Diagram: relative die area; the CPU devotes much of its area to control logic and cache, while the GPU devotes most of its area to ALUs; each has its own DRAM.]
Programming Model: A Massively Multi-threaded Processor
• Lightweight threads
• Today:
• GPU hardware
• CUDA programming environment
Big Idea #1
• Instant switching
• SM has 8 SP Thread Processors
• Scalar ISA
• Up to 768 threads, hardware multithreaded
[Diagram: streaming multiprocessors, each with a multithreaded instruction unit (MT IU), SP thread processors, and shared memory.]
[Diagram: full GPU with an input assembler feeding many thread processors grouped with parallel data caches; load/store units connect to global memory.]
NVIDIA Fermi
Performance
• 7x Double Precision of CPUs
• IEEE 754-2008 SP & DP Floating Point
• Increased Shared Memory from 16 KB to 64 KB
• Added L1 and L2 Caches
Flexibility
• ECC on all Internal and External Memories
• Enable up to 1 TeraByte of GPU Memories
• High Speed GDDR5 Memory Interface
• Multiple Simultaneous Tasks on GPU
Usability
• 10x Faster Atomic Operations
• C++ Support
• System Calls, printf support
• Latency hiding.
• Scalable performance
[Diagram: two GPUs of different sizes, each with its own host interface, thread processors, parallel data caches, and load/store units, illustrating that the same program scales across chip sizes.]
Compiling CUDA for GPUs
[Diagram: the NVIDIA C compiler splits a C/C++ CUDA application into NVIDIA assembly for computing (PTX) and CPU host code. The PTX path feeds the CUDA driver, debugger, and profiler, and a specialized PTX-to-target translator produces device code for each target GPU; the host code is built with a standard C compiler and runs on the CPU.]
Programming Model (SPMD + SIMD): Thread Batching
• Two threads from two different blocks cannot cooperate
[Diagram: a block shown as a 2D array of threads, e.g. Thread (0,1) through Thread (4,2).]
• n = length(C)
• for i = 0 to n-1: C[i] = A[i] + B[i]
Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition

Device Code
__global__ void vecAdd(float* A, float* B, float* C){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Host Code
int main(){
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Synchronization of blocks
• Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …
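For instance, a barrier is what makes it safe for threads to read shared-memory locations written by other threads in the same block; a hedged sketch (the within-block reversal is illustrative, not from the slides; launch with 256-thread blocks):

__global__ void reverseWithinBlock(float *d, int n) {
    __shared__ float tile[256];                   // one element per thread
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    tile[threadIdx.x] = (i < n) ? d[i] : 0.0f;    // Step 1: stage into shared memory
    __syncthreads();                              // barrier: Step 1 complete everywhere
    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) d[i] = tile[j];                    // Step 2: read other threads' writes
}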
• Memory management: cudaMalloc(), cudaFree()
• Texture management
• OpenGL & Direct3D interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C){
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main(){
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
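The launch above assumes N is a multiple of 256. A hedged variant (not from the slides) rounds the block count up and passes the length so the extra threads can be masked off:

__global__ void vecAdd(float* A, float* B, float* C, int n){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // threads past the end do nothing
}
// launch: vecAdd<<<(N + 255)/256, 256>>>(d_A, d_B, d_C, N);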
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;
// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
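The fragment ends after the declarations; a hedged sketch of how the host code typically continues for this example (allocation, copies, launch, copy-back, cleanup), with N as before and h_C introduced here for the result:

cudaMalloc((void**)&d_A, N * sizeof(float));
cudaMalloc((void**)&d_B, N * sizeof(float));
cudaMalloc((void**)&d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// run N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy the result back, then release device memory
float *h_C = (float*)malloc(N * sizeof(float));
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);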
• Irregularity