
GPU Teaching Kit

Accelerated Computing

Lecture 2.1 - Introduction to CUDA C


CUDA C vs. Thrust vs. CUDA Libraries
Objective
– To learn the main venues and developer resources
for GPU computing
– Where CUDA C fits in the big picture

2
3 Ways to Accelerate Applications

Applications can be accelerated in three ways:

– Libraries: Easy to use, Most Performance
– Compiler Directives: Easy to use, Portable code
– Programming Languages: Most Performance, Most Flexibility

3
Libraries: Easy, High-Quality Acceleration

– Ease of use: Using libraries enables GPU acceleration without in-depth knowledge of GPU programming

– "Drop-in": Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes

– Quality: Libraries offer high-quality implementations of functions encountered in a broad range of applications
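
As an illustration of the "drop-in" point (a sketch, not from the slides; it assumes device arrays d_x and d_y of length n already hold the inputs, and the result overwrites d_y), the running vector-addition example maps onto a single cuBLAS SAXPY call with alpha = 1:

#include <cublas_v2.h>

// Sketch only: d_x and d_y are assumed device arrays; link with -lcublas.
void vecAddWithCublas(const float *d_x, float *d_y, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                            // create a cuBLAS context

    const float alpha = 1.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // d_y = alpha * d_x + d_y

    cublasDestroy(handle);
}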

4
GPU Accelerated Libraries

– Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE

– Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND

– Data Struct. & AI (Sort, Scan, Zero Sum Games): GPU AI – Board Games, GPU AI – Path Finding

– Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

5
Vector Addition in Thrust

thrust::device_vector<float> deviceInput1(inputLength);
thrust::device_vector<float> deviceInput2(inputLength);
thrust::device_vector<float> deviceOutput(inputLength);

thrust::copy(hostInput1, hostInput1 + inputLength,
             deviceInput1.begin());
thrust::copy(hostInput2, hostInput2 + inputLength,
             deviceInput2.begin());

thrust::transform(deviceInput1.begin(), deviceInput1.end(),
                  deviceInput2.begin(), deviceOutput.begin(),
                  thrust::plus<float>());
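
For completeness (this line is not on the slide; hostOutput is an assumed host buffer of at least inputLength floats), the result typically comes back to the host with one more copy:

thrust::copy(deviceOutput.begin(), deviceOutput.end(), hostOutput);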

6
Compiler Directives: Easy, Portable
Acceleration

– Ease of use: Compiler takes care of details of parallelism management and data movement

– Portable: The code is generic, not specific to any particular hardware, and the same directives are available in multiple languages

– Uncertain: Performance of code can vary across compiler versions

7
OpenACC

– Compiler directives for C, C++, and FORTRAN

#pragma acc parallel loop \
    copyin(input1[0:inputLength], input2[0:inputLength]) \
    copyout(output[0:inputLength])
for (i = 0; i < inputLength; ++i) {
    output[i] = input1[i] + input2[i];
}

8
Programming Languages: Most Performance and
Flexible Acceleration

– Performance: Programmer has best control of parallelism and data movement

– Flexible: The computation does not need to fit into a limited set of library patterns or directive types

– Verbose: The programmer often needs to express more details

9
GPU Programming Languages

Numerical analytics MATLAB, Mathematica, LabVIEW

Fortran CUDA Fortran

C CUDA C

C++ CUDA C++

Python PyCUDA, Copperhead, Numba

F# Alea.cuBase

10
CUDA - C

Applications can be accelerated in three ways:

– Libraries: Easy to use, Most Performance
– Compiler Directives: Easy to use, Portable code
– Programming Languages: Most Performance, Most Flexibility

11
GPU Teaching Kit
Accelerated Computing

Lecture 2.2 - Introduction to CUDA C


Memory Allocation and Data Movement API Functions
Objective
– To learn the basic API functions in CUDA host code
– Device Memory Allocation
– Host-Device Data Transfer

2
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements

    vecAdd(h_A, h_B, h_C, N);
}
4
Heterogeneous Computing vecAdd CUDA Host Code

[Figure: Host Memory / CPU and Device Memory / GPU, with arrows for Part 1 (copy A and B to the device), Part 2 (kernel execution on the device), and Part 3 (copy C back to the host)]

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory

    // Part 2
    // Kernel launch code – the device performs the actual vector addition

    // Part 3
    // Copy C from the device memory
    // Free device vectors
}

5
Partial Overview of CUDA Memories
– Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory
– Host code can:
  – Transfer data to/from per-grid global memory

[Figure: (Device) Grid containing Blocks (0,0) and (0,1), each with Threads (0,0) and (0,1) and their per-thread Registers; the Host connects to the device Global Memory]

– We will cover more memory types and more sophisticated memory models later.

6
CUDA Device Memory Management API functions
– cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters
    – Address of a pointer to the allocated object
    – Size of allocated object in terms of bytes
– cudaFree()
  – Frees object from device global memory
  – One parameter
    – Pointer to freed object
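
A minimal sketch of the calling pattern (assuming an array of n floats; the vector-addition host code two slides ahead shows it in full):

float *d_A = NULL;
int size = n * sizeof(float);         // size in bytes, as cudaMalloc expects
cudaMalloc((void **) &d_A, size);     // pass the address of the pointer
// ... use d_A in data transfers and kernel launches ...
cudaFree(d_A);                        // release the device global memory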

7
Host-Device Data Transfer API functions
– cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/Direction of transfer
  – Transfer to device is asynchronous

8
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

9
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);

if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
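
Checking every call this way gets repetitive; a common convention (the macro name CUDA_CHECK is illustrative, not part of the CUDA API, and it assumes <stdio.h> and <stdlib.h> are included) is to wrap the test in a macro:

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            printf("%s in %s at line %d\n",                           \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void **) &d_A, size));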

10
GPU Teaching Kit
Accelerated Computing

Lecture 2.3 – Introduction to CUDA C


Threads and Kernel Functions
Objective
– To learn about CUDA threads, the main mechanism for exploiting data parallelism
– Hierarchical thread organization
– Launching parallel execution
– Thread index to data index mapping

2
Data Parallelism - Vector Addition Example

vector A A[0] A[1] A[2] … A[N-1]

vector B B[0] B[1] B[2] … B[N-1]

+ + + +

vector C C[0] C[1] C[2] … C[N-1]

3
CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel code

Serial Code (host)

Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args); ...

Serial Code (host)

Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args); ...

4
From Natural Language to Electrons

Natural Language (e.g., English)


Algorithm
High-Level Language (C/C++…)
Compiler
Instruction Set Architecture
Microarchitecture
Circuits
Electrons

©Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond

5
A program at the ISA level
– A program is a set of instructions stored in memory that can be read,
interpreted, and executed by the hardware.
– Both CPUs and GPUs are designed based on (different) instruction sets

– Program instructions operate on data stored in memory and/or registers.

6
A Thread as a Von-Neumann Processor
A thread is a "virtualized" or "abstracted" Von Neumann processor.

[Figure: Von Neumann model, with Memory and I/O connected to a Processing Unit (ALU, Register File) and a Control Unit (PC, IR)]

7
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (Single Program Multiple Data)
  – Each thread has indexes that it uses to compute memory addresses and make control decisions

[Figure: threads 0, 1, 2, …, 254, 255, each executing:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

8
Thread Blocks: Scalable Cooperation
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each containing threads 0 … 255, and each thread executing:
  i = blockIdx.x * blockDim.x + threadIdx.x;
  C[i] = A[i] + B[i];]

– Divide thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Threads in different blocks do not interact
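
A minimal sketch (not on this slide) of how this index mapping typically appears inside a complete kernel; the bounds check guards the last block when n is not a multiple of the block size:

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // the last block may have spare threads
        C[i] = A[i] + B[i];
}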

9
blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

[Figure: a device Grid of Blocks (0,0), (0,1), (1,0), (1,1); Block (1,1) expanded to show its threads indexed in three dimensions, Thread (0,0,0) through Thread (1,0,3)]
10
GPU Teaching Kit
Accelerated Computing

Lecture 2.4 – Introduction to CUDA C


Introduction to the CUDA Toolkit
Objective
– To become familiar with some valuable tools and resources from the
CUDA Toolkit
– Compiler flags
– Debuggers
– Profilers

2
GPU Programming Languages

Numerical analytics MATLAB, Mathematica, LabVIEW

Fortran CUDA Fortran

C CUDA C

C++ CUDA C++

Python PyCUDA, Copperhead, Numba, NumbaPro

F# Alea.cuBase

3
CUDA - C

Applications can be accelerated in three ways:

– Libraries: Easy to use, Most Performance
– Compiler Directives: Easy to use, Portable code
– Programming Languages: Most Performance, Most Flexibility

4
NVCC Compiler
– NVIDIA provides a CUDA-C compiler
– nvcc
– NVCC compiles device code then forwards code on to the host
compiler (e.g. g++)
– Can be used to compile & link host only applications

5
Example 1: Hello World
#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Instructions:
1. Build and run the hello world code
2. Modify Makefile to use nvcc
instead of g++
3. Rebuild and run

6
CUDA Example 1: Hello World
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Instructions:
1. Add kernel and kernel launch to
main.cu
2. Try to build

7
CUDA Example 1: Build Considerations
– Build failed
– Nvcc only parses .cu files for CUDA
– Fixes:
– Rename main.cc to main.cu
OR
– nvcc –x cu
– Treat all input files as .cu files

Instructions:
1. Rename main.cc to main.cu
2. Rebuild and Run

8
Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:

$ nvcc main.cu
$ ./a.out
Hello World!

– mykernel() does nothing (somewhat anticlimactic!)

9
Developer Tools - Debuggers

– NVIDIA Provided: NSIGHT, CUDA-GDB, CUDA-MEMCHECK
– 3rd Party: see the link below

https://developer.nvidia.com/debugging-solutions

10
Compiler Flags
– Remember there are two compilers being used
– NVCC: Device code
– Host Compiler: C/C++ code
– NVCC supports some host compiler flags
– If a flag is unsupported, use -Xcompiler to forward it to the host compiler
– e.g. -Xcompiler -fopenmp (see the example commands after this list)
– Debugging Flags
– -g: Include host debugging symbols
– -G: Include device debugging symbols
– -lineinfo: Include line information with symbols
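
For example, a debug build and a build that forwards an OpenMP flag to the host compiler might look like this (source and output names are illustrative):

$ nvcc -g -G -o app main.cu
$ nvcc -Xcompiler -fopenmp -o app main.cu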

11
CUDA-MEMCHECK
– Memory debugging tool
– No recompilation necessary
%> cuda-memcheck ./exe
– Can detect the following errors
– Memory leaks
– Memory errors (OOB, misaligned access, illegal instruction, etc)
– Race conditions
– Illegal Barriers
– Uninitialized Memory
– For line numbers use the following compiler flags:
– -Xcompiler -rdynamic -lineinfo

http://docs.nvidia.com/cuda/cuda-memcheck
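
As an illustrative sketch (not the actual Example 2 source), this is the kind of out-of-bounds device write that cuda-memcheck catches:

__global__ void writeIds(int *out)
{
    // Out-of-bounds write if the block has more threads than elements in out
    out[threadIdx.x] = threadIdx.x;
}

// Launching writeIds<<<1, 11>>>(d_out) when d_out holds only 10 ints produces the
// kind of invalid global write that cuda-memcheck reports, with file and line
// information when the build adds -Xcompiler -rdynamic -lineinfo.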

12
Example 2: CUDA-MEMCHECK

Instructions:
1. Build & Run Example 2
Output should be the numbers 0-9
Do you get the correct results?
2. Run with cuda-memcheck
%> cuda-memcheck ./a.out
3. Add nvcc flags "-Xcompiler -rdynamic -lineinfo"
4. Rebuild & Run with cuda-memcheck
5. Fix the illegal write

http://docs.nvidia.com/cuda/cuda-memcheck

13
CUDA-GDB
– cuda-gdb is an extension of GDB
– Provides seamless debugging of CUDA and CPU code
– Works on Linux and Macintosh
– For a Windows debugger use NSIGHT Visual Studio Edition

http://docs.nvidia.com/cuda/cuda-gdb

14
Example 3: cuda-gdb

Instructions:
1. Run exercise 3 in cuda-gdb
%> cuda-gdb --args ./a.out
2. Run a few cuda-gdb commands:
(cuda-gdb) b main //set break point at main
(cuda-gdb) r //run application
(cuda-gdb) l //print line context
(cuda-gdb) b foo //break at kernel foo
(cuda-gdb) c //continue
(cuda-gdb) cuda thread //print current thread
(cuda-gdb) cuda thread 10 //switch to thread 10
(cuda-gdb) cuda block //print current block
(cuda-gdb) cuda block 1 //switch to block 1
(cuda-gdb) d //delete all break points
(cuda-gdb) set cuda memcheck on //turn on cuda memcheck
(cuda-gdb) r //run from the beginning
3. Fix Bug
http://docs.nvidia.com/cuda/cuda-gdb

15
Developer Tools - Profilers

– NVIDIA Provided: NSIGHT, NVVP, NVPROF
– 3rd Party: TAU, VampirTrace

https://developer.nvidia.com/performance-analysis-tools

16
NVPROF
Command Line Profiler
– Compute time in each kernel
– Compute memory transfer time
– Collect metrics and events
– Support complex process hierarchies
– Collect profiles for NVIDIA Visual Profiler
– No need to recompile

17
Example 4: nvprof

Instructions:
1. Collect profile information for the matrix add
example
%> nvprof ./a.out
2. How much faster is add_v2 than add_v1?
3. View available metrics
%> nvprof --query-metrics
4. View global load/store efficiency
%> nvprof --metrics
gld_efficiency,gst_efficiency ./a.out
5. Store a timeline to load in NVVP
%> nvprof -o profile.timeline ./a.out
6. Store analysis metrics to load in NVVP
%> nvprof -o profile.metrics --analysis-metrics ./a.out

18
NVIDIA’s Visual Profiler (NVVP)

[Figure: NVVP screenshot showing the Timeline view and Guided System Analysis]

19
Example 4: NVVP

Instructions:
1. Import nvprof profile into NVVP
Launch nvvp
Click File/ Import/ Nvprof/ Next/ Single
process/ Next / Browse
Select profile.timeline
Add Metrics to timeline
Click on 2nd Browse
Select profile.metrics
Click Finish
2. Explore Timeline
Control + mouse drag in timeline to zoom in
Control + mouse drag in measure bar (on top)
to measure time

20
Example 4: NVVP
Instructions:
1. Click on a kernel
2. On the Analysis tab click on the unguided analysis
3. Click Analyze All

Explore metrics and properties
What differences do you see between the two kernels?

Note:
If kernel order is non-deterministic you can only load the timeline or the metrics, but not both.
If you load just metrics the timeline looks odd but metrics are correct.

21
Example 4: NVVP
Let’s now generate the same data within NVVP
Instructions:
1. Click File / New Session / Browse
Select Example 4/a.out
Click Next / Finish

2. Click on a kernel
Select Unguided Analysis
Click Analyze All

22
NVTX
– Our current tools only profile API calls on the host
– What if we want to understand better what the host is doing?
– The NVTX library allows us to annotate profiles with ranges
– Add: #include <nvToolsExt.h>
– Link with: -lnvToolsExt
– Mark the start of a range
– nvtxRangePushA("description");
– Mark the end of a range
– nvtxRangePop();
– Ranges are allowed to overlap

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/
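
A minimal sketch of how these calls are typically used (the range name and the annotated work are illustrative, not from the slides):

#include <nvToolsExt.h>                 // link with -lnvToolsExt

void initInputs(float *h_A, float *h_B, int n)
{
    nvtxRangePushA("init inputs");      // start of an annotated range
    for (int i = 0; i < n; ++i) {       // host-side work we want to see on the timeline
        h_A[i] = (float) i;
        h_B[i] = (float) (n - i);
    }
    nvtxRangePop();                     // end of the range; appears in the NVVP timeline
}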

23
NVTX Profile

24
NSIGHT
– CUDA enabled Integrated Development Environment
– Source code editor: syntax highlighting, code refactoring, etc
– Build Manager
– Visual Debugger
– Visual Profiler
– Linux/Macintosh
– Editor = Eclipse
– Debugger = cuda-gdb with a visual wrapper
– Profiler = NVVP
– Windows
– Integrates directly into Visual Studio
– Profiler is NSIGHT VSE

25
Example 4: NSIGHT
Let’s import an existing Makefile project into NSIGHT
Instructions:
1. Run nsight
Select default workspace
2. Click File / New / Makefile Project With Existing Code
3. Enter Project Name and select the Example15
directory
4. Click Finish
5. Right Click On Project / Properties / Run
Settings / New / C++ Application
6. Browse for Example 4/a.out
7. In Project Explorer double click on main.cu and
explore source
8. Click on the build icon
9. Click on the run icon
10. Click on the profile icon
26
Profiler Summary
– Many profile tools are available
– NVIDIA Provided
– NVPROF: Command Line
– NVVP: Visual profiler
– NSIGHT: IDE (Visual Studio and Eclipse)
– 3rd Party
– TAU
– VAMPIR

27
Optimization

[Figure: the optimization cycle: Assess → Parallelize → Optimize → Deploy]

28
Assess

[Figure: profiler output with the hotspots highlighted]

– Profile the code, find the hotspot(s)
– Focus your attention where it will give the most benefit

29
Parallelize

Applications can be parallelized using Libraries, Compiler Directives, or Programming Languages (as introduced earlier).

30
Optimize

[Figure: NVVP Timeline view and Guided System Analysis]

31
Bottleneck Analysis

– Don't assume an optimization was wrong
– Verify if it was wrong with the profiler

[Figure: profiler memory throughput before vs. after the change, 129 GB/s vs. 84 GB/s]

32
Performance Analysis

[Figure: profiler memory throughput before vs. after optimization, 84 GB/s vs. 137 GB/s]

33
GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
