Mastering GPU Parallel Programming with CUDA: (HW & SW) [Udemy] [Hamdy egy]

Thread starter · Apr 10, 2026

Group buy: Mastering GPU Parallel Programming with CUDA: (HW & SW) [Udemy] [Hamdy egy]
Original title: Mastering GPU Parallel Programming with CUDA: (HW & SW)
Language: English

Performance Optimization and Analysis for High-Performance Computing

This is a hands-on course that shows how to use the massive parallel-processing power of modern GPUs with CUDA. You will start with the fundamentals of GPU hardware, trace the evolution of key architectures (Fermi, Pascal, Volta, Ampere, Hopper), and learn through practical exercises how to write, profile, and optimize high-performance CUDA kernels.

This is an independent training resource. It is not sponsored by, endorsed by, or affiliated with NVIDIA Corporation. The names CUDA and Nsight and the architecture codenames are trademarks of NVIDIA and are used for reference purposes only.

What you will learn:

gain a comprehensive understanding of GPU architecture compared with CPU architecture

learn the history of GPUs up to the most recent products

understand the internal structure of the GPU

understand the different types of memory and their effect on performance

learn about current technologies in the GPU's internal components

master the basics of CUDA programming on the GPU

start writing GPU programs with CUDA on Windows and Linux

understand the most efficient ways to parallelize

master profiling and performance tuning

learn to use shared memory

What you will master:

GPU vs. CPU fundamentals and why GPUs dominate data-parallel workloads

the evolution of GPU architectures across generations and the hardware features that affect performance

installing the CUDA Toolkit on Windows, Linux, and WSL

core CUDA concepts: threads, blocks, grids, and the memory hierarchy

profiling and optimization with Nsight Compute and nvprof

two-dimensional indexing for working with matrices

optimization techniques: handling data whose size is not a power of two, using shared memory, increasing bandwidth, and reducing warp divergence

debugging and error handling through runtime API checks

By the end of the course, you will be able to design, analyze, and fine-tune CUDA kernels that run efficiently on modern GPUs and are suited to demanding scientific, engineering, and AI workloads.
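
As a rough illustration of the two-dimensional indexing item in the list above, the sketch below reproduces on the CPU the index arithmetic a CUDA thread would typically perform for a row-major matrix. It is plain C++ rather than a kernel, so it runs without a GPU. The names `Dim2` and `flatIndex2D` are hypothetical and are not taken from the course materials.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the x/y parts of CUDA's blockIdx, blockDim and threadIdx.
struct Dim2 {
    std::size_t x;
    std::size_t y;
};

// Maps one (block, thread) pair to a flat row-major element index.
// Returns -1 when the thread falls outside the matrix, which is the case
// a real kernel would guard with an `if (row < height && col < width)` check.
long long flatIndex2D(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx,
                      std::size_t width, std::size_t height) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    const std::size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    const std::size_t row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) {
        return -1;
    }
    return static_cast<long long>(row * width + col);
}
```

For example, with 16x16 blocks, block (1, 2) and thread (3, 4) in a matrix of 100 columns and 50 rows land on column 19, row 36, which is flat index 3619.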

Who this course is for:

anyone interested in GPUs and CUDA

engineering students

researchers

anyone who wants a deeper understanding of GPU parallel programming

Requirements:

basic knowledge of C and C++

basic knowledge of Linux and Windows

basic knowledge of computer architecture

Course content:

12 sections

58 lectures

23 h 3 min total length

Program:

Introduction to Nvidia GPU hardware

GPU vs CPU

Nvidia's history: how the company came to dominate the GPU sector

Relationship between architectures and generations: Hopper, Ampere, GeForce, Tesla

How to identify the architecture and generation

The difference between the GPU and the GPU chip

Architectures and their corresponding chips

Nvidia GPU architectures from Fermi to Hopper

Parameters for comparing different architectures

Half, single, and double precision operations

Compute capability and GPU utilization

Volta, Ampere, Pascal, SIMD

Installing CUDA and other programs

What features are installed with the CUDA Toolkit

Installing CUDA on Windows

Installing WSL to use Linux on Windows

Installing the CUDA Toolkit on Linux

Introduction to CUDA programming

The course GitHub repository

Mapping CUDA software to the hardware

Hello World program: threads and blocks

Compiling CUDA on Linux

Hello World program: Warp_IDs

Vector addition and the steps for any CUDA project

Block and thread indexing

Levels of parallelism when working with very large vectors

Profiling

Querying device properties through the Runtime API

Nvidia-smi and its configurations

Occupancy and latency hiding on the GPU

Number of active blocks per SM

How many blocks can run concurrently on one SM

Getting started with Nsight Compute

Nvidia profiling tools: Nsight Systems, Nsight Compute, nvprof

Error-checking APIs

Performance analysis from the command line

Graphical Nsight Compute on Windows and Linux

Performance analysis of the previous applications

Performance analysis

Vector addition with a size that is not a power of two

Two-dimensional indexing

Matrix addition using two-dimensional blocks and threads

Why the L1 hit rate is zero

Shared Memory and Warp Divergence

Shared memory

Quiz 1

Warp divergence

Debugging tools

Debugging with Visual Studio

Vector Reduction

Vector reduction using global memory only

Understanding the code and profiling the vector reduction

Optimizing the vector reduction

Race condition and debugging options

Optimizing thread utilization

Optimization using shared memory and unrolling

Optimization using shuffle operations

Roofline Model

Roofline analysis: compute-bound and memory-bound applications

Performance Optimization and Analysis for High-Performance Computing

This hands-on course teaches you how to unlock the huge parallel-processing power of modern GPUs with CUDA. You’ll start with the fundamentals of GPU hardware, trace the evolution of flagship architectures (Fermi → Pascal → Volta → Ampere → Hopper), and learn—through code-along labs—how to write, profile, and optimize high-performance kernels.
This is an independent training resource. It is not sponsored by, endorsed by, or otherwise affiliated with NVIDIA Corporation. “CUDA”, “Nsight”, and the architecture codenames are trademarks of NVIDIA and are used here only as factual references.

What you'll learn

Gain a comprehensive understanding of GPU vs. CPU architecture

Learn the history of the graphics processing unit (GPU) up to the most recent products

Understand the internal structure of the GPU

Understand the different types of memory and how they affect performance

Understand the most recent technologies in the GPU's internal components

Understand the basics of CUDA programming on the GPU

Start programming GPUs with CUDA on both Windows and Linux

Understand the most efficient ways to parallelize

Profiling and performance tuning

Leveraging shared memory

What you’ll master

GPU vs. CPU fundamentals – why GPUs dominate data-parallel workloads.

Generational design advances – the hardware features that matter most for performance.

CUDA toolkit installation – Windows, Linux, and WSL, plus first-run sanity checks.

Core CUDA concepts – threads, blocks, grids, and the memory hierarchy, built up with labs such as vector addition.

Profiling & tuning with Nsight Compute / nvprof – measure occupancy, hide latency, and break bottlenecks.

2-D indexing for matrices – write efficient kernels for real-world linear-algebra tasks.

Optimization playbook – handle non-power-of-two data, leverage shared memory, maximize bandwidth, and minimize warp divergence.

Robust debugging & error handling – use runtime-API checks to ship production-ready code.

By the end, you’ll be able to design, analyze, and fine-tune CUDA kernels that run efficiently on today’s GPUs—equipping you to tackle demanding scientific, engineering, and AI workloads.
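
To make the "non-power-of-two data" item above more concrete, here is a minimal sketch of the usual pattern: round the block count up with a ceiling division, then let each thread check its index against the vector length. It emulates the launch with plain C++ loops, so it runs without a GPU. The function names `blocksFor` and `vectorAddEmulated` are hypothetical and do not come from the course.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * threadsPerBlock >= n (ceiling division).
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of a guarded vector-add launch: the two loops stand in for
// the grid of blocks and the threads inside each block.
std::vector<float> vectorAddEmulated(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t threadsPerBlock) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n, 0.0f);
    const std::size_t blocks = blocksFor(n, threadsPerBlock);
    for (std::size_t blockIdx = 0; blockIdx < blocks; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            const std::size_t i = blockIdx * threadsPerBlock + threadIdx;
            // Bounds check: threads in the last block past the end do nothing.
            if (i < n) {
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With 1000 elements and 256 threads per block, `blocksFor` gives 4 blocks (1024 threads), and the bounds check leaves the last 24 threads idle instead of letting them write out of range.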

Who this course is for:

Anyone interested in GPUs and CUDA, such as engineering students, researchers, and others

Requirements

C and C++ basics

Linux and Windows basics

Computer architecture basics

Course Content

Introduction to Nvidia GPU hardware
12 lectures • 2hr 52min

GPU vs CPU (very important)

Nvidia's history (how Nvidia started dominating the GPU sector)

Architectures and Generations relationship [Hopper, Ampere, GeForce and Tesla]

How to know the Architecture and Generation

The difference between the GPU and the GPU Chip

The architectures and the corresponding chips

Nvidia GPU architectures from Fermi to Hopper

Parameters required to compare different architectures

Please don't skip this video. It is pivotal for the whole course.

Half, single and double precision operations

Compute capability and utilization of the GPUs

Before reading any whitepapers, look at this!

Volta+Ampere+Pascal+SIMD (Don't skip)

Installing CUDA and other programs
4 lectures • 22min

What features are installed with the CUDA Toolkit?

Installing CUDA on Windows

Installing WSL to use Linux on Windows

Installing the CUDA Toolkit on Linux

Introduction to CUDA programming
8 lectures • 1hr 52min

The course GitHub repo

Mapping SW from CUDA to HW + introducing CUDA

001: Hello World program (threads - blocks)

Compiling CUDA on Linux

002: Hello World program (Warp_IDs)

003: Vector addition + the steps for any CUDA project

004: Vector addition + blocks and thread indexing + GPU performance

005: Levels of parallelization - vector addition with extra-large vectors

Profiling
9 lectures • 4hr 18min

Query the device properties using the Runtime APIs

Nvidia-smi and its configurations (Linux users)

The GPU's occupancy and latency hiding

Allocated active blocks per SM (important)

How many blocks can we run concurrently per SM?

Starting with Nsight Compute (first issue)

All profiling tools from Nvidia (Nsight Systems, Nsight Compute, nvprof ...)

Error-checking APIs

Nsight Compute performance analysis using the command line

Graphical Nsight Compute (Windows and Linux)

Performance analysis for the previous applications
2 lectures • 45min

Performance analysis

Vector addition with a size that is not a power of 2 (important)

2D Indexing
2 lectures • 1hr 16min

Matrix addition using 2D blocks and threads

Why is the L1 hit rate zero?

Shared Memory + Warp Divergence
2 lectures • 50min

The shared memory

Quiz 1

Warp Divergence

Debugging tools
1 lecture • 40min

Debugging using Visual Studio (important) 1

Vector Reduction
7 lectures • 4hr 30min

Vector Reduction using global memory only (baseline)

Understanding the code and the profiling of the vector reduction

Optimizing the vector reduction (removing the filter)

The Race Condition and the debugging option

Optimizing thread utilization in vector reduction

Optimization using shared memory and unrolling

Optimization using shuffle operations

Roofline model
1 lecture • 43min

Roofline analysis (compute-bound and memory-bound apps)
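
For readers unfamiliar with the roofline model named in the last section, the following is a small sketch of its basic arithmetic: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The function names are hypothetical, and the hardware figures in the note below are made-up round numbers, not measurements of any particular GPU. The model here also ignores caches.

```cpp
#include <algorithm>
#include <cassert>

// Attainable throughput in GFLOP/s under the roofline model:
// min(peak compute, arithmetic intensity * memory bandwidth).
double attainableGflops(double peakGflops, double bandwidthGBs, double flopsPerByte) {
    assert(peakGflops > 0.0 && bandwidthGBs > 0.0 && flopsPerByte >= 0.0);
    return std::min(peakGflops, flopsPerByte * bandwidthGBs);
}

// A kernel is memory-bound when its arithmetic intensity lies below the
// ridge point peak / bandwidth (in FLOP per byte).
bool isMemoryBound(double peakGflops, double bandwidthGBs, double flopsPerByte) {
    assert(bandwidthGBs > 0.0);
    return flopsPerByte < peakGflops / bandwidthGBs;
}

// Arithmetic intensity of single-precision vector addition c[i] = a[i] + b[i]:
// 1 FLOP per element against 12 bytes of traffic (two 4-byte loads, one 4-byte store).
double vectorAddIntensity() {
    return 1.0 / 12.0;
}
```

With an assumed peak of 10,000 GFLOP/s and 500 GB/s of bandwidth, the ridge point sits at 20 FLOP/byte, so vector addition at 1/12 FLOP/byte is memory-bound and capped at roughly 41.7 GFLOP/s.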

About the author:

Hamdy egy is a Research Assistant and a Ph.D. student. He graduated from the Computer and System Engineering Department in 2012 and was ranked second in his class. After graduation, he worked as a teaching assistant in the same department for about 10 years. He also worked as an embedded systems instructor for 5 years.

Price: 1500 RUB (14.99 EUR)