An Introduction to Writing Fast GPU Code in Python
- Track: Machine Learning: Research & Applications
- Type: Talk
- Level: Intermediate
- Duration: 30 minutes
Abstract
GPUs power almost every modern Python workload in machine learning, vision, and scientific computing. Yet most Python developers treat GPUs like “faster CPUs” and hope frameworks will handle performance automatically. That mental model is wrong, and it is the main reason Python GPU code underperforms by orders of magnitude.
This introductory talk builds the correct mental model for writing fast GPU code in Python. We will explore how modern GPUs are structured, why global memory access dominates execution time, and how to determine whether a kernel is compute-bound or memory-bound using the roofline model. Through concrete examples, attendees will see why naive parallel kernels fail to scale, and which hardware-aware patterns (tiling, data reuse, and kernel fusion) actually lead to speedups.
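The roofline classification mentioned above can be sketched in a few lines of Python. The hardware numbers below are illustrative assumptions (roughly an NVIDIA A100 at FP32), not part of the talk material:

```python
# Minimal roofline check: is a kernel compute-bound or memory-bound?
# Hardware numbers are illustrative (approximately NVIDIA A100, FP32).
PEAK_FLOPS = 19.5e12                     # peak compute, FLOP/s
PEAK_BW = 1.555e12                       # peak HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW   # ~12.5 FLOP/byte

def attainable_gflops(arithmetic_intensity: float) -> float:
    """Roofline model: performance is capped either by peak compute
    or by memory bandwidth times arithmetic intensity."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity) / 1e9

# Vector add: 1 FLOP per element, 12 bytes moved (2 loads + 1 store, FP32).
ai_vector_add = 1 / 12
kind = "memory-bound" if ai_vector_add < MACHINE_BALANCE else "compute-bound"
print(f"vector add: AI = {ai_vector_add:.3f} FLOP/byte -> {kind}, "
      f"cap = {attainable_gflops(ai_vector_add):.0f} GFLOP/s")
```

With an arithmetic intensity far below the machine balance, vector add is firmly memory-bound: no amount of extra parallelism helps, only moving fewer bytes does.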
The session concludes with a practical look at how Python can express these ideas using Triton, showing how high-level code is compiled into PTX and how Python can approach near-CUDA performance when it respects hardware constraints.
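As a taste of the Triton material, the canonical vector-add kernel shows the level of abstraction involved: a sketch in the style of Triton's own tutorials, requiring the `triton` package and a CUDA GPU to actually run.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch: one program per block of 1024 elements (GPU required).
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

Triton compiles this Python function to PTX behind the scenes; because the block size and masking are explicit, the programmer keeps control over exactly the memory-access pattern that the roofline analysis says matters.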