NVIDIA GPU Architecture, CUDA, and PTX — How Modern GPU Computing Actually Works

When people talk about modern AI, high-performance computing, or accelerated graphics, the conversation almost always arrives at NVIDIA.
But the real story is not just the hardware.

It’s the layered software and execution model built around the GPU:

  • The GPU architecture itself
  • The CUDA programming platform
  • The intermediate instruction layer called PTX

Together, these form one of the most influential computing stacks of the last two decades.


From Graphics Card to Parallel Supercomputer

Originally, GPUs were designed to accelerate graphics rendering:

  • drawing pixels
  • shading polygons
  • texture processing
  • lighting calculations

These tasks are highly parallel:

thousands of small calculations happening simultaneously.

That made GPUs fundamentally different from CPUs.

A traditional CPU is designed for:

  • low latency
  • branch-heavy logic
  • sequential execution
  • operating system orchestration

A GPU is designed for:

  • massive throughput
  • vectorized operations
  • predictable workloads
  • parallel execution across thousands of cores

Over time, developers realized these same properties were ideal for:

  • scientific computing
  • simulations
  • cryptography
  • video encoding
  • machine learning
  • neural networks

That shift became known as GPGPU:

General-Purpose Computing on Graphics Processing Units.


The Core NVIDIA GPU Architecture

Modern NVIDIA GPUs are built around collections of units called Streaming Multiprocessors (SMs).

Each SM contains:

  • CUDA cores
  • schedulers
  • registers
  • shared memory
  • cache
  • Tensor Cores (Volta and later)
  • execution pipelines

Conceptually:

GPU
 ├── SM 0
 ├── SM 1
 ├── SM 2
 └── ...

Each SM executes many threads concurrently.
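
To see these building blocks on a real card, the CUDA runtime can report them. The following is a minimal sketch, assuming a CUDA toolkit is installed and that device 0 is the GPU of interest. The helper name maxResidentThreads is hypothetical and only there for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the whole GPU at once,
// i.e. (number of SMs) x (max resident threads per SM).
inline int maxResidentThreads(const cudaDeviceProp& prop) {
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}

int main() {
    cudaDeviceProp prop;
    // Device index 0 is an assumption; multi-GPU systems expose more indices.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device:                  %s\n", prop.name);
    std::printf("SM count:                %d\n", prop.multiProcessorCount);
    std::printf("Warp size:               %d\n", prop.warpSize);
    std::printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Registers per block:     %d\n", prop.regsPerBlock);
    std::printf("Max resident threads:    %d\n", maxResidentThreads(prop));
    return 0;
}
```

The numbers vary by card, which is why no sample output is shown here.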


SIMT — Single Instruction, Multiple Threads

NVIDIA’s execution model is called:

SIMT — Single Instruction, Multiple Threads

It resembles SIMD vector processing, but instead of explicit vectors, the GPU manages enormous groups of lightweight threads.

Threads are grouped into:

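  • Warps (32 threads each on NVIDIA GPUs to date)
  • Blocks (groups of warps that run on one SM and can share its shared memory)
  • Grids (all the blocks of a single kernel launch)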

The scheduler rapidly swaps between warps to hide memory latency.

If one warp stalls waiting for memory:

  • another ready warp can issue in its place, with no costly context switch, because each resident warp keeps its registers on the SM.

This is one reason GPUs achieve extraordinary throughput.
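
The warp, block, and grid grouping can be made visible from inside a kernel. This is a small sketch, not a benchmark: the kernel name whoAmI, the helper warpsPerBlock, and the launch shape (2 blocks of 96 threads) are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: how many warps a block of `threadsPerBlock` threads occupies.
// Rounds up, because a partially filled warp still takes a full warp slot.
__host__ __device__ inline int warpsPerBlock(int threadsPerBlock, int warpSz) {
    return (threadsPerBlock + warpSz - 1) / warpSz;
}

__global__ void whoAmI() {
    int lane = threadIdx.x % warpSize;  // position of this thread inside its warp
    int warp = threadIdx.x / warpSize;  // index of the warp inside the block
    if (lane == 0) {                    // one report per warp, not per thread
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
    }
}

int main() {
    const int blocks = 2;            // grid size (assumed)
    const int threadsPerBlock = 96;  // three warps of 32 threads (assumed)

    whoAmI<<<blocks, threadsPerBlock>>>();
    cudaError_t err = cudaDeviceSynchronize();  // wait so device printf is flushed
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("warps per block: %d\n", warpsPerBlock(threadsPerBlock, 32));
    return 0;
}
```

The order of the per-warp lines is not guaranteed, since the scheduler decides which warp runs when.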


CUDA — NVIDIA’s Parallel Computing Platform

In 2006, NVIDIA introduced:

CUDA — Compute Unified Device Architecture

CUDA transformed GPU programming from graphics APIs into a general software platform.

Before CUDA, developers often abused graphics pipelines using:

  • OpenGL shaders
  • DirectX shader tricks

CUDA replaced that with:

  • C/C++ style programming
  • dedicated compute kernels
  • memory management APIs
  • parallel execution control

CUDA effectively turned the GPU into:

a programmable parallel coprocessor.


CUDA Kernels

A CUDA program launches functions called:

kernels

A kernel executes across many threads simultaneously.

Example conceptual model:

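__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    // Global index: one thread per element, across every block in the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // threads past the end of the arrays do nothing
        c[i] = a[i] + b[i];
    }
}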

The same function executes thousands of times in parallel.

Each thread operates on different data.

This is the foundation of:

  • AI tensor operations
  • image processing
  • physics simulation
  • matrix multiplication
  • scientific workloads
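
A kernel on its own does nothing until the host allocates memory, copies data in, and launches it. The sketch below fills in those steps for the vector-add idea above. The array size, the 256-thread block size, and the helper names blocksFor and CUDA_CHECK are assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort with a message on any runtime API failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "%s failed: %s\n", #call,                \
                         cudaGetErrorString(e_));                         \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) c[i] = a[i] + b[i];                  // guard the final partial block
}

// Hypothetical helper: number of blocks needed to cover n elements.
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1 << 20;  // about one million elements (assumed size)
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da = nullptr, *db = nullptr, *dc = nullptr;
    CUDA_CHECK(cudaMalloc(&da, bytes));
    CUDA_CHECK(cudaMalloc(&db, bytes));
    CUDA_CHECK(cudaMalloc(&dc, bytes));
    CUDA_CHECK(cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;  // a common block size, not a tuned value
    addVectors<<<blocksFor(n, threads), threads>>>(da, db, dc, n);
    CUDA_CHECK(cudaGetLastError());  // catches launch-configuration errors

    // This copy waits for the kernel to finish before reading the result.
    CUDA_CHECK(cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost));
    std::printf("c[0] = %.1f, c[n-1] = %.1f\n", hc[0], hc[n - 1]);

    CUDA_CHECK(cudaFree(da));
    CUDA_CHECK(cudaFree(db));
    CUDA_CHECK(cudaFree(dc));
    return 0;
}
```

With inputs of 1.0 and 2.0, every element of the result should be 3.0. The bounds check matters whenever n is not a multiple of the block size.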

Memory Hierarchy Matters

GPU performance is heavily tied to memory behavior.

NVIDIA GPUs include multiple memory layers:

Memory Type      Speed       Scope
Registers        Fastest     Per-thread
Shared Memory    Very fast   Per-block
L1/L2 Cache      Fast        L1 per SM, L2 shared by the whole GPU
Global Memory    Slower      Entire GPU

Efficient CUDA code tries to:

  • minimize global memory access
  • maximize locality
  • coalesce reads
  • reduce warp divergence (threads in one warp taking different branches)

Because on GPUs:

memory movement is often more expensive than computation.
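
One way to act on that advice is to read global memory once, in a coalesced pattern, and do the remaining work in shared memory. The following is a sketch of a per-block sum built that way. The kernel name blockSum, the 256-thread block size, and the input of all ones are assumptions for illustration, and the final add on the host is a simplification.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size; must be a power of two for this reduction

// Hypothetical helper: number of blocks needed to cover n elements.
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[kThreads];  // per-block scratch space in shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Coalesced load: neighbouring threads read neighbouring addresses,
    // and each input element is fetched from global memory exactly once.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();  // every thread reaches this, so the barrier is safe
    }

    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];  // one global write per block
}

int main() {
    const int n = 1 << 16;  // assumed input size
    const int blocks = blocksFor(n, kThreads);
    std::vector<float> h(n, 1.0f), hp(blocks);

    float *d = nullptr, *dp = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dp, blocks * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(d, dp, n);
    cudaError_t err = cudaMemcpy(hp.data(), dp, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "blockSum failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float total = 0.0f;
    for (float p : hp) total += p;  // finish the small remainder on the host
    std::printf("sum = %.0f (n = %d)\n", total, n);

    cudaFree(d);
    cudaFree(dp);
    return 0;
}
```

Each block touches global memory once per element on the way in and once on the way out. Everything in between stays in the fast on-chip tier.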


Tensor Cores and AI Acceleration

Modern NVIDIA architectures introduced specialized hardware:

Tensor Cores

These accelerate matrix multiplication operations central to deep learning.

Architectures evolved rapidly:

Architecture     Notable Feature
Kepler           Early CUDA maturity
Maxwell          Efficiency improvements
Pascal           AI acceleration begins
Volta            Tensor Cores introduced
Turing           RT cores + AI inference
Ampere           Large-scale AI acceleration
Hopper           Transformer optimization
Blackwell        Massive AI/HPC scaling

Tensor cores dramatically increased:

  • AI training speed
  • inference throughput
  • FP16/BF16 computation
  • transformer performance

This is one reason OpenAI and many others rely heavily on NVIDIA hardware.
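
CUDA exposes Tensor Cores to ordinary kernels through the warp-level WMMA API in mma.h. The sketch below multiplies a single 16x16 tile with FP16 inputs and FP32 accumulation. It assumes a Volta-or-newer GPU and a matching compile flag such as -arch=sm_70 or higher. The names tileMatmul and tileOfOnes, and the all-ones test data, are assumptions for illustration. Production code would normally call a library such as cuBLAS instead.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp cooperatively computes C = A * B for a single 16x16x16 tile.
__global__ void tileMatmul(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // start the accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);        // 16 = leading dimension
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // the Tensor Core multiply-accumulate
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}

// Hypothetical wrapper: multiplies two 16x16 all-ones tiles and returns C on the host.
std::vector<float> tileOfOnes() {
    const int n = 16 * 16;
    std::vector<half> ha(n, __float2half(1.0f)), hb(n, __float2half(1.0f));
    std::vector<float> hc(n, 0.0f);

    half *da = nullptr, *db = nullptr;
    float* dc = nullptr;
    cudaMalloc(&da, n * sizeof(half));
    cudaMalloc(&db, n * sizeof(half));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(half), cudaMemcpyHostToDevice);

    tileMatmul<<<1, 32>>>(da, db, dc);  // exactly one warp: WMMA ops are warp-wide
    cudaError_t err = cudaMemcpy(hc.data(), dc, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "tileMatmul failed: %s\n", cudaGetErrorString(err));
        std::exit(1);
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return hc;
}

int main() {
    std::vector<float> c = tileOfOnes();
    // Each output element is a dot product of sixteen 1.0 values.
    std::printf("c[0] = %.1f, c[255] = %.1f\n", c[0], c[255]);
    return 0;
}
```

With all-ones inputs every output element should be 16.0, which makes the result easy to check by eye.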


What PTX Actually Is

CUDA source code is not executed directly by the GPU.

Instead, it passes through several compilation stages.

One of the most important layers is:

PTX — Parallel Thread Execution

PTX is NVIDIA’s intermediate assembly-like language.

Think of it as:

Layer       Comparable To
CUDA C++    High-level language
PTX         Virtual ISA / intermediate representation
SASS        Actual hardware machine code

PTX sits between:

  • developer code
  • final GPU instructions

Why PTX Exists

PTX provides portability across GPU generations.

Instead of compiling directly to one exact GPU model:

CUDA Source
   ↓
PTX
   ↓
Driver JIT Compiler
   ↓
Hardware-specific machine code

The NVIDIA driver performs:

  • optimization
  • scheduling
  • hardware targeting
  • instruction selection

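This allows older CUDA applications to continue working on newer GPUs, as long as the binary ships PTX: when no precompiled machine code matches the installed GPU, the driver JIT-compiles the embedded PTX instead.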
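
The source-to-PTX-to-machine-code pipeline can be driven by hand, which makes the stages concrete. The sketch below uses NVRTC to turn a CUDA source string into PTX, then hands that PTX to the driver API, whose module loader performs the JIT step. It is host-only C++ and assumes linking against the nvrtc and cuda libraries. The kernel scale, the helper compileToPtx, and the DRV_CHECK macro are hypothetical names for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>
#include <cuda.h>
#include <nvrtc.h>

#define DRV_CHECK(call)                                                       \
    do {                                                                      \
        CUresult r_ = (call);                                                 \
        if (r_ != CUDA_SUCCESS) {                                             \
            std::fprintf(stderr, "%s failed (CUresult %d)\n", #call, (int)r_); \
            std::exit(1);                                                     \
        }                                                                     \
    } while (0)

// extern "C" keeps the kernel name unmangled so it can be looked up as "scale".
static const char* kSource = R"(
extern "C" __global__ void scale(float* x, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= k;
}
)";

// Stage 1: CUDA source -> PTX text. Returns an empty string on failure.
std::string compileToPtx(const char* src) {
    nvrtcProgram prog;
    if (nvrtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS) {
        return "";
    }
    std::string ptx;
    // No options: NVRTC's default virtual architecture is used (an assumption).
    if (nvrtcCompileProgram(prog, 0, nullptr) == NVRTC_SUCCESS) {
        size_t size = 0;
        nvrtcGetPTXSize(prog, &size);  // size includes the trailing NUL
        ptx.resize(size);
        nvrtcGetPTX(prog, &ptx[0]);
    }
    nvrtcDestroyProgram(&prog);
    return ptx;
}

int main() {
    std::string ptx = compileToPtx(kSource);
    if (ptx.empty()) {
        std::fprintf(stderr, "NVRTC compilation failed\n");
        return 1;
    }
    std::printf("--- start of generated PTX ---\n%.300s\n...\n", ptx.c_str());

    // Stage 2: the driver JIT-compiles the PTX for whatever GPU is installed.
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    DRV_CHECK(cuInit(0));
    DRV_CHECK(cuDeviceGet(&dev, 0));  // device 0 is an assumption
    DRV_CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    DRV_CHECK(cuCtxSetCurrent(ctx));
    DRV_CHECK(cuModuleLoadData(&mod, ptx.c_str()));  // PTX -> machine code happens here
    DRV_CHECK(cuModuleGetFunction(&fn, mod, "scale"));

    int n = 1024;
    float k = 2.0f;
    std::vector<float> h(n, 1.5f);
    CUdeviceptr d;
    DRV_CHECK(cuMemAlloc(&d, n * sizeof(float)));
    DRV_CHECK(cuMemcpyHtoD(d, h.data(), n * sizeof(float)));

    void* args[] = {&d, &k, &n};  // pointers to each kernel argument, in order
    DRV_CHECK(cuLaunchKernel(fn, n / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr));
    DRV_CHECK(cuMemcpyDtoH(h.data(), d, n * sizeof(float)));  // synchronous copy
    std::printf("h[0] = %.1f\n", h[0]);

    cuMemFree(d);
    cuModuleUnload(mod);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

The same PTX string could be loaded on a different GPU generation without recompiling the source, which is the portability property described above.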


PTX Example

A simple PTX instruction might look like:

add.f32 %f3, %f1, %f2;

Meaning:

f3 = f1 + f2

PTX resembles assembly language, but remains:

  • virtualized
  • hardware-independent
  • forward compatible

It acts almost like a GPU-focused bytecode layer.
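
CUDA C++ also allows PTX to be embedded directly with inline assembly, which is a convenient way to try a single instruction. The sketch below wraps the add.f32 instruction from the example above. The function names addViaPtx, demo, and addOnGpu are hypothetical, and in real code a plain a + b would compile to an equivalent instruction anyway.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Emits one PTX instruction. The "f" constraint binds a 32-bit float register:
// %0 is the output operand, %1 and %2 are the inputs.
__device__ float addViaPtx(float a, float b) {
    float result;
    asm("add.f32 %0, %1, %2;" : "=f"(result) : "f"(a), "f"(b));
    return result;
}

__global__ void demo(float a, float b, float* out) {
    *out = addViaPtx(a, b);
}

// Hypothetical host wrapper: runs the single-thread kernel and returns the sum.
float addOnGpu(float a, float b) {
    float* dOut = nullptr;
    float hOut = 0.0f;
    if (cudaMalloc(&dOut, sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        std::exit(1);
    }
    demo<<<1, 1>>>(a, b, dOut);
    cudaError_t err = cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "demo kernel failed: %s\n", cudaGetErrorString(err));
        std::exit(1);
    }
    cudaFree(dOut);
    return hOut;
}

int main() {
    std::printf("1.5 + 2.25 = %.2f\n", addOnGpu(1.5f, 2.25f));
    return 0;
}
```

Both inputs are exactly representable in FP32, so the printed sum should be 3.75.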


SASS — The Real Hardware Instructions

Eventually PTX becomes:

SASS

This is the actual machine code executed by the GPU.

Unlike PTX:

  • SASS is architecture-specific
  • tightly tied to GPU generations
  • not generally portable

Developers usually work at:

  • CUDA level
  • sometimes PTX level

Very few work directly with SASS unless optimizing at extreme low levels.


CUDA’s Real Strategic Advantage

The real power of CUDA is not just the language.

It’s the ecosystem:

  • cuDNN
  • TensorRT
  • NCCL
  • CUDA-X
  • optimized AI libraries
  • scientific tooling
  • compilers
  • debuggers
  • framework integration

Frameworks like:

  • PyTorch
  • TensorFlow
  • JAX

all heavily rely on CUDA underneath.

This ecosystem lock-in became one of NVIDIA’s greatest competitive advantages.


CUDA vs PTX — The Simple Analogy

A useful mental model is:

Layer    Analogy
CUDA     Writing in C++
PTX      LLVM IR / Java bytecode
SASS     Native CPU machine code

CUDA is what developers write.

PTX is the portable intermediate representation.

SASS is what the GPU actually executes.


Why This Matters Beyond Gaming

Modern AI fundamentally depends on:

  • parallel matrix math
  • memory throughput
  • distributed compute acceleration

That means modern AI infrastructure is deeply coupled to GPU architecture.

The rise of:

  • LLMs
  • diffusion models
  • transformers
  • inference systems

has effectively turned NVIDIA GPUs into:

the compute substrate of modern AI.

CUDA and PTX are the software layers enabling that scale.


The Bigger Picture

What NVIDIA built was not merely a graphics card ecosystem.

It became:

  • a parallel computing platform
  • a software ecosystem
  • a compiler stack
  • a hardware abstraction layer
  • an AI acceleration framework

CUDA made GPU programming practical.

PTX made it portable.

The GPU architecture made it fast.

Together, they reshaped modern computing.


Note: This article was developed using AI-assisted drafting and editing tools, including ChatGPT, with human direction, review, and refinement.
