{"id":401,"date":"2026-05-06T19:41:15","date_gmt":"2026-05-06T09:41:15","guid":{"rendered":"https:\/\/www.the-bach.kiwi\/?p=401"},"modified":"2026-05-08T12:14:36","modified_gmt":"2026-05-08T02:14:36","slug":"nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works","status":"publish","type":"post","link":"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/","title":{"rendered":"NVIDIA GPU Architecture, CUDA, and PTX \u2014 How Modern GPU Computing Actually Works"},"content":{"rendered":"\n<p>When people talk about modern AI, high-performance computing, or accelerated graphics, the conversation almost always arrives at NVIDIA.<br>But the real story is not just the hardware.<\/p>\n\n\n\n<p>It\u2019s the layered software and execution model built around the GPU:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>GPU architecture<\/strong> itself<\/li>\n\n\n\n<li>The <strong>CUDA<\/strong> programming platform<\/li>\n\n\n\n<li>The intermediate instruction layer called <strong>PTX<\/strong><\/li>\n<\/ul>\n\n\n\n<p>Together, these form one of the most influential computing stacks of the last two decades.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title ez-toc-toggle\" style=\"cursor:pointer\">Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" 
height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#From-Graphics-Card-to-Parallel-Supercomputer\" >From Graphics Card to Parallel Supercomputer<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#The-Core-NVIDIA-GPU-Architecture\" >The Core NVIDIA GPU Architecture<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#SIMT-%E2%80%94-Single-Instruction-Multiple-Threads\" >SIMT \u2014 Single Instruction, Multiple Threads<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" 
href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#CUDA-%E2%80%94-NVIDIAs-Parallel-Computing-Platform\" >CUDA \u2014 NVIDIA\u2019s Parallel Computing Platform<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#CUDA-Kernels\" >CUDA Kernels<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#Memory-Hierarchy-Matters\" >Memory Hierarchy Matters<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#Tensor-Cores-and-AI-Acceleration\" >Tensor Cores and AI Acceleration<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#What-PTX-Actually-Is\" >What PTX Actually Is<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#Why-PTX-Exists\" >Why PTX Exists<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#PTX-Example\" >PTX Example<\/a><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#SASS-%E2%80%94-The-Real-Hardware-Instructions\" >SASS \u2014 The Real Hardware Instructions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#CUDAs-Real-Strategic-Advantage\" >CUDA\u2019s Real Strategic Advantage<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#CUDA-vs-PTX-%E2%80%94-The-Simple-Analogy\" >CUDA vs PTX \u2014 The Simple Analogy<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#Why-This-Matters-Beyond-Gaming\" >Why This Matters Beyond Gaming<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/#The-Bigger-Picture\" >The Bigger Picture<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"From-Graphics-Card-to-Parallel-Supercomputer\"><\/span>From Graphics Card to Parallel Supercomputer<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Originally, GPUs were designed to accelerate graphics rendering:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>drawing pixels<\/li>\n\n\n\n<li>shading polygons<\/li>\n\n\n\n<li>texture 
processing<\/li>\n\n\n\n<li>lighting calculations<\/li>\n<\/ul>\n\n\n\n<p>These tasks are highly parallel:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>thousands of small calculations happening simultaneously.<\/p>\n<\/blockquote>\n\n\n\n<p>That made GPUs fundamentally different from CPUs.<\/p>\n\n\n\n<p>A traditional CPU is designed for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>low latency<\/li>\n\n\n\n<li>branch-heavy logic<\/li>\n\n\n\n<li>sequential execution<\/li>\n\n\n\n<li>operating system orchestration<\/li>\n<\/ul>\n\n\n\n<p>A GPU is designed for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>massive throughput<\/li>\n\n\n\n<li>vectorized operations<\/li>\n\n\n\n<li>predictable workloads<\/li>\n\n\n\n<li>parallel execution across thousands of cores<\/li>\n<\/ul>\n\n\n\n<p>Over time, developers realized these same properties were ideal for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>scientific computing<\/li>\n\n\n\n<li>simulations<\/li>\n\n\n\n<li>cryptography<\/li>\n\n\n\n<li>video encoding<\/li>\n\n\n\n<li>machine learning<\/li>\n\n\n\n<li>neural networks<\/li>\n<\/ul>\n\n\n\n<p>That shift became known as <strong>GPGPU<\/strong>:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>General Purpose GPU Computing.<\/em><\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The-Core-NVIDIA-GPU-Architecture\"><\/span>The Core NVIDIA GPU Architecture<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Modern NVIDIA GPUs are built around collections of units called <strong>Streaming Multiprocessors (SMs)<\/strong>.<\/p>\n\n\n\n<p>Each SM contains:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CUDA cores<\/li>\n\n\n\n<li>schedulers<\/li>\n\n\n\n<li>registers<\/li>\n\n\n\n<li>shared memory<\/li>\n\n\n\n<li>cache<\/li>\n\n\n\n<li>tensor 
hardware<\/li>\n\n\n\n<li>execution pipelines<\/li>\n<\/ul>\n\n\n\n<p>Conceptually:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>GPU\n \u251c\u2500\u2500 SM 0\n \u251c\u2500\u2500 SM 1\n \u251c\u2500\u2500 SM 2\n \u2514\u2500\u2500 ...\n<\/code><\/pre>\n\n\n\n<p>Each SM executes many threads concurrently.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"SIMT-%E2%80%94-Single-Instruction-Multiple-Threads\"><\/span>SIMT \u2014 Single Instruction, Multiple Threads<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>NVIDIA\u2019s execution model is called:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>SIMT<\/strong> \u2014 Single Instruction, Multiple Threads<\/p>\n<\/blockquote>\n\n\n\n<p>It resembles SIMD vector processing, but instead of explicit vectors, the GPU manages enormous groups of lightweight threads.<\/p>\n\n\n\n<p>Threads are grouped into:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Warps<\/strong> (typically 32 threads)<\/li>\n\n\n\n<li><strong>Blocks<\/strong><\/li>\n\n\n\n<li><strong>Grids<\/strong><\/li>\n<\/ul>\n\n\n\n<p>The scheduler rapidly swaps between warps to hide memory latency.<\/p>\n\n\n\n<p>If one warp stalls waiting for memory:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>another warp executes immediately.<\/li>\n<\/ul>\n\n\n\n<p>This is one reason GPUs achieve extraordinary throughput.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"CUDA-%E2%80%94-NVIDIAs-Parallel-Computing-Platform\"><\/span>CUDA \u2014 NVIDIA\u2019s Parallel Computing Platform<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>In 2006, NVIDIA introduced:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>CUDA<\/strong> \u2014 
Compute Unified Device Architecture<\/p>\n<\/blockquote>\n\n\n\n<p>CUDA transformed GPU programming from graphics APIs into a general software platform.<\/p>\n\n\n\n<p>Before CUDA, developers often abused graphics pipelines using:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OpenGL shaders<\/li>\n\n\n\n<li>DirectX shader tricks<\/li>\n<\/ul>\n\n\n\n<p>CUDA replaced that with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>C\/C++-style programming<\/li>\n\n\n\n<li>dedicated compute kernels<\/li>\n\n\n\n<li>memory management APIs<\/li>\n\n\n\n<li>parallel execution control<\/li>\n<\/ul>\n\n\n\n<p>CUDA effectively turned the GPU into:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>a programmable parallel coprocessor.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"CUDA-Kernels\"><\/span>CUDA Kernels<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>A CUDA program launches functions called:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>kernels<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>A kernel executes across many threads simultaneously.<\/p>\n\n\n\n<p>Example conceptual model:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>__global__ void addVectors(const float* a, const float* b, float* c, int n) {\n    \/\/ Global index: block offset plus position within the block\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    if (i &lt; n) {  \/\/ skip threads past the end of the arrays\n        c&#91;i] = a&#91;i] + b&#91;i];\n    }\n}\n<\/code><\/pre>\n\n\n\n<p>The same function runs once per thread, thousands of times in parallel.<\/p>\n\n\n\n<p>Each thread operates on different data.<\/p>\n\n\n\n<p>This is the foundation of:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI tensor operations<\/li>\n\n\n\n<li>image processing<\/li>\n\n\n\n<li>physics simulation<\/li>\n\n\n\n<li>matrix multiplication<\/li>\n\n\n\n<li>scientific workloads<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 
class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Memory-Hierarchy-Matters\"><\/span>Memory Hierarchy Matters<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>GPU performance is heavily tied to memory behavior.<\/p>\n\n\n\n<p>NVIDIA GPUs include multiple memory layers:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Memory Type<\/th><th>Relative Speed<\/th><th>Scope<\/th><\/tr><\/thead><tbody><tr><td>Registers<\/td><td>Fastest<\/td><td>Per-thread<\/td><\/tr><tr><td>Shared Memory<\/td><td>Very fast<\/td><td>Per-block<\/td><\/tr><tr><td>L1\/L2 Cache<\/td><td>Fast<\/td><td>Per-SM (L1), GPU-wide (L2)<\/td><\/tr><tr><td>Global Memory<\/td><td>Slowest<\/td><td>Entire GPU<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Efficient CUDA code tries to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>minimize global memory access<\/li>\n\n\n\n<li>maximize locality<\/li>\n\n\n\n<li>coalesce reads and writes<\/li>\n\n\n\n<li>reduce warp divergence<\/li>\n<\/ul>\n\n\n\n<p>Because on GPUs:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>memory movement is often more expensive than computation.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Tensor-Cores-and-AI-Acceleration\"><\/span>Tensor Cores and AI Acceleration<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Modern NVIDIA architectures introduced specialized hardware:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Tensor Cores<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>These accelerate matrix multiplication operations central to deep learning.<\/p>\n\n\n\n<p>Architectures evolved rapidly:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Architecture<\/th><th>Notable 
Feature<\/th><\/tr><\/thead><tbody><tr><td>Kepler<\/td><td>Early CUDA maturity<\/td><\/tr><tr><td>Maxwell<\/td><td>Efficiency improvements<\/td><\/tr><tr><td>Pascal<\/td><td>AI acceleration begins<\/td><\/tr><tr><td>Volta<\/td><td>Tensor Cores introduced<\/td><\/tr><tr><td>Turing<\/td><td>RT cores + AI inference<\/td><\/tr><tr><td>Ampere<\/td><td>Large-scale AI acceleration<\/td><\/tr><tr><td>Hopper<\/td><td>Transformer optimization<\/td><\/tr><tr><td>Blackwell<\/td><td>Massive AI\/HPC scaling<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Tensor cores dramatically increased:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI training speed<\/li>\n\n\n\n<li>inference throughput<\/li>\n\n\n\n<li>FP16\/BF16 computation<\/li>\n\n\n\n<li>transformer performance<\/li>\n<\/ul>\n\n\n\n<p>This is one reason OpenAI and many others rely heavily on NVIDIA hardware.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What-PTX-Actually-Is\"><\/span>What PTX Actually Is<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>CUDA source code is not executed directly by the GPU.<\/p>\n\n\n\n<p>Instead, it passes through several compilation stages.<\/p>\n\n\n\n<p>One of the most important layers is:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>PTX<\/strong> \u2014 Parallel Thread Execution<\/p>\n<\/blockquote>\n\n\n\n<p>PTX is NVIDIA\u2019s intermediate assembly-like language.<\/p>\n\n\n\n<p>Think of it as:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Layer<\/th><th>Comparable To<\/th><\/tr><\/thead><tbody><tr><td>CUDA C++<\/td><td>High-level language<\/td><\/tr><tr><td>PTX<\/td><td>Virtual ISA \/ intermediate representation<\/td><\/tr><tr><td>SASS<\/td><td>Actual hardware machine code<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>PTX sits between:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>developer code<\/li>\n\n\n\n<li>final GPU instructions<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Why-PTX-Exists\"><\/span>Why PTX Exists<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>PTX provides portability across GPU generations.<\/p>\n\n\n\n<p>Instead of compiling only to machine code for one exact GPU model, CUDA code can be compiled to PTX and finalized by the driver at load time:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CUDA Source\n   \u2193\nPTX\n   \u2193\nDriver JIT Compiler\n   \u2193\nHardware-specific machine code\n<\/code><\/pre>\n\n\n\n<p>When it JIT-compiles PTX, the NVIDIA driver performs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>optimization<\/li>\n\n\n\n<li>instruction scheduling<\/li>\n\n\n\n<li>hardware targeting<\/li>\n\n\n\n<li>instruction selection<\/li>\n<\/ul>\n\n\n\n<p>This allows older CUDA applications to continue working on newer GPUs, provided they ship PTX alongside their precompiled binaries.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"PTX-Example\"><\/span>PTX Example<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>A simple PTX instruction might look like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>add.f32 %f3, %f1, %f2;\n<\/code><\/pre>\n\n\n\n<p>Meaning (a 32-bit floating-point add):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>f3 = f1 + f2\n<\/code><\/pre>\n\n\n\n<p>PTX resembles assembly language, but remains:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>virtualized<\/li>\n\n\n\n<li>hardware-independent<\/li>\n\n\n\n<li>forward compatible<\/li>\n<\/ul>\n\n\n\n<p>It acts almost like a GPU-focused bytecode layer.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"SASS-%E2%80%94-The-Real-Hardware-Instructions\"><\/span>SASS \u2014 The Real Hardware Instructions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Eventually PTX 
becomes:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>SASS<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>This is the actual machine code executed by the GPU.<\/p>\n\n\n\n<p>Unlike PTX:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SASS is architecture-specific<\/li>\n\n\n\n<li>tightly tied to GPU generations<\/li>\n\n\n\n<li>not generally portable<\/li>\n<\/ul>\n\n\n\n<p>Developers usually work at:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CUDA level<\/li>\n\n\n\n<li>sometimes PTX level<\/li>\n<\/ul>\n\n\n\n<p>Very few work directly with SASS unless optimizing at extreme low levels.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"CUDAs-Real-Strategic-Advantage\"><\/span>CUDA\u2019s Real Strategic Advantage<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The real power of CUDA is not just the language.<\/p>\n\n\n\n<p>It\u2019s the ecosystem:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>cuDNN<\/li>\n\n\n\n<li>TensorRT<\/li>\n\n\n\n<li>NCCL<\/li>\n\n\n\n<li>CUDA-X<\/li>\n\n\n\n<li>optimized AI libraries<\/li>\n\n\n\n<li>scientific tooling<\/li>\n\n\n\n<li>compilers<\/li>\n\n\n\n<li>debuggers<\/li>\n\n\n\n<li>framework integration<\/li>\n<\/ul>\n\n\n\n<p>Frameworks like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>PyTorch<\/li>\n\n\n\n<li>TensorFlow<\/li>\n\n\n\n<li>JAX<\/li>\n<\/ul>\n\n\n\n<p>all heavily rely on CUDA underneath.<\/p>\n\n\n\n<p>This ecosystem lock-in became one of NVIDIA\u2019s greatest competitive advantages.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"CUDA-vs-PTX-%E2%80%94-The-Simple-Analogy\"><\/span>CUDA vs PTX \u2014 The Simple Analogy<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>A useful mental model is:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table 
class=\"has-fixed-layout\"><thead><tr><th>Layer<\/th><th>Analogy<\/th><\/tr><\/thead><tbody><tr><td>CUDA<\/td><td>Writing in C++<\/td><\/tr><tr><td>PTX<\/td><td>LLVM IR \/ Java bytecode<\/td><\/tr><tr><td>SASS<\/td><td>Native CPU machine code<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>CUDA is what developers write.<\/p>\n\n\n\n<p>PTX is the portable intermediate representation.<\/p>\n\n\n\n<p>SASS is what the GPU actually executes.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Why-This-Matters-Beyond-Gaming\"><\/span>Why This Matters Beyond Gaming<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Modern AI fundamentally depends on:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>parallel matrix math<\/li>\n\n\n\n<li>memory throughput<\/li>\n\n\n\n<li>distributed compute acceleration<\/li>\n<\/ul>\n\n\n\n<p>That means modern AI infrastructure is deeply coupled to GPU architecture.<\/p>\n\n\n\n<p>The rise of:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>LLMs<\/li>\n\n\n\n<li>diffusion models<\/li>\n\n\n\n<li>transformers<\/li>\n\n\n\n<li>inference systems<\/li>\n<\/ul>\n\n\n\n<p>has effectively turned NVIDIA GPUs into:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>the compute substrate of modern AI.<\/p>\n<\/blockquote>\n\n\n\n<p>CUDA and PTX are the software layers enabling that scale.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The-Bigger-Picture\"><\/span>The Bigger Picture<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>What NVIDIA built was not merely a graphics card ecosystem.<\/p>\n\n\n\n<p>It became:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>a parallel computing platform<\/li>\n\n\n\n<li>a software ecosystem<\/li>\n\n\n\n<li>a compiler stack<\/li>\n\n\n\n<li>a hardware 
abstraction layer<\/li>\n\n\n\n<li>an AI acceleration framework<\/li>\n<\/ul>\n\n\n\n<p>CUDA made GPU programming practical.<\/p>\n\n\n\n<p>PTX made it portable.<\/p>\n\n\n\n<p>The GPU architecture made it fast.<\/p>\n\n\n\n<p>Together, they reshaped modern computing.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><sub><sup>Note: This article was developed using AI-assisted drafting and editing tools, including ChatGPT, with human direction, review, and refinement.<\/sup><\/sub><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When people talk about modern AI, high-performance computing, or accelerated graphics, the conversation almost always arrives at NVIDIA. But the real story is not just the hardware. It\u2019s the layered software and execution model built around the GPU: Together, these form one of the most influential computing stacks of the last two decades. From Graphics Card &#8230; <a title=\"NVIDIA GPU Architecture, CUDA, and PTX \u2014 How Modern GPU Computing Actually Works\" class=\"read-more\" href=\"https:\/\/www.the-bach.kiwi\/index.php\/2026\/05\/06\/nvidia-gpu-architecture-cuda-and-ptx-how-modern-gpu-computing-actually-works\/\" aria-label=\"Read more about NVIDIA GPU Architecture, CUDA, and PTX \u2014 How Modern GPU Computing Actually Works\">Read 
more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[17,26,27,28,30,29],"class_list":["post-401","post","type-post","status-publish","format-standard","hentry","category-skunkworks","tag-ai","tag-cuda","tag-gpu","tag-nvidia","tag-parallel-computing","tag-ptx"],"_links":{"self":[{"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/posts\/401","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/comments?post=401"}],"version-history":[{"count":2,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/posts\/401\/revisions"}],"predecessor-version":[{"id":416,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/posts\/401\/revisions\/416"}],"wp:attachment":[{"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/media?parent=401"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/categories?post=401"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-bach.kiwi\/index.php\/wp-json\/wp\/v2\/tags?post=401"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}