Rendered at 11:30:02 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
kinow 19 hours ago [-]
I just finished a master's on HPC where I had to take some classes on CUDA, MPI+CUDA, OpenCL. Reading an article like this before the classes would have been a lot helpful! Especially the part just before and after "What does it mean for a warp to be eligible?".
brcmthrowaway 9 hours ago [-]
What university?
kinow 4 hours ago [-]
It is a joint programme by Universities Santiago de Compostela and Coruña. I attended it from Barcelona, but they are in Galicia, Spain.
That was an interesting read. Also enjoyed reading about the semaphores in the default stream. It's great that cuda implicitly handles syncing of commands for users and makes parallel commands optional and opt-in via streams, unlike Vulkan which completely unloads the full complexity of syncing to users right from the start.
It's very useful. The doorbell and QMD part were the most useful for me, because it connects the CUDA launch syntax to what actually gets submitted to the GPU. Most explanations stop around kernels, blocks and warps, but this made the CPU to driver to GPU path much easier to follow.
orliesaurus 20 hours ago [-]
There are companies whose whole job right now is to optimize kernels so that things run faster. I wonder if those companies are going to be dethroned by some sort of like open source library that can do that really well (I bet Nvidia could release it any day.).. or if they're going to thrive and be acquired by the big providers as a `moat` to speed up their infrerence.
spmurrayzzz 19 hours ago [-]
Near-term acquihires are certainly a likely bet I think. But given model progress on related benchmarks like kernelbench [1], I do think a set of more commoditized solutions is also inevitable.
The caveat though is that each new gen of hardware often comes with brand new constraints/features that a given generation of models haven't seen before (e.g. tcgen05 in blackwell was OOD at one point). As the models start to generalize better, this might not be a showstopper, but still an issue at least currently.
Thank you for sharing the link. It's fascinating that the models can only do 10-20% on the hard subset, and I wonder why that is so. The fact that they can only get 30-40% out of the fp8 GEMM seems unintuitive to me, I would've expected a convergence near ~80%.
spmurrayzzz 14 hours ago [-]
I'm not entirely up to date with the latest batch, but I've reviewed some of the rollouts in the past and my sense is that the models are surprisingly good at getting correct custom kernels in the happy path, but still weak at sustained/shape-robust workloads. Having to deal with writing the full path from scratch compounded by weird memory layouts, odd sizes, routing, unpacking quantized weights, etc. is definitely challenging.
Also, at least a portion of this you could argue is arbitrary and entirely scoped to the eval itself. The fp8 GEMM score could be low simply because one of the shapes is fairly skinny (i.e. not enough math work to keep the compute engine busy for a meaningful amount of time).
connicpu 18 hours ago [-]
When you run CUDA at scale dealing with nvidia driver and library bugs takes up a disgustingly large percentage of engineer time, I don't know a lot of people who would be looking forward to rely on more nvidia libraries.
david-gpu 14 hours ago [-]
How do you determine that the bugs you run into are located in the Nvidia drivers and libraries?
Way back when I wrote the OpenCL driver at Qualcomm, we would frequently get bug reports from customers complaining about our code. During my tenure, every single one of them was root-caused as an application bug. Unsurprisingly, considering that our code was backed by an extensive test suite and their code wasn't.
Not to say that our code was perfect, of course. But people have a tendency to blame GPU drivers when the problem often lies elsewhere.
Athas 13 hours ago [-]
I have never used Qualcomm's OpenCL driver, but it is not unknown to get the NVIDIA driver into a state where some kernel is stuck in a running state, or some memory is allocated long after the originating process has terminated. This is usually down to application bugs, sure - but no application bug should be able to wedge the driver. While developing GPU kernels, the code will certainly be buggy, and hence the driver should be robust. For that matter, maybe I am running untrusted GPU code, and anytime the driver gets in a weird or stuck state, I am uneasy that it might not be many steps away from an exploitable situation. We don't accept this in CPU operating systems, so why should it be acceptable for GPUs? We are talking unprivileged code - nothing runs as root. Ever since I first got into GPGPU programming (about 2012), I noticed that they were far less robust in the face of buggy code than I was accustomed to.
It is also common in my experience for buggy GPU code to crash displays if the GPU is simultaneously used to drive a monitor. This usually happens for kernels that go into infinite loops, or out-of-memory conditions.
It is my understanding that modern GPU drivers even have watchdog systems that notice when they get stuck and forcibly reboot them, which to me is mere symptom treatment.
david-gpu 9 hours ago [-]
I understand the frustration of these unmet expectations. There are good technical reasons why each of these things don't work the way you would like them to. E.g. adding preemption to GPUs is doable, but it is not cheap, and simply killing the task that is hogging the GPU is often the more practical and expeditious way to go.
connicpu 14 hours ago [-]
If you're big enough you can get direct access to Nvidia engineers, and they are usually transparent when they find out the bug was in their software and send you a patched version to try to resolve the issue
AlotOfReading 10 hours ago [-]
And when the bug is in hardware instead, good luck getting them to admit it. Instead they go "We can't confirm that there's an issue, but it'll be gone in the next revision and meanwhile don't use that instruction sequence".
david-gpu 10 hours ago [-]
That is largely what drivers do: work around hardware bugs. It's the industry's dirty secret.
To be fair: the hardware is enormously complex, and the drivers much less so.
saagarjha 13 hours ago [-]
The alternative is you spend even more time on a competitor's system dealing with their bugs
orliesaurus 17 hours ago [-]
fair point, but are there alternatives that aren't CUDA locked?
whattheheckheck 17 hours ago [-]
Is there an issue board for these bugs? I want to see what is a disgustingly large percent. 50%?
einpoklum 18 hours ago [-]
Probably not, because the specifics of the workload - exact parameters, representation of data in memory, value ranges etc - lead you to highly divergent optimization strategies.
orliesaurus 17 hours ago [-]
shouldn't it be possible to be run as a mlautoresearch project?
i.e. orchestrate 10 strategies to speed it up, run in paralellel, pick the winning and go from there?
saagarjha 13 hours ago [-]
No, because all the low hanging fruit that this kind of thing would find has usually been picked.
einpoklum 15 hours ago [-]
You are assuming all problems in the world are solvable by one of "10 strategies".
charcircuit 1 hours ago [-]
He is not assuming that as he includes "and go from there."
keynha 11 hours ago [-]
[flagged]
saagarjha 13 hours ago [-]
Control codes are a little more complicated than the post describes, they're really a table lookup rather than just bits in the control word.
einpoklum 21 hours ago [-]
First - nice writeup which goes into a lot of nooks and crannies.
That said, a lot of the user-space "voodoo" is gone if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string and compile it with NVIDIA's run-time compiler, you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:
that's a sample program for my CUDA API wrappers (header-only) library.
mschuetz 21 hours ago [-]
I like the driver API because it allows treating Cuda kernels like hot-reloadable shaders. It's fun to develop while being able to change the code at runtime.
einpoklum 18 hours ago [-]
> I like the driver API because it allows treating Cuda kernels like hot-reloadable shaders.
It is also much more friendly for library authors; and easier to wrap; and actually exposes a bunch of features the "runtime API" doesn't.
The difficulty with it is that there just so many API calls; dozens of calls just for copying, for example. That was part of my motivation for writing my wrappers - making the supposedly "lower-level" API more accessible and intuitive than the supposedly "higher-level" API; and better integrated with the other libraries: NVTX, NVRTC, PTX compiler, fatbin library etc.
> It's fun to develop while being able to change the code at runtime.
It's also _the_ way to debug your kernels: If you don't load them dynamically, you have to recompile your application or kernel test harness every time you make a change to the kernel.
b112 5 hours ago [-]
Nice post. A note if the author is about, I had to use ublock origin to remove the header, as it hid text at the start of each page when printing. Firefox.
(I prefer to read longer articles on my e-ink device via epub or PDF)
https://www.usc.gal/en/studies/masters/engineering-and-archi...
The caveat though is that each new gen of hardware often comes with brand new constraints/features that a given generation of models haven't seen before (e.g. tcgen05 in blackwell was OOD at one point). As the models start to generalize better, this might not be a showstopper, but still an issue at least currently.
[1] https://kernelbench.com/
Also, at least a portion of this you could argue is arbitrary and entirely scoped to the eval itself. The fp8 GEMM score could be low simply because one of the shapes is fairly skinny (i.e. not enough math work to keep the compute engine busy for a meaningful amount of time).
Way back when I wrote the OpenCL driver at Qualcomm, we would frequently get bug reports from customers complaining about our code. During my tenure, every single one of them was root-caused as an application bug. Unsurprisingly, considering that our code was backed by an extensive test suite and their code wasn't.
Not to say that our code was perfect, of course. But people have a tendency to blame GPU drivers when the problem often lies elsewhere.
It is also common in my experience for buggy GPU code to crash displays if the GPU is simultaneously used to drive a monitor. This usually happens for kernels that go into infinite loops, or out-of-memory conditions.
It is my understanding that modern GPU drivers even have watchdog systems that notice when they get stuck and forcibly reboot them, which to me is mere symptom treatment.
To be fair: the hardware is enormously complex, and the drivers much less so.
That said, a lot of the user-space "voodoo" is gone if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string and compile it with NVIDIA's run-time compiler, you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:
https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...
but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:
https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
that's a sample program for my CUDA API wrappers (header-only) library.
It is also much more friendly for library authors; and easier to wrap; and actually exposes a bunch of features the "runtime API" doesn't.
The difficulty with it is that there just so many API calls; dozens of calls just for copying, for example. That was part of my motivation for writing my wrappers - making the supposedly "lower-level" API more accessible and intuitive than the supposedly "higher-level" API; and better integrated with the other libraries: NVTX, NVRTC, PTX compiler, fatbin library etc.
> It's fun to develop while being able to change the code at runtime.
It's also _the_ way to debug your kernels: If you don't load them dynamically, you have to recompile your application or kernel test harness every time you make a change to the kernel.
(I prefer to read longer articles on my e-ink device via epub or PDF)