CUDA kernel development, debugging, and performance optimization for Claude Code. Use when writing, debugging, or optimizing CUDA code, GPU kernels, or parallel algorithms. Covers non-interactive profiling with nsys/ncu, debugging with cuda-gdb/compute-sanitizer, binary inspection with cuobjdump, and performance analysis workflows. Triggers on CUDA, GPU programming, kernel optimization, nsys, ncu, cuda-gdb, compute-sanitizer, PTX, GPU profiling, parallel performance.
Measure before guessing. GPU performance is deeply counterintuitive. Profile first, hypothesize second, change third, verify fourth.
Small, isolated changes. CUDA bugs compound. Make one change, test it, commit it. Resist the urge to "fix everything at once."
printf is your strongest tool. When debuggers fail, when tools produce inscrutable output, printf in device code reveals truth. Don't be embarrassed to use it extensively.
Sometimes, stare at the diff. Inscrutable segfaults are common. Tools often don't help. The human approach: minimize the diff, read it carefully, see the bug. This is legitimate and often faster than tooling.
printf in device code to trace execution
compute-sanitizer --tool memcheck ./your_program
compute-sanitizer --tool racecheck ./your_program # for race conditions
compute-sanitizer --tool initcheck ./your_program # uninitialized memory
cuda-gdb -batch -ex "run" -ex "bt" ./your_program
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {  // Limit output to a single thread
        printf("Kernel launched, n=%d, data[0]=%f\n", n, data[0]);
    }
    // ... kernel logic (computes someValue) ...
    if (idx < 10) {  // Sample a few threads
        printf("Thread %d: result=%f\n", idx, someValue);
    }
}
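The snippet above omits the host side, so here is a minimal, self-contained sketch of the same printf pattern that can be compiled and run as-is. The names `scaleKernel`, `check`, and `runScaleDemo` are hypothetical and used only for illustration, as is the doubling operation standing in for the kernel logic. The point it shows is that device-side printf output is buffered and only appears once the host synchronizes, for example with `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example kernel: doubles each element and traces a few threads.
__global__ void scaleKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {  // One thread announces the launch
        printf("Kernel launched, n=%d\n", n);
    }
    if (idx >= n) return;  // Bounds guard before touching data
    float result = 2.0f * data[idx];
    data[idx] = result;
    if (idx < 4) {  // Sample a few threads only
        printf("Thread %d: result=%f\n", idx, result);
    }
}

// Hypothetical helper: abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical driver: fills 0..n-1, runs the kernel, returns the results.
std::vector<float> runScaleDemo(int n) {
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float* dev = nullptr;
    check(cudaMalloc(&dev, bytes), "cudaMalloc");
    check(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice),
          "cudaMemcpy H2D");

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(dev, n);
    check(cudaGetLastError(), "kernel launch");

    // Device printf is buffered; synchronizing flushes it to stdout.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    check(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost),
          "cudaMemcpy D2H");
    check(cudaFree(dev), "cudaFree");
    return host;
}

int main() {
    std::vector<float> out = runScaleDemo(1024);
    printf("host[3] = %f\n", out[3]);
    return 0;
}
```

Assuming the file is saved as `printf_demo.cu` (an arbitrary name), it should build with `nvcc -o printf_demo printf_demo.cu`, and the resulting binary can be passed to the compute-sanitizer and cuda-gdb commands listed earlier in place of `./your_program`. The relative order of lines printed by different threads is not guaranteed.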
Key patterns:
if (idx == 0) or if (idx < N) to avoid output flood