Two bar charts showing nvidia-smi reporting 97% GPU utilization while actual kernel activity measured via eBPF is about 25%, exposing the utilization paradox
eBPF, GPU Debugging, GPU Observability, MLOps

nvidia-smi Reports 97% Utilization While the GPU Sits Idle

nvidia-smi reported 97% GPU utilization, yet training throughput was a fraction of what benchmarks promised. CUDA API tracing and kernel scheduling data showed what the GPU was actually doing during that 97%.
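To make "actual kernel activity" concrete, here is a minimal sketch of one way to turn traced kernel start and end timestamps into a busy fraction for a time window. It is not the tooling used in this post: the `busy_fraction` helper and the sample intervals are hypothetical, and the timestamps are made up purely for illustration.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (start, end) of one kernel, in any consistent time unit


def busy_fraction(intervals: Iterable[Interval], window_start: float, window_end: float) -> float:
    """Return the fraction of the window covered by at least one kernel interval."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")

    # Clip every interval to the window and drop the ones that fall outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in intervals
        if end > window_start and start < window_end
    )

    busy = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged run.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start

    return busy / (window_end - window_start)


# Hypothetical trace: three kernels inside a 100 ms window (times in ms).
sample = [(0, 10), (5, 20), (60, 65)]
print(f"busy fraction: {busy_fraction(sample, 0, 100):.2f}")
```

Overlapping kernels are merged before summing so that concurrent launches are not counted twice. A figure computed this way can then be set next to the number nvidia-smi reports for the same window.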