[Figure: a row of memory blocks, with a red CUDA out-of-memory block among teal allocated blocks and dark free holes, showing fragmentation despite 60 percent utilization.]
eBPF, GPU Debugging, GPU Observability, MLOps

CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation

PyTorch kept crashing with CUDA out-of-memory errors at 60% GPU memory utilization, even though nvidia-smi showed plenty of free memory. Tracing every cudaMalloc and cudaFree call with eBPF exposed the real cause: memory fragmentation from misaligned allocation patterns.
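To make the failure mode concrete, here is a minimal toy model in plain Python of how an allocation can fail while 40% of memory is free. It is only a sketch: the `Block` type, the `summarize` and `can_allocate` helpers, and the 60 MB / 40 MB layout are all hypothetical illustrations, not PyTorch's caching allocator or the CUDA driver's actual behavior.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One contiguous region of a toy GPU address space (hypothetical model)."""
    size_mb: int
    free: bool


def summarize(blocks):
    """Return total usage plus the largest contiguous free region."""
    total = sum(b.size_mb for b in blocks)
    free = sum(b.size_mb for b in blocks if b.free)
    largest = run = 0
    for b in blocks:
        # Adjacent free blocks form one hole; an allocated block ends the run.
        run = run + b.size_mb if b.free else 0
        largest = max(largest, run)
    return {
        "total_mb": total,
        "used_pct": 100 * (total - free) / total,
        "free_mb": free,
        "largest_free_mb": largest,
    }


def can_allocate(blocks, request_mb):
    """A request needs one contiguous hole, not just enough total free memory."""
    return summarize(blocks)["largest_free_mb"] >= request_mb


# Assumed layout: ten 60 MB live allocations, each followed by a 40 MB hole.
blocks = []
for _ in range(10):
    blocks.append(Block(60, free=False))
    blocks.append(Block(40, free=True))

stats = summarize(blocks)
print(stats)
print("100 MB request fits:", can_allocate(blocks, 100))
```

In this toy layout 400 MB is free in total, yet no single hole exceeds 40 MB, so a 100 MB request cannot be placed. A tool that reports only aggregate free memory cannot show this gap between total free and largest contiguous free.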