GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds
3am page: GPU training pipeline missed its SLA. Datadog showed 95% GPU utilization, nvidia-smi agreed, everything looked green. Kernel-level tracing told a different story, and the root cause was visible in 60 seconds.

