TL;DR
3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with DataLoader workers, starving the GPU. A
tasksetfix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute.
The 3am Page Every GPU SRE Dreads
GPU incident response starts with a page that makes no sense. PagerDuty fires:
[CRITICAL] GPU Training Pipeline SLA Breached
Cluster: prod-gpu-01 (8x H100)
Job: nightly-retraining-v3
Expected completion: 02:00 UTC
Current status: 47% complete at 03:12 UTCThe monitoring stack:
Datadog GPU Dashboard:
GPU Utilization: 95% ✅
GPU Memory: 78% ✅
GPU Temperature: 72°C ✅
Power Draw: 680W ✅Grafana (DCGM Exporter):
dcgm_gpu_utilization: 0.95 ✅
dcgm_fb_used: 62GB ✅
dcgm_sm_clock: 1980MHz ✅nvidia-smi:
+-------------------------------------------+
| GPU Name | GPU-Util | Memory-Usage |
|===============+==========+=================|
| 0 H100 SXM | 97% | 62000MiB / 80GB |
+-------------------------------------------+Every single dashboard says the GPU is fine. A breached SLA and zero signal to work with.
This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:
- Wake the ML engineer (who’ll spend 2 hours adding print statements)
- Restart the job and hope it goes faster (it won’t)
- Stare at dashboards that all say green
Why GPU Dashboards Lie to SREs
Every GPU monitoring tool in the stack (Datadog, Grafana, DCGM, nvidia-smi) reports the same underlying metric: “did the GPU have at least one kernel scheduled?”
That metric is useless for diagnosis. It’s like monitoring a restaurant by checking “is someone sitting at each table?” without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and the dashboard would still say “97% utilized.”
The real problems that cause GPU SLA breaches are host-side:
- CPU scheduling contention starving the data pipeline
- DataLoader workers preempted by monitoring agents (ironic)
- Memory pressure causing page faults in the data loading path
- Disk I/O bottlenecks blocking the next training batch
- Network retransmits stalling distributed training
These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. GPU dashboards are structurally blind to the most common causes of GPU performance degradation.
60-Second GPU Incident Response with eBPF
The tracer is eBPF-based and captures both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.
It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.
Here’s what incident response looks like:
Step 1: Get the causal chain (10 seconds)
$ ingero explain --since 1hSystem Context:
CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB
Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
Root: 14,504 context switches on training process (PID 3821)
Process off-CPU 62 of 120 seconds (51.7% of wall clock)
Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
CUDA op throughput dropped 47% from peak
Contributing: 4 DataLoader workers + prometheus-node-exporter
+ fluent-bit competing for 8 cores
Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py
set DataLoader persistent_workers=True
nice -n 19 monitoring agentsThere it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.
nvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed.
Step 2: Identify the culprits (20 seconds)
$ ingero explain --per-process --since 1hProcess Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
python3 (train.py) PID 3821:
cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
cudaMalloc | 206 calls | p50=65µs | p99=2.1ms
cuLaunchKernel | 17,509 calls | p50=12µs | p99=890µs
⚠ Off-CPU: 62.0s / 120s (51.7%)
⚠ Context switches: 14,504
pt_data_worker:0 PID 3822:
⚠ Off-CPU: 31.4s / 120s (26.2%)
⚠ Worst stall: 609ms
prometheus-node-exporter PID 1205:
⚠ Context switches: 3,201
⚠ CPU stolen: 8.7sThe training process and all 4 DataLoader workers are fighting for CPU with the monitoring stack. The worst single scheduling stall is 609ms, over half a second where a data worker was frozen while the GPU sat idle.
Step 3: Fix it (30 seconds)
# Pin training to dedicated cores (leave 2 for monitoring + OS)
$ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py
# Or: deprioritize monitoring agents
$ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporterOr better yet, add to the DaemonSet config:
# training pod
resources:
limits:
cpu: "6"
requests:
cpu: "6"After the fix:
- Context switches on training: 14,504 → 890
cudaStreamSyncp99: 7.2ms → 45µs- Pipeline throughput: restored to expected rate
- SLA: back on track. ML engineer: still sleeping.
AI-Assisted Investigation with MCP
Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. Teams using Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly:
SRE asks Claude: “Why is the GPU training pipeline slow on prod-gpu-01?”
Claude queries the MCP server and responds:
I found a HIGH severity causal chain on prod-gpu-01. The training process (PID 3821) was off-CPU for 51.7% of wall-clock time due to CPU scheduling contention. 14,504 context switches occurred, primarily caused by 4 DataLoader workers and monitoring agents (prometheus-node-exporter, fluent-bit) competing for 8 CPU cores.
The GPU was effectively starved. cudaStreamSync p99 inflated from 7µs to 7.2ms (1,028x), and CUDA throughput dropped 47%.
Recommended fix: Pin the training process to dedicated cores with
taskset -c 0-5, and setnice -n 19for monitoring agents.
This turns a 2-hour debugging session into a 30-second conversation.
Deployment: Familiar SRE Patterns
The tracer deploys like any other observability agent in a K8s stack:
# Helm install (DaemonSet + RBAC)
helm install ingero ./deploy/helm/ingero \
--set prometheus.enabled=true \
--set otlp.enabled=true
# Or standalone
sudo ./bin/ingero trace --stack --prometheus :9090What this provides:
- DaemonSet: Runs on every GPU node automatically
- Prometheus /metrics: GPU latency percentiles, causal chain counts, and plug into existing Grafana
- OTLP export: Send traces to an existing backend (Jaeger, Tempo, Honeycomb)
- MCP server: AI-assisted investigation via Claude, Cursor, etc.
- SQLite local storage: 10GB rolling, auto-prunes old events. No external database needed
- Pod metadata: Enriches events with K8s pod name, namespace, container ID
It slots into an existing monitoring stack. No rip-and-replace.
What the tracer Sees That DCGM Can’t
| Signal | DCGM / nvidia-smi | Ingero |
|---|---|---|
| GPU utilization % | Yes (misleading) | Yes (with causal context) |
| Per-CUDA-call latency | No | Yes (p50/p95/p99 for every API call) |
| CPU scheduling delays | No | Yes (sched_switch tracepoints) |
| DataLoader worker stalls | No | Yes (per-process off-CPU time) |
| Memory pressure → GPU impact | No | Yes (mm_page_alloc + CUDA correlation) |
| Disk I/O → GPU stalls | No | Yes (block_rq + CUDA correlation) |
| Network → distributed training | No | Yes (tcp_retransmit + CUDA correlation) |
| Root cause chain | No | Yes (automated causal chains with fix recommendations) |
| Python source line attribution | No | Yes (CPython frame extraction with –stack) |
The SRE Value Proposition
For SREs managing GPU infrastructure, the tracer answers three questions:
- Incident response: “Why is the GPU slow right now?” → Causal chain in 60 seconds
- Capacity planning: “Are we actually using these GPUs efficiently?” → Real compute efficiency, not nvidia-smi lies
- Cost attribution: “Which team’s workload is causing contention?” → Per-process, per-namespace breakdown
No need to understand CUDA or ML model architectures. The tracer translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.
Try It Yourself
No GPU required to see the pattern:
# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
# 2. Try the demos
./bin/ingero demo incident # See a causal chain form in real-time
./bin/ingero demo cpu-contention # CPU scheduling causing GPU stallsFor the GPU tracing:
sudo ./bin/ingero check # Verify system compatibility
sudo ./bin/ingero trace --stack # Start tracing (runs continuously)
./bin/ingero explain --since 5min # See causal chainsGitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.
If you are seeing GPU incidents in your own workloads, we’d love to take a look. Drop an issue on GitHub and we will gladly dive into it together.
Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
