GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

TL;DR

3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with DataLoader workers, starving the GPU. A taskset fix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute.

The 3am Page Every GPU SRE Dreads

GPU incident response starts with a page that makes no sense. PagerDuty fires:

[CRITICAL] GPU Training Pipeline SLA Breached
Cluster: prod-gpu-01 (8x H100)
Job: nightly-retraining-v3
Expected completion: 02:00 UTC
Current status: 47% complete at 03:12 UTC

The monitoring stack:

Datadog GPU Dashboard:

GPU Utilization:  95%  ✅
GPU Memory:       78%  ✅
GPU Temperature:  72°C ✅
Power Draw:       680W ✅

Grafana (DCGM Exporter):

dcgm_gpu_utilization:     0.95  ✅
dcgm_fb_used:             62GB  ✅
dcgm_sm_clock:            1980MHz ✅

nvidia-smi:

+-------------------------------------------+
| GPU  Name     | GPU-Util | Memory-Usage    |
|===============+==========+=================|
|   0  H100 SXM |    97%  | 62000MiB / 80GB |
+-------------------------------------------+

Every single dashboard says the GPU is fine. A breached SLA and zero signal to work with.

This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:

  1. Wake the ML engineer (who’ll spend 2 hours adding print statements)
  2. Restart the job and hope it goes faster (it won’t)
  3. Stare at dashboards that all say green

Why GPU Dashboards Lie to SREs

Every GPU monitoring tool in the stack (Datadog, Grafana, DCGM, nvidia-smi) reports the same underlying metric: “did the GPU have at least one kernel scheduled?”

That metric is useless for diagnosis. It’s like monitoring a restaurant by checking “is someone sitting at each table?” without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and the dashboard would still say “97% utilized.”

The real problems that cause GPU SLA breaches are host-side:

  • CPU scheduling contention starving the data pipeline
  • DataLoader workers preempted by monitoring agents (ironic)
  • Memory pressure causing page faults in the data loading path
  • Disk I/O bottlenecks blocking the next training batch
  • Network retransmits stalling distributed training

These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. GPU dashboards are structurally blind to the most common causes of GPU performance degradation.

60-Second GPU Incident Response with eBPF

The tracer is eBPF-based and captures both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.

It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.

Here’s what incident response looks like:

Step 1: Get the causal chain (10 seconds)

$ ingero explain --since 1h
System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak
  Contributing: 4 DataLoader workers + prometheus-node-exporter
                + fluent-bit competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 monitoring agents

There it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.

nvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed.

Step 2: Identify the culprits (20 seconds)

$ ingero explain --per-process --since 1h
Process Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

python3 (train.py) PID 3821:
  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms
  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs
  ⚠ Off-CPU: 62.0s / 120s (51.7%)
  ⚠ Context switches: 14,504

pt_data_worker:0 PID 3822:
  ⚠ Off-CPU: 31.4s / 120s (26.2%)
  ⚠ Worst stall: 609ms

prometheus-node-exporter PID 1205:
  ⚠ Context switches: 3,201
  ⚠ CPU stolen: 8.7s

The training process and all 4 DataLoader workers are fighting for CPU with the monitoring stack. The worst single scheduling stall is 609ms, over half a second where a data worker was frozen while the GPU sat idle.

Step 3: Fix it (30 seconds)

# Pin training to dedicated cores (leave 2 for monitoring + OS)
$ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py

# Or: deprioritize monitoring agents
$ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporter

Or better yet, add to the DaemonSet config:

# training pod
resources:
  limits:
    cpu: "6"
  requests:
    cpu: "6"

After the fix:

  • Context switches on training: 14,504 → 890
  • cudaStreamSync p99: 7.2ms → 45µs
  • Pipeline throughput: restored to expected rate
  • SLA: back on track. ML engineer: still sleeping.

AI-Assisted Investigation with MCP

Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. Teams using Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly:

SRE asks Claude: “Why is the GPU training pipeline slow on prod-gpu-01?”

Claude queries the MCP server and responds:

I found a HIGH severity causal chain on prod-gpu-01. The training process (PID 3821) was off-CPU for 51.7% of wall-clock time due to CPU scheduling contention. 14,504 context switches occurred, primarily caused by 4 DataLoader workers and monitoring agents (prometheus-node-exporter, fluent-bit) competing for 8 CPU cores.

The GPU was effectively starved. cudaStreamSync p99 inflated from 7µs to 7.2ms (1,028x), and CUDA throughput dropped 47%.

Recommended fix: Pin the training process to dedicated cores with taskset -c 0-5, and set nice -n 19 for monitoring agents.

This turns a 2-hour debugging session into a 30-second conversation.

Deployment: Familiar SRE Patterns

The tracer deploys like any other observability agent in a K8s stack:

# Helm install (DaemonSet + RBAC)
helm install ingero ./deploy/helm/ingero \
  --set prometheus.enabled=true \
  --set otlp.enabled=true

# Or standalone
sudo ./bin/ingero trace --stack --prometheus :9090

What this provides:

  • DaemonSet: Runs on every GPU node automatically
  • Prometheus /metrics: GPU latency percentiles, causal chain counts, and plug into existing Grafana
  • OTLP export: Send traces to an existing backend (Jaeger, Tempo, Honeycomb)
  • MCP server: AI-assisted investigation via Claude, Cursor, etc.
  • SQLite local storage: 10GB rolling, auto-prunes old events. No external database needed
  • Pod metadata: Enriches events with K8s pod name, namespace, container ID

It slots into an existing monitoring stack. No rip-and-replace.

What the tracer Sees That DCGM Can’t

SignalDCGM / nvidia-smiIngero
GPU utilization %Yes (misleading)Yes (with causal context)
Per-CUDA-call latencyNoYes (p50/p95/p99 for every API call)
CPU scheduling delaysNoYes (sched_switch tracepoints)
DataLoader worker stallsNoYes (per-process off-CPU time)
Memory pressure → GPU impactNoYes (mm_page_alloc + CUDA correlation)
Disk I/O → GPU stallsNoYes (block_rq + CUDA correlation)
Network → distributed trainingNoYes (tcp_retransmit + CUDA correlation)
Root cause chainNoYes (automated causal chains with fix recommendations)
Python source line attributionNoYes (CPython frame extraction with –stack)

The SRE Value Proposition

For SREs managing GPU infrastructure, the tracer answers three questions:

  1. Incident response: “Why is the GPU slow right now?” → Causal chain in 60 seconds
  2. Capacity planning: “Are we actually using these GPUs efficiently?” → Real compute efficiency, not nvidia-smi lies
  3. Cost attribution: “Which team’s workload is causing contention?” → Per-process, per-namespace breakdown

No need to understand CUDA or ML model architectures. The tracer translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.

Try It Yourself

No GPU required to see the pattern:

# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. Try the demos
./bin/ingero demo incident          # See a causal chain form in real-time
./bin/ingero demo cpu-contention    # CPU scheduling causing GPU stalls

For the GPU tracing:

sudo ./bin/ingero check              # Verify system compatibility

sudo ./bin/ingero trace --stack      # Start tracing (runs continuously)

./bin/ingero explain --since 5min    # See causal chains

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are seeing GPU incidents in your own workloads, we’d love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related reading

Leave a Comment

Scroll to Top