GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
The eBPF uprobe pattern that catches CUDA runtime calls on libcudart.so is silicon-agnostic at the kernel layer. AMD ROCm exposes a parallel surface on libhip.so. Here is what that surface looks like and what it does not show.
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level fan-in proof: 2,000 events from two nodes, merged into one DuckDB database, and the queries that surface the straggler.
nvidia-smi reads 97% while throughput falls 3x in the same window. GPU utilization is a duty-cycle counter, not a measure of useful work. The cause-side data lives one layer down: kernel-runtime spreads, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.
Ingero Fleet v0.10 FOSS shipped this week. We ran it end-to-end on two three-node Lambda Cloud clusters, one Ampere, one Grace Hopper, injected a single straggler on each, and measured detection latency: 26 seconds on A100, ~30 seconds on GH200 (arm64). Same code, same manifests, one wrinkle on GH200.
Peer comparison across GPU ranks reveals stragglers that nvidia-smi hides. eBPF traces show 20-35% of cluster compute wasted on cohorts that finish iterations slower than their peers, with no single GPU looking unhealthy.
A healthy vLLM server, normal nvidia-smi output, 11 seconds to first token. eBPF uprobes on the CUDA driver and kernel tracepoints on the scheduler traced it to prefix caching head-of-line blocking.
We gave Claude access to 10,869 CUDA Runtime API events via MCP. It found the root cause of a 124x DataLoader slowdown in 47 seconds by running SQL against a live eBPF trace database.
A single straggling node held up a 4-node distributed training job. We found it by fanning out one eBPF-powered SQL query to all four nodes and getting the answer in under a second. No central service, no Prometheus: just the same single-binary agent on each machine.
CUDA graphs collapse hundreds of kernel launches into one opaque cudaGraphLaunch call, creating a blind spot for GPU observability. We traced graph lifecycle events with eBPF uprobes and found pool exhaustion, re-capture storms, and CPU contention hiding in plain sight.
PyTorch DataLoader with 8 workers: 124x slower than direct tensor indexing. Kernel-level tracing of the full execution path revealed CPU over-subscription, context switching storms, and CUDA kernel launch latency spikes up to 356x above baseline.
A PyTorch training loop, 13x slower than expected. torch.profiler showed nothing unusual. eBPF kernel tracing found a hidden synchronization point: NumPy triggering implicit CUDA sync on every batch.
97% GPU utilization in nvidia-smi, but training throughput was a fraction of what benchmarks promised. CUDA API tracing and kernel scheduling data showed what the GPU was actually doing during that 97%.
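Several of the posts above rest on the same idea: compare each rank against its peers instead of reading one GPU in isolation. A minimal sketch of that comparison, assuming per-rank barrier-entry timestamps have already been collected into one place. The function name, the 100 ms threshold, and the sample timestamps are hypothetical illustrations, not the agent's actual schema or data.

```python
from statistics import median


def find_stragglers(barrier_entry_ms, threshold_ms=100.0):
    """Return {rank: lateness_ms} for ranks that enter the barrier
    more than threshold_ms after the median rank.

    barrier_entry_ms: dict mapping rank -> barrier entry time in ms
    (hypothetical input shape, assumed for this sketch).
    """
    med = median(barrier_entry_ms.values())
    return {
        rank: t - med
        for rank, t in barrier_entry_ms.items()
        if t - med > threshold_ms
    }


# Made-up timestamps for eight ranks; rank 5 arrives roughly 290 ms late.
entries = {
    0: 1000.0, 1: 1002.0, 2: 1001.0, 3: 1003.0,
    4: 1000.5, 5: 1291.0, 6: 1002.5, 7: 1001.5,
}
stragglers = find_stragglers(entries)
print(stragglers)
```

The median is used as the peer baseline so that one late rank does not drag the reference point toward itself, which a mean would do.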