GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
Inference-platform writeups optimize for p90 TTFT graphs. The dimensions that matter operationally – tail variance past p90, per-rank skew on multi-GPU, per-tenant attribution – are usually absent. Here’s why, and what eBPF on the host adds.
MCP servers are shipping at a pace where the agent-side caller no longer knows which kernel resources its tool calls are actually touching. An eBPF view of the same call returns the answer in one trace.
Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level fan-in proof: 2,000 events from two nodes, fanned into one DuckDB database, and queries that surface the straggler.
Ingero Fleet v0.10 FOSS shipped this week. We ran it end-to-end on two three-node Lambda Cloud clusters, one Ampere (A100) and one Grace Hopper (GH200, arm64), injected a single straggler on each, and measured detection latency: 26 seconds on A100, ~30 seconds on GH200. Same code, same manifests, one wrinkle on GH200.
Peer comparison across GPU ranks reveals stragglers that nvidia-smi hides. eBPF traces show 20-35% of cluster compute wasted on cohorts that finish iterations slower than their peers, with no single GPU looking unhealthy.
A healthy vLLM server, normal nvidia-smi output, 11 seconds to first token. eBPF uprobes on the CUDA driver and kernel tracepoints on the scheduler traced it to prefix caching head-of-line blocking.
We gave Claude access to 10,869 CUDA Runtime API events via MCP. It found the root cause of a 124x DataLoader slowdown in 47 seconds by running SQL against a live eBPF trace database.
Claude Code with MCP access to kernel-level GPU traces, pointed at a real performance anomaly. The full investigation session: what the agent called, what it found, and where it got things right and wrong.
MCP is becoming the interface between AI agents and infrastructure data. Three implementations examined side by side: Datadog, Qualys, and a kernel-level tracer that lets agents query eBPF tracepoints directly.
A single straggling node held up a 4-node distributed training job. We found it by fanning out one eBPF-powered SQL query to all four nodes and getting the answer in under a second. No central service, no Prometheus — just the same single-binary agent on each machine.
CUDA graphs collapse hundreds of kernel launches into one opaque cudaGraphLaunch call, creating a blind spot for GPU observability. We traced graph lifecycle events with eBPF uprobes and found pool exhaustion, re-capture storms, and CPU contention hiding in plain sight.
PyTorch DataLoader with 8 workers: 124x slower than direct tensor indexing. Kernel-level tracing of the full execution path revealed CPU over-subscription, context switching storms, and CUDA kernel launch latency spikes up to 356x above baseline.