GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
The eBPF uprobe pattern that catches CUDA runtime calls on libcudart.so is silicon-agnostic at the kernel layer. AMD ROCm exposes a parallel surface on libhip.so. Here is what that surface looks like and what it does not show.
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
Inference-platform writeups optimize for p90 TTFT graphs. The dimensions that matter operationally – tail variance past p90, per-rank skew on multi-GPU, per-tenant attribution – are usually absent. Here’s why, and what eBPF on the host adds.
MCP exposes the agent’s actions: which tools, which arguments, which return values. eBPF exposes the kernel-level cause behind the latency the agent surfaced. We walk through a transcript that uses both.
MCP servers are shipping at a pace where the agent-side caller no longer knows which kernel resources its tool calls are actually touching. An eBPF view of the same call returns the answer in one trace.
Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level fan-in proof: 2,000 events from two nodes, fanned into one DuckDB database, with queries that surface the straggler.
nvidia-smi reads 97% while throughput falls 3x in the same window. GPU utilization is a duty-cycle counter, not a measure of useful work. The cause-side data lives one layer down: kernel-runtime spreads, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.
Ingero Fleet v0.10 FOSS shipped this week. We ran it end-to-end on two three-node Lambda Cloud clusters, one Ampere (A100) and one Grace Hopper (GH200, arm64), injected a single straggler on each, and measured detection latency: 26 seconds on A100, ~30 seconds on GH200. Same code, same manifests, one wrinkle on GH200.
Peer comparison across GPU ranks reveals stragglers that nvidia-smi hides. eBPF traces show 20-35% of cluster compute wasted on cohorts that finish iterations slower than their peers, with no single GPU looking unhealthy.
A healthy vLLM server, normal nvidia-smi output, 11 seconds to first token. eBPF uprobes on the CUDA driver and kernel tracepoints on the scheduler traced it to prefix caching head-of-line blocking.
We gave Claude access to 10,869 CUDA Runtime API events via MCP. It found the root cause of a 124x DataLoader slowdown in 47 seconds by running SQL against a live eBPF trace database.
Claude Code with MCP access to kernel-level GPU traces, pointed at a real performance anomaly. The full investigation session: what the agent called, what it found, and where it got things right and wrong.
MCP is becoming the interface between AI agents and infrastructure data. Three implementations examined side by side: Datadog, Qualys, and a kernel-level tracer that lets agents query eBPF tracepoints directly.