GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
The eBPF uprobe pattern that catches CUDA runtime calls on libcudart.so is silicon-agnostic at the kernel layer. AMD ROCm exposes a parallel surface on libhip.so. Here is what that surface looks like and what it does not show.
MCP exposes the agent’s actions: which tools, which arguments, which return values. eBPF exposes the kernel-level cause behind the latency the agent surfaced. We walk through a transcript that uses both.
MCP servers are shipping at a pace where the agent-side caller no longer knows which kernel resources its tool calls are actually touching. An eBPF view of the same call returns the answer in one trace.
Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level proof: 2,000 events from two nodes, fanned in to one DuckDB database, and the queries that surface the straggler.
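A minimal sketch of that kind of fan-in query, under loud assumptions: the events are synthetic, the table and column names (`events`, `rank_id`, `iter`, `enter_ms`) are hypothetical, and Python's built-in `sqlite3` stands in for DuckDB so the example runs without extra dependencies. It merges barrier-entry events from two hosts into one table and orders ranks by how far behind the first arrival they enter the barrier.

```python
import sqlite3

# Synthetic barrier-entry events: (host, rank_id, iter, enter_ms).
# Ranks 0-3 sit on host-a, ranks 4-7 on host-b (hypothetical layout).
def make_events(iterations=5, step_ms=1000, straggler=5, delay_ms=290):
    events = []
    for it in range(iterations):
        for rank_id in range(8):
            host = "host-a" if rank_id < 4 else "host-b"
            jitter = (rank_id * 7) % 5  # small deterministic spread, 0-4 ms
            late = delay_ms if rank_id == straggler else 0
            events.append((host, rank_id, it, it * step_ms + jitter + late))
    return events

# One in-memory database plays the role of the fan-in target.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (host TEXT, rank_id INTEGER, iter INTEGER, enter_ms INTEGER)"
)
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", make_events())

# For each iteration, find the first arrival; then average each rank's lag.
rows = db.execute("""
    SELECT e.host, e.rank_id, AVG(e.enter_ms - f.first_ms) AS avg_late_ms
    FROM events AS e
    JOIN (SELECT iter, MIN(enter_ms) AS first_ms FROM events GROUP BY iter) AS f
      ON f.iter = e.iter
    GROUP BY e.host, e.rank_id
    ORDER BY avg_late_ms DESC
""").fetchall()

for host, rank_id, avg_late_ms in rows:
    print(f"{host} rank {rank_id}: {avg_late_ms:.0f} ms behind first arrival")
```

The point of the sketch is that the straggler only becomes visible once events from both hosts share one table: lateness is defined relative to peers, so no single-host view can compute it.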
nvidia-smi reads 97% while throughput falls 3x in the same window. GPU utilization is a duty-cycle counter, not a measure of useful work. The cause-side data lives one layer down: kernel-runtime spreads, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.
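To make the duty-cycle point concrete, here is a small sketch with synthetic kernel intervals (the numbers and the `duty_cycle` helper are illustrative assumptions, not output from any real tool). It treats utilization as the fraction of a sampling window in which at least one kernel was running, and compares two windows that score the same while completing different amounts of work.

```python
def duty_cycle(intervals, window_start, window_end):
    """Fraction of the window during which at least one kernel was running."""
    busy = 0.0
    cursor = window_start  # everything before cursor is already counted
    for start, end in sorted(intervals):
        start = max(start, cursor)
        end = min(end, window_end)
        if end > start:
            busy += end - start
            cursor = end
    return busy / (window_end - window_start)

WINDOW_MS = 900

# Healthy: nine iterations, each a 97 ms kernel followed by a 3 ms gap.
healthy = [(i * 100, i * 100 + 97) for i in range(9)]

# Degraded: three iterations, each kernel stretched to 291 ms with a 9 ms gap.
degraded = [(i * 300, i * 300 + 291) for i in range(3)]

healthy_util = duty_cycle(healthy, 0, WINDOW_MS)
degraded_util = duty_cycle(degraded, 0, WINDOW_MS)

print(f"healthy:  {healthy_util:.0%} busy, {len(healthy)} iterations")
print(f"degraded: {degraded_util:.0%} busy, {len(degraded)} iterations")
```

Both windows report the same busy fraction, yet the second completes a third of the iterations, which is why the counter alone cannot distinguish them.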
Peer comparison across GPU ranks reveals stragglers that nvidia-smi hides. eBPF traces show 20-35% of cluster compute wasted on cohorts that finish iterations slower than their peers, with no single GPU looking unhealthy.
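One way to put a number on that waste, sketched under the assumption of a synchronous step where every rank is held until the slowest one finishes. The step times below are synthetic and `wasted_fraction` is a hypothetical helper, not part of any tool named here.

```python
def wasted_fraction(step_times):
    """step_times maps rank -> list of per-iteration seconds (equal lengths)."""
    ranks = list(step_times)
    iterations = len(step_times[ranks[0]])
    paid = 0.0  # GPU-seconds reserved: each rank is held for the slowest peer
    used = 0.0  # GPU-seconds actually spent computing
    for i in range(iterations):
        times = [step_times[r][i] for r in ranks]
        paid += max(times) * len(times)
        used += sum(times)
    return 1.0 - used / paid

# Synthetic cohort: three ranks at 1.0 s per step, one at 1.5 s.
step_times = {
    0: [1.0, 1.0, 1.0],
    1: [1.0, 1.0, 1.0],
    2: [1.0, 1.0, 1.0],
    3: [1.5, 1.5, 1.5],
}

wasted = wasted_fraction(step_times)
print(f"wasted compute: {wasted:.0%}")
```

In this toy cohort no rank looks broken in isolation; the loss only appears when each rank's step time is compared against its peers.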
A single straggling node held up a 4-node distributed training job. We found it by fanning out one eBPF-powered SQL query to all four nodes and getting the answer in under a second. No central service, no Prometheus: just the same single-binary agent on each machine.
CUDA graphs collapse hundreds of kernel launches into one opaque cudaGraphLaunch call, creating a blind spot for GPU observability. We traced graph lifecycle events with eBPF uprobes and found pool exhaustion, re-capture storms, and CPU contention hiding in plain sight.
PyTorch DataLoader with 8 workers ran 124x slower than direct tensor indexing. Kernel-level tracing of the full execution path revealed CPU over-subscription, context-switching storms, and CUDA kernel launch latency spikes up to 356x above baseline.