GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
The eBPF uprobe pattern that catches CUDA runtime calls on libcudart.so is silicon-agnostic at the kernel layer. AMD ROCm exposes a parallel surface on libhip.so. Here is what that surface looks like and what it does not show.
Inference-platform writeups optimize for p90 TTFT graphs. The dimensions that matter operationally – tail variance past p90, per-rank skew on multi-GPU, per-tenant attribution – are usually absent. Here’s why, and what eBPF on the host adds.
MCP exposes the agent’s actions: which tools, which arguments, which return values. eBPF exposes the kernel-level cause behind the latency the agent surfaced. We walk through a transcript that uses both.
MCP servers are shipping at a pace where the agent-side caller no longer knows which kernel resources its tool calls are actually touching. An eBPF view of the same call returns the answer in one trace.
Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level fan-in proof: 2,000 events from two nodes, fan-in into one DuckDB, queries that surface the straggler.
nvidia-smi reads 97% while throughput falls 3x in the same window. GPU utilization is a duty-cycle counter, not a measure of useful work. The cause-side data lives one layer down: kernel-runtime spreads, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.
3am page: GPU training pipeline missed its SLA. Datadog showed 95% GPU utilization, nvidia-smi agreed, everything looked green. Kernel-level tracing told a different story, and the root cause was visible in 60 seconds.