GPU Observability for Workloads That Cannot Phone Home
Federal, defense, regulated finance, and on-prem ML clusters share one constraint: telemetry cannot leave the host. Most GPU observability stacks were built for the opposite assumption.
An observability agent on every host costs RAM, CPU, security review, and upgrade churn at fleet scale. eBPF puts the instrumentation in the kernel, once per host, available to every process above it.
The eBPF uprobe pattern that catches CUDA runtime calls on libcudart.so is silicon-agnostic at the kernel layer. AMD ROCm exposes a parallel surface on libhip.so. Here is what that surface looks like and what it does not show.
Peer comparison across GPU ranks reveals stragglers that nvidia-smi hides. eBPF traces show 20-35% of cluster compute wasted on cohorts that finish iterations slower than their peers, with no single GPU looking unhealthy.
We gave Claude access to 10,869 CUDA Runtime API events via MCP. It found the root cause of a 124x DataLoader slowdown in 47 seconds by running SQL against a live eBPF trace database.
A single straggling node held up a 4-node distributed training job. We found it by fanning out one eBPF-powered SQL query to all four nodes and getting the answer in under a second. No central service, no Prometheus, just the same single-binary agent on each machine.
CUDA graphs collapse hundreds of kernel launches into one opaque cudaGraphLaunch call, creating a blind spot for GPU observability. We traced graph lifecycle events with eBPF uprobes and found pool exhaustion, re-capture storms, and CPU contention hiding in plain sight.
MiniMax M2.7 running locally via Ollama, connected to a real GPU trace database through MCP. No Claude, no cloud API keys. The model found why vLLM blocked all requests for 11 seconds.
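To make the peer-comparison idea from the straggler entry above concrete, here is a minimal sketch in Python. It is not the tooling described in these posts. The function names, the 15% slowness threshold, and the per-rank timings are all hypothetical, and it assumes a synchronous training step in which every rank waits for the slowest one.

```python
from statistics import median


def find_stragglers(iter_ms, slow_ratio=1.15):
    """Return ranks whose iteration time exceeds the peer median by slow_ratio.

    slow_ratio is an assumed threshold for illustration, not a recommended value.
    """
    baseline = median(iter_ms.values())
    return sorted(rank for rank, t in iter_ms.items() if t > baseline * slow_ratio)


def wasted_fraction(iter_ms):
    """Fraction of total rank-time spent idle, waiting on the slowest rank.

    Assumes a synchronous step: every rank is held until the slowest finishes.
    """
    slowest = max(iter_ms.values())
    idle = sum(slowest - t for t in iter_ms.values())
    return idle / (slowest * len(iter_ms))


# Hypothetical per-rank iteration times in milliseconds (made-up numbers).
iter_ms = {0: 100.0, 1: 102.0, 2: 98.0, 3: 140.0}

stragglers = find_stragglers(iter_ms)
waste = wasted_fraction(iter_ms)
print(stragglers, round(waste, 3))
```

Each rank is judged against the median of its peers rather than against an absolute health threshold, which is why a rank can be flagged here even though its own utilization numbers would look normal in isolation.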