From TCP Retransmits to MCP-Driven Cluster Investigations: An eBPF GPU Agent Retrospective
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
MCP exposes the agent’s actions: which tools, which arguments, which return values. eBPF exposes the kernel-level cause behind the latency the agent surfaced. We walk through a transcript that uses both.
MCP servers are shipping so quickly that the agent-side caller no longer knows which kernel resources its tool calls touch. An eBPF view of the same call answers that in one trace.
We gave Claude access to 10,869 CUDA Runtime API events via MCP. It found the root cause of a 124x DataLoader slowdown in 47 seconds by running SQL against a live eBPF trace database.
Claude Code with MCP access to kernel-level GPU traces, pointed at a real performance anomaly. The full investigation session: what the agent called, what it found, and where it got things right and wrong.
MCP is becoming the interface between AI agents and infrastructure data. Three implementations examined side by side: Datadog, Qualys, and a kernel-level tracer that lets agents query eBPF tracepoints directly.
MiniMax M2.7 running locally via Ollama, connected to a real GPU trace database through MCP. No Claude, no cloud API keys. The model found why vLLM blocked all requests for 11 seconds.