From TCP Retransmits to MCP-Driven Cluster Investigations: An eBPF GPU Agent Retrospective
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
Seven weeks, ten releases. How an eBPF GPU agent grew an MCP tool surface that drives NCCL stall investigations from TCP retransmits to cluster scale.
3am page: GPU training pipeline missed its SLA. Datadog showed 95% GPU utilization, nvidia-smi agreed, everything looked green. Kernel-level tracing told a different story, and the root cause was visible in 60 seconds.