gpu-observability

Cluster-level GPU tracing diagram: eight ranks across two hosts (node A, node B) running ncclAllReduce, with rank 5 entering the barrier 290ms late and the other seven ranks blocked-in-ncclAllReduce while reading 95-99% utilization, fan-in into a single Ingero Echo DuckDB store on the right
eBPF, GPU Debugging, GPU Observability, MLOps

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization and every per-host eBPF trace looks clean. The cause is rank 5 entering the barrier 290ms late. We walk through a cluster-level fan-in proof: 2,000 events from two nodes, fan-in into one DuckDB, queries that surface the straggler.

Scroll to Top