GPU Causal Observability — Open-Source eBPF Agent
Open source GPU observability, traced end to end. An eBPF agent that follows the full chain — Linux kernel through CUDA API to your Python source lines. Find out why your GPU is slow, not just that it is.
Open Source GPU Observability: What It Does
Ingero attaches eBPF uprobes to your running CUDA processes — no SDK, no code changes, no restart. It traces GPU API calls, correlates them with host kernel events, and outputs causal chains explaining root causes.
NEW (May 2026): v0.10.1 -> Agent v0.16.0 + Fleet v1.0.0: multiple new features and fixes shipped. Read more…
NEW (April 2026): We shipped Ingero Fleet (OTEL Collector) for cluster-wide GPU observability across all nodes, plus multi-node investigations and CUDA Graphs tracing.
CUDA Runtime + Driver
14 interception points on libcudart.so and libcuda.so. Traces cudaMalloc, cudaLaunchKernel, cudaMemcpy, cudaStreamSync, cuLaunchKernel, and more. Sees the kernel launches that cuBLAS/cuDNN make directly.
Host Kernel Tracepoints
CPU scheduling (sched_switch, sched_wakeup), memory pressure (mm_page_alloc, oom_kill), process lifecycle, block I/O, TCP retransmits, network socket I/O. 6 eBPF sensors total.
Causal Engine
Correlates events across layers by timestamp and PID. Outputs root cause chains with severity ranking and fix recommendations. Processes 24K+ events/sec through a 7-tier selective filter.
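The correlation step can be illustrated with a small sketch. This is not Ingero's implementation: the `Event` record, the `correlate` helper, and the 1 ms padding window are all hypothetical, chosen only to show what "correlate by timestamp and PID" can look like. Each CUDA call is paired with the host events from the same PID that fall inside the call's time interval.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    ts_ns: int       # event start time in nanoseconds
    pid: int         # process that produced the event
    layer: str       # "cuda" or "host" (hypothetical labels)
    name: str        # e.g. "cudaStreamSync", "sched_switch"
    dur_ns: int = 0  # duration, 0 for point events

def correlate(events, window_ns=1_000_000):
    """Pair each CUDA call with same-PID host events inside its padded interval."""
    host_by_pid = {}
    for e in events:
        if e.layer == "host":
            host_by_pid.setdefault(e.pid, []).append(e)

    chains = []
    for e in events:
        if e.layer != "cuda":
            continue
        start = e.ts_ns - window_ns
        end = e.ts_ns + e.dur_ns + window_ns
        related = [h for h in host_by_pid.get(e.pid, []) if start <= h.ts_ns <= end]
        if related:
            chains.append((e, related))
    return chains

events = [
    Event(1_000_000, 42, "cuda", "cudaStreamSync", dur_ns=5_000_000),
    Event(2_000_000, 42, "host", "sched_switch"),    # same PID, inside the call
    Event(3_000_000, 99, "host", "sched_switch"),    # different PID, ignored
    Event(50_000_000, 42, "host", "sched_switch"),   # same PID, too late, ignored
]

chains = correlate(events)
for call, related in chains:
    print(call.name, "<-", [h.name for h in related])
```

A real engine would also need severity ranking and a bounded lookup structure; the linear scan here is only for readability.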
Quick Start: Ingero eBPF Agent for GPU Observability
Binary release, recommended:
# Linux amd64
VERSION=0.16.0
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/
# Linux arm64 (GH200 / Grace Hopper, Graviton)
VERSION=0.16.0
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_arm64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/
A Docker image is also available. Or build from source:
git clone https://github.com/ingero-io/ingero.git # clone the repo
cd ingero
bash scripts/install-deps.sh # install dependencies: clang-14, Go, eBPF toolchain
source ~/.bashrc # update your env
make
# [optional] Check your system
./bin/ingero check
# [optional] Try Ingero (auto-detects GPU)
./bin/ingero demo
# Trace live CUDA workloads and CPU-GPU interactions (requires sudo for eBPF access + NVIDIA GPU)
sudo ./bin/ingero trace
Single binary. No dependencies. Works on any Linux 5.15+ kernel with NVIDIA driver 550+. Also available as a K8s DaemonSet.
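The stated requirements (Linux 5.15+ and NVIDIA driver 550+) can be checked by hand before installing. The sketch below is not what `ingero check` does internally; the helper names are hypothetical, and it only compares version strings that you supply, for example the output of `uname -r` and the driver version reported by `nvidia-smi`.

```python
import platform
import re

MIN_KERNEL = (5, 15)   # from the requirement stated above
MIN_DRIVER_MAJOR = 550

def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def kernel_supported(release, minimum=MIN_KERNEL):
    return parse_kernel_release(release) >= minimum

def driver_supported(version, minimum_major=MIN_DRIVER_MAJOR):
    """Compare the major component of a driver version such as '550.54.15'."""
    return int(version.split(".")[0]) >= minimum_major

if platform.system() == "Linux":
    release = platform.release()
    print(f"kernel {release}: supported={kernel_supported(release)}")
```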
Embedded MCP Server: Let AI Agents Investigate AI Training and Inference Issues
Ingero Agent includes an MCP server with 7 tools. Connect it to Claude, Cursor, Ollama + Qwen/MiniMax, or any other MCP-compatible assistant or model, local or remote, and ask questions about your GPU workloads, on a single node or cluster-wide via Ingero Fleet & Echo. Use the built-in /investigate prompt to call multiple Ingero MCP tools and analyze the collected runtime data and causal chains.
Engineer: "What caused the training slowdown?"
Ingero MCP → cudaStreamSync p99 spiked 29x (16ms → 472ms).
Root cause: 847 sched_switch events — logrotate preempted
training thread for 142ms cumulative off-CPU time.
Source: forward() at train.py:142
Full AI investigation session →
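The "29x" figure in the session above is the ratio of the two p99 latencies (472 ms / 16 ms = 29.5, reported as 29x). A minimal sketch of that arithmetic follows, assuming a nearest-rank percentile; the sample latencies are made up for illustration, and Ingero may compute its percentiles differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def spike_factor(baseline_ms, current_ms):
    """How many times larger the current value is than the baseline."""
    return current_ms / baseline_ms

# Hypothetical cudaStreamSync latencies in milliseconds, 100 samples each.
baseline = [10.0] * 98 + [16.0, 16.0]
incident = [12.0] * 98 + [472.0, 472.0]

p99_before = percentile(baseline, 99)
p99_after = percentile(incident, 99)
factor = spike_factor(p99_before, p99_after)
print(f"p99 {p99_before:.0f}ms -> {p99_after:.0f}ms ({int(factor)}x)")
```

Tail percentiles are the useful signal here: the median barely moves in this sample, while the p99 captures the few calls that were stalled.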
Ingero Agent’s GPU Observability Architecture
┌────────────────────────────────────────────────────────────────┐
│ User Space │
│ │
│ ┌─────────┐ ┌─────────────┐ ┌───────┐ ┌─────────────┐ │
│ │ CUDA │ │ ingero │ │SQLite │ │MCP Server │ │
│ │ App │ │ agent │─►│ DB │◄───│(stdio/HTTPS)│ │
│ │(PyTorch)│ │ │ │ │ └─────────────┘ │
│ │ │ │ │ │ │ ┌───────────┐ │
│ │ │ │ │ │ │◄──│ Dashboard │ │
│ │ │ │ │ └───────┘ │ (HTTPS) │ │
│ └──┬──┬───┘ │ ┌──────────┐│ └───────────┘ │
│ │ │ │ │ causal ││ ┌───────────┐ │
│ │ │ │ │ engine ││ │ OTLP / │ │
│ │ │ │ └──────────┘│──►│ Prometheus│ │
│ │ │ └──┬──┬──┬────┘ └───────────┘ │
│ │ │ │ │ │ ▲ │
│ │ │ │ │ │ │ ring buffers │
│─────┼──┼───────────┼──┼──┼─┼───────────────────────────────────│
│ │ ▼ │ ▼ ▼ │ │
│ │ ┌─────────┐ │ ┌────────────────────┐ │
│ │ │libcuda │◄─┤ │ eBPF uprobes │ (Driver API) │
│ │ │ .so │ │ │ cuLaunchKernel │ │
│ │ └─────────┘ │ │ cuMemcpy/Alloc │ │
│ ▼ │ └────────────────────┘ │
│ ┌─────────┐ │ ┌────────────────────┐ │
│ │libcudart│◄──────┘ │ eBPF uprobes │ (Runtime API) │
│ │ .so │◄────────│ cudaLaunchKernel │ │
│ └─────────┘ │ cudaMalloc/Memcpy │ │
│ │ Graph: Capture, │ │
│ │ Instantiate,Launch│ │
│ └────────────────────┘ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ eBPF tracepoints (sched_switch, mm_page_alloc, oom, │ │
│ │ sched_process_exec/exit/fork) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Kernel Space /proc → CPU%, Mem%, Load, Swap │
└────────────────────────────────────────────────────────────────┘
< 2% production overhead. Selective storage (not “store everything”): 100% accuracy on the live stream, ~1% volume to disk. Local SQLite, size-bounded at 10GB by default. No cloud backend required. For multi-node / cluster architecture, please refer to Ingero Fleet.
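Selective, size-bounded storage can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table schema, the single latency threshold (the actual filter is described as 7-tier and is not shown in this document), and the oldest-first pruning policy. The idea is that every event is inspected, but only the slow ones are written, and old rows are dropped once the database exceeds its byte budget.

```python
import sqlite3

def used_bytes(conn):
    """Bytes held by live pages: total pages minus the freelist, times page size."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return (page_count - free_pages) * page_size

def store_selectively(conn, events, slow_ns, max_bytes):
    """Persist only events lasting at least slow_ns, then prune oldest rows past max_bytes."""
    kept = [e for e in events if e[3] >= slow_ns]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", kept)
    conn.commit()
    while used_bytes(conn) > max_bytes:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN "
            "(SELECT rowid FROM events ORDER BY ts_ns LIMIT 100)"
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left to delete
            break
    return len(kept)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts_ns INTEGER, pid INTEGER, name TEXT, dur_ns INTEGER)"
)

# Hypothetical stream: 990 fast kernel launches and 10 slow synchronizations.
events = [(i, 42, "cudaLaunchKernel", 5_000) for i in range(990)]
events += [(990 + i, 42, "cudaStreamSync", 400_000_000) for i in range(10)]

kept = store_selectively(conn, events, slow_ns=1_000_000, max_bytes=10 * 1024**3)
stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"seen={len(events)} stored={stored}")
```

In this made-up stream, 10 of 1,000 events reach disk, which lines up with the roughly 1% volume quoted above; real ratios depend on the workload.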
From the Blog
- GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds
- GPU 97% Utilized But Training 3x Slower: What nvidia-smi Misses
- Tracing torch.cuda.empty_cache() on an RTX 4090
Get Involved
Ingero Agent (single node) and Ingero Fleet (OTEL Collector for multi-node clusters) are free and open source, dual-licensed under Apache 2.0 (Go agent) and GPL-2.0 (eBPF kernel programs).
GitHub
Source, issues, discussions → github.com/ingero-io/ingero and github.com/ingero-io/ingero-fleet
Docs
Setup, architecture, test matrix → github.com/ingero-io/ingero/docs
