Description:

Design, operate, and debug large-scale GPU infrastructure for distributed training and inference.
Serve as the primary technical point of contact for customers with large-scale training workloads.
Ensure reliability and performance of GPU infrastructure, including capacity planning and SLO definition.
Build observability into GPU utilization and performance metrics.
Lead incident response for complex failures across hardware and software layers.

Requirements:

Deep experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
Production experience with high-performance networking (InfiniBand, RoCE, NVLink).
Knowledge of distributed training and ML frameworks (NCCL, CUDA, PyTorch, etc.).
Expert-level Linux knowledge, including kernel tuning and driver management.
Strong experience with Kubernetes and orchestration for GPU workloads.
Proficient in automation and software engineering (Python, Go, Bash).
Experience building monitoring and alerting systems for GPU infrastructure.
Proven track record in incident management for complex distributed systems.

Significant ownership and autonomy in shaping foundational systems.
Opportunity to influence technical direction and operations of AI infrastructure.
Work directly with customers and providers in a high-impact role.

Skills