Remote Senior Site Reliability Engineer - AI Infrastructure

Posted 2 months ago

Please let Andromeda Cluster know you found this job on RemoteYeah.

Description:

  • Design, operate, and debug large-scale GPU infrastructure for distributed training and inference.
  • Serve as the primary technical point of contact for customers with large-scale training workloads.
  • Ensure reliability and performance of GPU infrastructure, including capacity planning and SLO definition.
  • Build observability into GPU utilization and performance metrics.
  • Lead incident response for complex failures across hardware and software layers.

Requirements:

  • Deep experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
  • Production experience with high-performance networking (InfiniBand, RoCE, NVLink).
  • Knowledge of distributed training and ML frameworks (NCCL, CUDA, PyTorch, etc.).
  • Expert-level Linux knowledge, including kernel tuning and driver management.
  • Strong experience with Kubernetes and orchestration for GPU workloads.
  • Proficiency in automation and software engineering (Python, Go, Bash).
  • Experience building monitoring and alerting systems for GPU infrastructure.
  • Proven track record in incident management for complex distributed systems.

Benefits:

  • Significant ownership and autonomy in shaping foundational systems.
  • Opportunity to influence technical direction and operations of AI infrastructure.
  • Work directly with customers and providers in a high-impact role.

Degree requirement

No degree required
