Remote Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

Posted 4 months ago

Description:

Seeking an experienced Site Reliability Engineer to build and operate hybrid infrastructure for AI/ML research and product development.
Responsibilities include architecting, building, and maintaining platforms on AWS and in bare-metal data centers using Kubernetes and Terraform.

Requirements:

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
Proven experience with Terraform and Kubernetes in large-scale environments.
Familiarity with HPC job schedulers such as Slurm for GPU workloads.
Strong scripting skills in languages such as Python, Go, or Bash.
