Description:

Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
Build automation and tooling to streamline cluster deployments and integrations.
Debug customer issues across networking, storage, scheduling, and system layers.
Improve reliability and scalability of both training and inference infrastructure.
Design and implement monitoring, alerting, and observability for critical systems.
Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
Participate in on-call rotations and incident response, and lead postmortems and reliability improvements.

Requirements:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Strong Linux systems and networking fundamentals.
Deep experience with Kubernetes and container orchestration at scale.
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
Strong automation and scripting skills (Python, Go, or Bash).
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
Track record of operating production systems and leading incident response.

Ownership and autonomy to shape how systems run.
Opportunity to work directly with customers and providers while building reliable, scalable AI infrastructure.

Skills