Remote Site Reliability Engineer - AI Infrastructure

Posted 3 months ago

Company: Andromeda Cluster

Description:

  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
  • Build automation and tooling to streamline cluster deployments and integrations.
  • Debug customer issues across networking, storage, scheduling, and system layers.
  • Improve reliability and scalability of both training and inference infrastructure.
  • Design and implement monitoring, alerting, and observability for critical systems.
  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
  • Participate in on-call and incident response, leading postmortems and reliability improvements.

Requirements:

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Strong Linux systems and networking fundamentals.
  • Deep experience with Kubernetes and container orchestration at scale.
  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
  • Strong automation and scripting skills (Python, Go, or Bash).
  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
  • Track record of operating production systems and leading incident response.

Benefits:

  • Ownership and autonomy to shape how systems run.
  • Opportunity to work directly with customers and providers while building reliable, scalable AI infrastructure.

Required experience: 5 years

Degree requirement: No degree required
