Remote Site Reliability Engineer

at FluidStack

Posted 1 day ago 3 applied

Description:

  • Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises, with customers including Mistral, Poolside, Black Forest Labs, and Meta.
  • The team is small, highly motivated, and focused on providing a world-class supercomputing experience, prioritizing customer satisfaction and repeat business.
  • The company values high standards, ownership, and a positive attitude towards work and problem-solving.
  • Site Reliability Engineers (SREs) at Fluidstack are integral to the infrastructure, working across software, hardware, and operations to ensure the reliability and performance of the global GPU cloud.
  • SREs collaborate with networking, platform engineering, and data center operations teams to build scalable systems for AI workloads.
  • Responsibilities include tackling complex production issues, deploying resilient infrastructure, and improving platform stability and observability.
  • Daily tasks may involve deploying clusters of 1,000+ GPUs, validating infrastructure performance, migrating petabytes of data, debugging various issues, and building internal tools to enhance deployment efficiency.
  • The role includes participating in an on-call rotation of up to one week per month.

Requirements:

  • Candidates should have at least 2 years of experience in SRE, DevOps, Sysadmin, and/or HPC engineering.
  • Strong verbal and written communication skills in English are required.
  • Experience in deploying and operating Kubernetes and/or SLURM clusters is necessary.
  • Proficiency in programming languages such as Go, Python, and Bash is expected.
  • Familiarity with automation or Infrastructure as Code (IaC) tools like Ansible and Terraform is required.
  • A strong engineering background in fields such as Computer Science, Software Engineering, Math, or Computer Engineering is preferred.
  • Exceptional candidates may have experience building and operating AI workloads at 1,000+ GPU scale, managing multi-tenant Kubernetes services, deploying infrastructure in data centers, and managing large-scale storage systems.

Benefits:

  • The position offers a competitive total compensation package that includes cash and equity.
  • A retirement or pension plan is provided, in line with local norms.
  • Health, dental, and vision insurance are included in the benefits package.
  • A generous paid time off (PTO) policy is offered, consistent with local standards.
  • Fluidstack operates on a remote-first basis, with offices in London, New York, and San Francisco, and provides access to WeWork for employees in other locations.