Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises, with customers including Mistral, Poolside, Black Forest Labs, and Meta.
The team is small, highly motivated, and focused on providing a world-class supercomputing experience, prioritizing customer satisfaction and repeat business.
The company values high standards, ownership, and a positive attitude towards work and problem-solving.
Site Reliability Engineers (SREs) at Fluidstack are integral to the infrastructure, working across software, hardware, and operations to ensure the reliability and performance of the global GPU cloud.
SREs collaborate with networking, platform engineering, and data center operations teams to build scalable systems for AI workloads.
Responsibilities include tackling complex production issues, deploying resilient infrastructure, and improving platform stability and observability.
Daily tasks may involve deploying clusters of 1,000+ GPUs, validating infrastructure performance, migrating petabytes of data, debugging various issues, and building internal tools to enhance deployment efficiency.
The role includes participating in an on-call rotation of up to one week per month.
Requirements:
Candidates should have 2+ years of experience in SRE, DevOps, sysadmin, and/or HPC engineering.
Strong verbal and written communication skills in English are required.
Experience in deploying and operating Kubernetes and/or SLURM clusters is necessary.
Proficiency in programming languages such as Go, Python, and Bash is expected.
Familiarity with automation or Infrastructure as Code (IaC) tools such as Ansible and Terraform is required.
A strong engineering background in fields such as Computer Science, Software Engineering, Math, or Computer Engineering is preferred.
Exceptional candidates may have experience building and operating AI workloads at 1,000+ GPU scale, managing multi-tenant Kubernetes services, deploying infrastructure in data centers, and managing large-scale storage systems.
Benefits:
The position offers a competitive total compensation package that includes cash and equity.
A retirement or pension plan is provided, in line with local norms.
Health, dental, and vision insurance are included in the benefits package.
A generous paid time off (PTO) policy is offered, consistent with local standards.
Fluidstack operates on a remote-first basis, with offices in London, New York, and San Francisco, and provides access to WeWork for employees in other locations.