Please let FluidStack know you found this job on RemoteYeah.
This helps us grow 🌱.
Description:
The SRE/HPC Engineer at Fluidstack is responsible for ensuring peak performance of the GPU infrastructure and providing top-tier support to customers.
Responsibilities include deploying new clusters every month, automating processes to support scalability, and providing client-facing support for tasks such as GPU debugging and performance optimization.
The role involves working with top AI companies like Poolside, Meta, Modal, and Reka.
Requirements:
Experience in HPC systems, System Administration, SRE, or DevOps.
Proficiency in managing large-scale workloads with orchestrators like Slurm or Kubernetes.
Ability to automate processes for bare-metal machines and containers using tools like Ansible, Bash, or Python.
Familiarity with shared storage platforms such as NFS, DDN, VAST, and CephFS.
Experience provisioning large-scale clusters and networks with tools like BCM or UFM.
Knowledge of large-scale GPU systems, including NVIDIA GPUs and InfiniBand networks.
Must be a fast learner, adaptable, and passionate about Fluidstack’s mission.
Benefits:
Opportunity to work with top AI companies in the industry.
Chance to contribute to the growth and scalability of Fluidstack's GPU infrastructure.
Client-facing role providing exposure to diverse challenges and problem-solving opportunities.
Continuous learning and development in a dynamic and innovative environment.