Remote SRE / HPC Engineer

Description:

  • The SRE/HPC Engineer at Fluidstack is responsible for ensuring peak performance of the GPU infrastructure and providing top-tier support to customers.
  • Responsibilities include deploying new clusters every month, automating processes to support scalability, and providing client-facing support for tasks such as GPU debugging and performance optimization.
  • The role involves working with top AI companies like Poolside, Meta, Modal, and Reka.

Requirements:

  • Experience with HPC systems, system administration, SRE, or DevOps.
  • Proficiency in managing large-scale workloads with orchestrators like Slurm or Kubernetes.
  • Ability to automate processes for bare-metal machines and containers using tools like Ansible, Bash, or Python.
  • Familiarity with shared storage platforms such as NFS, DDN, VAST, and CephFS.
  • Experience in provisioning large-scale clusters and networks with tools like BCM and UFM.
  • Knowledge of large-scale GPU systems, including working with NVIDIA GPUs and InfiniBand networks.
  • Must be a fast learner, adaptable, and passionate about Fluidstack’s mission.

Benefits:

  • Opportunity to work with top AI companies in the industry.
  • Chance to contribute to the growth and scalability of Fluidstack's GPU infrastructure.
  • Client-facing role providing exposure to diverse challenges and problem-solving opportunities.
  • Continuous learning and development in a dynamic and innovative environment.