Remote MLOps Professional Services Engineer (Cloud & AI Infra)

Description:

  • The MLOps Professional Services Engineer will design, implement, and maintain large-scale machine learning training and inference workflows for clients.
  • This role involves working closely with a Solutions Architect and support teams to provide expert guidance on ML pipeline performance and efficiency.
  • Responsibilities include designing and implementing scalable ML workflows using Kubernetes and Slurm, focusing on containerization and orchestration.
  • The engineer will optimize ML model training and inference performance in collaboration with data scientists and engineers.
  • They will develop and expand a library of training and inference solutions and manage Kubernetes and Slurm clusters for large-scale ML training.
  • Integration with ML frameworks such as TensorFlow, PyTorch, or MXNet is required to ensure seamless execution of distributed ML training workloads (see the first sketch after this list).
  • The role also involves developing monitoring and logging tools to track distributed training performance and troubleshoot issues (see the second sketch after this list).
  • Automation scripts and tools will be created to streamline ML training workflows using technologies like Ansible, Terraform, or Python.
  • Participation in industry conferences and online forums is expected in order to stay up to date with the latest developments in MLOps, Kubernetes, Slurm, and ML.
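
To give a flavor of the distributed-training work described above, the first sketch below shows how a PyTorch training entrypoint might bootstrap torch.distributed from the environment variables Slurm sets for each task. It is an illustrative outline under stated assumptions (NCCL backend, a rendezvous port of 29500, MASTER_ADDR exported by the job script, a placeholder model), not a prescribed implementation.

    # Illustrative sketch: bootstrap distributed PyTorch training from the
    # environment variables Slurm sets for each task. Port and backend are assumptions.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def init_from_slurm():
        """Derive the torch.distributed rendezvous from Slurm-provided variables."""
        rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
        world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
        local_rank = int(os.environ["SLURM_LOCALID"]) # rank of this task on its node
        # MASTER_ADDR is assumed to be exported by the sbatch script
        # (e.g. the first hostname in $SLURM_JOB_NODELIST).
        os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
        return rank, world_size, local_rank

    if __name__ == "__main__":
        rank, world_size, local_rank = init_from_slurm()
        model = torch.nn.Linear(1024, 10).cuda(local_rank)  # placeholder model
        model = DDP(model, device_ids=[local_rank])
        # ... training loop would go here ...
        dist.destroy_process_group()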
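
The second sketch illustrates the monitoring side of the role: exposing training metrics that Prometheus can scrape from each worker. The metric names, labels, and scrape port (8000) are assumptions for the example, and run_step stands in for whatever training step function a real workload would provide.

    # Illustrative sketch: expose training metrics to Prometheus from a worker.
    # Metric names and the scrape port are assumptions for the example.
    import time
    from prometheus_client import Counter, Gauge, start_http_server

    STEPS = Counter("train_steps_total", "Optimizer steps completed")
    LOSS = Gauge("train_loss", "Most recent training loss")
    STEP_SECONDS = Gauge("train_step_seconds", "Wall-clock time of the last step")

    def training_loop(run_step, num_steps):
        """Run `run_step` (assumed to return a float loss) and record metrics."""
        start_http_server(8000)  # scrape endpoint at :8000/metrics
        for _ in range(num_steps):
            t0 = time.time()
            loss = run_step()
            STEPS.inc()
            LOSS.set(loss)
            STEP_SECONDS.set(time.time() - t0)

    if __name__ == "__main__":
        import random
        training_loop(lambda: random.random(), num_steps=100)  # dummy step for illustration

A Grafana dashboard or alerting rule could then be pointed at these series to spot stragglers or throughput regressions across nodes.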

Requirements:

  • Candidates must have at least 3 years of experience in MLOps, DevOps, or a related field.
  • Strong experience with Kubernetes and containerization technologies such as Docker is required.
  • Experience with cloud providers like AWS, GCP, or Azure is necessary.
  • Familiarity with Slurm or other distributed computing frameworks is essential.
  • Proficiency in Python and experience with ML frameworks such as TensorFlow, PyTorch, or MXNet are required.
  • Knowledge of ML model serving and deployment is necessary.
  • Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI is expected.
  • Experience with monitoring and logging tools such as Prometheus, Grafana, or ELK Stack is required.
  • A solid understanding of distributed computing principles, parallel processing, and job scheduling is necessary.
  • Experience with automation tools like Ansible and Terraform is required.

Benefits:

  • The position offers competitive compensation ranging from $130,000 to $175,000 per year, negotiable based on experience and skills.
  • Full medical benefits and life insurance are provided, with 100% coverage for health, vision, and dental insurance for employees and their families.
  • A 401(k) program with up to a 4% company match is available.
  • Paid time off (PTO) and paid holidays are included.
  • The company offers a flexible remote work environment.
  • Employees are reimbursed up to $85 per month for mobile and internet expenses.
  • The opportunity to work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs, is provided.
  • Employees will be part of a team that operates one of the most powerful commercially available supercomputers.
  • The company contributes to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings.