Remote MLOps Professional Services Engineer (Cloud & AI Infra)

Description:

  • The MLOps Professional Services Engineer will design, implement, and maintain large-scale machine learning training and inference workflows for clients.
  • This role involves working closely with a Solutions Architect and support teams to provide expert guidance on ML pipeline performance and efficiency.
  • Responsibilities include designing and implementing scalable ML workflows using Kubernetes and Slurm, focusing on containerization and orchestration.
  • The engineer will optimize ML model training and inference performance in collaboration with data scientists and engineers.
  • They will develop and expand a library of training and inference solutions and manage Kubernetes and Slurm clusters for large-scale ML training.
  • Integration with ML frameworks such as TensorFlow, PyTorch, or MXNet is required to ensure seamless execution of distributed ML training workloads (see the first sketch after this list).
  • The role also involves developing monitoring and logging tools to track distributed training performance and troubleshoot issues (see the second sketch after this list).
  • Automation scripts and tools will be created to streamline ML training workflows using technologies like Ansible, Terraform, or Python.
  • Participation in industry conferences and online forums is expected in order to stay up to date with the latest developments in MLOps, Kubernetes, Slurm, and ML.
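
To give a flavor of the distributed-training work described above, the first sketch below shows how a PyTorch training entrypoint might bootstrap torch.distributed from the environment variables Slurm sets for each task. It is an illustrative outline under stated assumptions (NCCL backend, a rendezvous port of 29500, MASTER_ADDR exported by the job script, a placeholder model), not a prescribed implementation.

    # Illustrative sketch: bootstrap distributed PyTorch training from the
    # environment variables Slurm sets for each task. Port and backend are assumptions.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def init_from_slurm():
        """Derive the torch.distributed rendezvous from Slurm-provided variables."""
        rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
        world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
        local_rank = int(os.environ["SLURM_LOCALID"]) # rank of this task on its node
        # MASTER_ADDR is assumed to be exported by the sbatch script
        # (e.g. the first hostname in $SLURM_JOB_NODELIST).
        os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
        return rank, world_size, local_rank

    if __name__ == "__main__":
        rank, world_size, local_rank = init_from_slurm()
        model = torch.nn.Linear(1024, 10).cuda(local_rank)  # placeholder model
        model = DDP(model, device_ids=[local_rank])
        # ... training loop would go here ...
        dist.destroy_process_group()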
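
The second sketch illustrates the monitoring side of the role: exposing training metrics that Prometheus can scrape from each worker. The metric names, labels, and scrape port (8000) are assumptions for the example, and run_step stands in for whatever training step function a real workload would provide.

    # Illustrative sketch: expose training metrics to Prometheus from a worker.
    # Metric names and the scrape port are assumptions for the example.
    import time
    from prometheus_client import Counter, Gauge, start_http_server

    STEPS = Counter("train_steps_total", "Optimizer steps completed")
    LOSS = Gauge("train_loss", "Most recent training loss")
    STEP_SECONDS = Gauge("train_step_seconds", "Wall-clock time of the last step")

    def training_loop(run_step, num_steps):
        """Run `run_step` (assumed to return a float loss) and record metrics."""
        start_http_server(8000)  # scrape endpoint at :8000/metrics
        for _ in range(num_steps):
            t0 = time.time()
            loss = run_step()
            STEPS.inc()
            LOSS.set(loss)
            STEP_SECONDS.set(time.time() - t0)

    if __name__ == "__main__":
        import random
        training_loop(lambda: random.random(), num_steps=100)  # dummy step for illustration

A Grafana dashboard or alerting rule could then be pointed at these series to spot stragglers or throughput regressions across nodes.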

Requirements:

  • Candidates must have at least 3 years of experience in MLOps, DevOps, or a related field.
  • Strong experience with Kubernetes and containerization technologies such as Docker is required.
  • Experience with cloud providers like AWS, GCP, or Azure is necessary.
  • Familiarity with Slurm or other distributed computing frameworks is essential.
  • Proficiency in Python and experience with ML frameworks such as TensorFlow, PyTorch, or MXNet are required.
  • Knowledge of ML model serving and deployment is necessary.
  • Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI is expected.
  • Experience with monitoring and logging tools such as Prometheus, Grafana, or ELK Stack is required.
  • A solid understanding of distributed computing principles, parallel processing, and job scheduling is necessary.
  • Experience with automation tools like Ansible and Terraform is required.

Benefits:

  • The position offers competitive compensation ranging from $130,000 to $175,000 per year, negotiable based on experience and skills.
  • Full medical benefits and life insurance are provided, with 100% coverage for health, vision, and dental insurance for employees and their families.
  • A 401(k) program with up to a 4% company match is available.
  • Paid time off (PTO) and paid holidays are included.
  • The company offers a flexible remote work environment.
  • Employees are reimbursed up to $85 per month for mobile and internet expenses.
  • The opportunity to work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs, is provided.
  • Employees will be part of a team that operates one of the most powerful commercially available supercomputers.
  • The company contributes to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings.