The Hendrix ML Platform team is focused on developing a robust platform for training and serving machine learning models across Spotify.
This platform aims to streamline the productionization of AI and ML models by reducing the complexities involved in creating backend services for serving predictions and training models.
Responsibilities include managing and maintaining large-scale production Kubernetes clusters for ML workloads, covering both the ML platform infrastructure and the necessary DevOps work.
The role involves contributing to the Spotify ML Platform SDK and building tools for various ML operations.
Collaboration with Machine Learning Engineers (MLEs), researchers, and product teams is essential to deliver scalable ML platform tooling that meets timelines and specifications.
The position requires working independently and collaboratively on squad projects, often necessitating the learning and application of new technologies.
The engineer will design, document, and implement reliable, testable, and maintainable solutions for ML infrastructure capabilities.
Requirements:
Candidates must have 3+ years of hands-on experience implementing production ML infrastructure at scale using Python, Go, or similar languages.
At least 3 years of experience working with a public cloud provider such as GCP, AWS, or Azure is required, with a preference for GCP.
Knowledge of deep learning fundamentals, algorithms, and open-source tools such as Hugging Face, Ray, PyTorch, or TensorFlow is necessary.
An understanding of distributed training leveraging GPUs and Kubernetes is considered a nice-to-have.
A general understanding of data processing for ML is required.
Experience with agile software processes and modular code design following industry standards is essential.
Benefits:
This role is based in Toronto, which serves as the location for in-person meetings while still allowing you to work from home.
The company offers the flexibility to work where you are most productive, accommodating both remote and in-office arrangements.