Remote Staff Site Reliability Engineer

Description:

The Wikimedia Foundation is seeking a Staff Site Reliability Engineer (SRE) focused on Machine Learning Infrastructure.
This position involves joining a distributed team that operates across UTC -5 to UTC +3 and reporting directly to the Director of Machine Learning, Chris Albon.
The primary responsibility is to design, develop, maintain, and scale the foundational infrastructure for machine learning.
This includes designing and implementing robust ML infrastructure for training, deploying, monitoring, and scaling machine learning models.
The role requires improving the reliability, availability, and scalability of ML infrastructure to ensure efficient workflows for internal ML engineers and researchers.
Collaboration with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community is essential to identify infrastructure requirements and resolve operational issues.
Proactive monitoring and optimization of system performance, capacity, and security are necessary to maintain high service quality.
The role is expected to provide expert guidance and documentation so that teams across Wikimedia can use the ML infrastructure effectively and follow best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering is also part of the role.

Candidates must be based within the UTC -5 to UTC +3 time zones to ensure sufficient working-hours overlap with the team.
A minimum of 7 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles is required, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads, including Kubernetes, Docker, GPU acceleration, and distributed training systems, is necessary.
Strong proficiency with infrastructure automation and configuration management tools such as Terraform, Ansible, Helm, and Argo CD is required.
Experience in implementing observability, monitoring, and logging for ML systems using tools like Prometheus, Grafana, and the ELK stack is essential.
Familiarity with popular Python-based ML frameworks, including PyTorch, TensorFlow, and scikit-learn, is expected.
Strong English communication skills and comfort working asynchronously across global teams are necessary.

The Wikimedia Foundation offers a competitive and equitable salary; the anticipated annual pay range for applicants in the United States is US$129,347 to US$200,824.
Salaries are determined based on multiple individualized factors, including cost of living in the location.
The organization is remote-first, allowing staff members to work from various countries.
The Wikimedia Foundation values diversity and encourages applicants from a wide range of backgrounds to apply.
The organization is committed to maintaining an inclusive and equitable workplace.
