Remote Staff Site Reliability Engineer

at Wikimedia Foundation

Description:

  • The Wikimedia Foundation is seeking a Staff Site Reliability Engineer (SRE) focused on Machine Learning Infrastructure.
  • This position involves joining a distributed team that operates across UTC -5 to UTC +3 and reporting directly to the Director of Machine Learning, Chris Albon.
  • The primary responsibility is designing, developing, maintaining, and scaling the foundational infrastructure for machine learning.
  • Responsibilities include designing and implementing robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models.
  • The role requires improving the reliability, availability, and scalability of ML infrastructure to ensure efficient workflows for internal ML engineers and researchers.
  • Collaboration with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community is essential to identify infrastructure requirements and resolve operational issues.
  • Proactive monitoring and optimization of system performance, capacity, and security are necessary to maintain high service quality.
  • The role is expected to provide expert guidance and documentation so that teams across Wikimedia can use the ML infrastructure effectively and follow best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering is also part of the role.

Requirements:

  • Candidates must be based in a time zone between UTC -5 and UTC +3 to ensure sufficient working-hours overlap with the team.
  • A minimum of 7 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles is required, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads, including Kubernetes, Docker, GPU acceleration, and distributed training systems, is necessary.
  • Strong proficiency with infrastructure automation and configuration management tools such as Terraform, Ansible, Helm, and Argo CD is required.
  • Experience in implementing observability, monitoring, and logging for ML systems using tools like Prometheus, Grafana, and the ELK stack is essential.
  • Familiarity with popular Python-based ML frameworks, including PyTorch, TensorFlow, and scikit-learn, is expected.
  • Strong English communication skills and comfort working asynchronously across global teams are necessary.

Benefits:

  • The Wikimedia Foundation offers a competitive and equitable salary, with the anticipated annual pay range for applicants in the United States being US$129,347 to US$200,824.
  • Salaries are determined based on multiple individualized factors, including the cost of living in the applicant's location.
  • The organization is remote-first, allowing staff members to work from various countries.
  • The Wikimedia Foundation values diversity and encourages applicants from a wide range of backgrounds to apply.
  • The organization is committed to maintaining an inclusive and equitable workplace.