Core & MLOps Team Lead - Remote

at Zyte

Posted 2 weeks ago

Description:

  • Zyte is seeking an experienced Team Lead to manage the Core & MLOps Squad, responsible for building the infrastructure that powers Zyte at scale.
  • This is a hands-on technical leadership role that requires expertise in MLOps, systems programming, and orchestration.
  • The role involves designing and evolving the core platform, including Kubernetes, Mesos, GPU scheduling/autoscaling, and distributed compute.
  • The Team Lead will own the model platform, which includes registry, experiment tracking, training orchestration, evaluation, serving, and monitoring.
  • Responsibilities include building the Golden Path, which consists of reference repos, a scaffold CLI, opinionated CI/CD pipelines, and production-ready defaults.
  • The Team Lead will operate a secure, multi-tenant model registry and training platform with standardized experiment/evaluation harnesses.
  • The role includes providing turnkey serving patterns, drift/quality monitoring, and rollback playbooks.
  • The Team Lead will integrate public/open-source AI capabilities as managed platform services with cost and data-governance guardrails.
  • The position requires running the squad, including roadmap/prioritization, delivery, mentoring, and maintaining high engineering standards.
  • The Team Lead will partner with product engineering, Prod Ops, and Security on adoption and rollout plans, and foster a platform-thinking mindset across the team.
  • Ownership areas include container orchestration, GPU provisioning & autoscaling, environment & secret management, observability, billing pipeline, and reliability enablement.

Requirements:

  • At least 5 years of experience building distributed systems and at least 3 years in MLOps/ML platform engineering (or equivalent impact) are required.
  • Knowledge of Linux/OS internals, networking, concurrency, and performance profiling is essential.
  • A deep understanding of Kubernetes is required; knowledge of Mesos is a bonus.
  • Proficiency in developing high-performance services in Java, Rust, Go, or C++, along with strong Python skills, is necessary.
  • Experience with GPU infrastructure, including scheduling, containerization, and optimization, is required.
  • A track record of designing and operating model platforms in production is essential.
  • Demonstrated success in leading technical teams and implementing organization-wide platform solutions is required.

Benefits:

  • Zyte fosters and nurtures new ideas and brings them to market.
  • Employees become part of a self-motivated, progressive, multi-cultural team.
  • The company offers the freedom and flexibility to work remotely from anywhere.
  • Employees have the opportunity to work with cutting-edge open-source technologies and tools.
