Remote Senior Site Reliability Engineer, Model Serving Infrastructure

at Cohere

Posted 3 days ago | 1 applied

Description:

  • Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems.
  • The Senior Site Reliability Engineer will join the Model Serving team, responsible for developing, deploying, and operating the AI platform that delivers large language models through API endpoints.
  • The role involves deploying optimized NLP models to production in environments that require low latency, high throughput, and high availability.
  • The engineer will collaborate with various teams and interface with customers to create customized deployments that meet specific needs.

Requirements:

  • Candidates should have 5+ years of engineering experience running production infrastructure at large scale.
  • Experience in designing large, highly available distributed systems using Kubernetes and managing GPU workloads on those clusters is required.
  • Proficiency in developing for Kubernetes, writing production code, and providing production support is necessary.
  • Familiarity with cloud platforms such as GCP, Azure, AWS, and OCI, and experience with multi-cloud, on-prem, or hybrid serving, is essential.
  • Candidates must have experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments.
  • Knowledge of compute, storage, and network resource and cost management is required.
  • Excellent collaboration and troubleshooting skills are necessary to build mission-critical systems and ensure smooth operations.
  • Candidates should possess the grit and adaptability to solve complex technical challenges that evolve daily.
  • Familiarity with the computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators) is important, particularly how they affect inference latency and throughput.
  • A strong understanding of, or working experience with, distributed systems is required.
  • Proficiency in Golang, C++, or other languages designed for high-performance scalable servers is necessary.

Benefits:

  • Cohere offers an open and inclusive culture and work environment.
  • Employees work closely with a team on the cutting edge of AI research.
  • A weekly lunch stipend, in-office lunches, and snacks are provided.
  • Full health and dental benefits are included, along with a separate budget for mental health care.
  • Employees based in Canada, the US, and the UK receive a 100% parental leave top-up for 6 months.
  • Personal enrichment benefits are available for arts and culture, fitness and well-being, quality time, and workspace improvement.
  • The position is remote-flexible, with offices in Toronto, New York, San Francisco, and London, along with a co-working stipend.
  • Employees enjoy 6 weeks of vacation.