Remote Site Reliability Engineer, Inference Infrastructure

Posted 5 months ago

Description:

  • Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems.
  • The Site Reliability Engineer will join the Model Serving team, responsible for developing, deploying, and operating the AI platform that delivers Cohere's large language models through API endpoints.
  • The role involves building high-performance, scalable, and reliable machine learning systems for advanced NLP applications.
  • Responsibilities include building self-service systems for managing and deploying services, automating environment observability and resilience, ensuring that defined SLOs are met, and developing strong relationships with internal developers.
  • The engineer will also participate in an on-call rotation and contribute to team development through knowledge sharing and active review processes.

Requirements:

  • Candidates should have 5+ years of engineering experience running production infrastructure at large scale.
  • Experience in designing large, highly available distributed systems with Kubernetes and GPU workloads is required.
  • Proficiency in Kubernetes development, production coding, and production support is necessary.
  • Familiarity with cloud platforms such as GCP, Azure, AWS, and OCI, as well as multi-cloud, on-prem, and hybrid serving, is essential.
  • Candidates must have experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments.
  • Knowledge of compute, storage, and network resource and cost management is required.
  • Excellent collaboration and troubleshooting skills are necessary for building mission-critical systems.
  • Candidates should possess the grit and adaptability to solve complex technical challenges.
  • Familiarity with computational characteristics of accelerators (GPUs, TPUs, etc.) is preferred.
  • A strong understanding of, or working experience with, distributed systems is required.
  • Proficiency in Golang, C++, or other high-performance scalable server languages is necessary.

Benefits:

  • Cohere offers an open and inclusive culture and work environment.
  • Employees work closely with a team on the cutting edge of AI research.
  • A weekly lunch stipend, in-office lunches, and snacks are provided.
  • Full health and dental benefits are included, along with a separate budget for mental health care.
  • Employees receive a 100% parental leave top-up for up to 6 months.
  • Personal enrichment benefits are available for arts and culture, fitness and well-being, quality time, and workspace improvement.
  • The position is remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, as well as a co-working stipend.
  • Employees enjoy 6 weeks of vacation (30 working days).
