Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems.
The Senior Site Reliability Engineer will join the Model Serving team, responsible for developing, deploying, and operating the AI platform that delivers large language models through API endpoints.
The role involves deploying optimized NLP models to production in environments that require low latency, high throughput, and high availability.
The engineer will collaborate with various teams and interface with customers to create customized deployments that meet specific needs.
Requirements:
Candidates should have 5+ years of engineering experience running production infrastructure at large scale.
Experience in designing large, highly available distributed systems using Kubernetes and managing GPU workloads on those clusters is required.
Proficiency in Kubernetes development, production coding, and production support is necessary.
Familiarity with cloud platforms such as GCP, Azure, AWS, and OCI, along with experience in multi-cloud and on-prem/hybrid serving, is essential.
Candidates must have experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments.
Knowledge of compute, storage, and network resource and cost management is required.
Excellent collaboration and troubleshooting skills are necessary to build mission-critical systems and ensure smooth operations.
Candidates should possess the grit and adaptability to solve complex technical challenges that evolve daily.
Familiarity with the computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators) is important, particularly how they affect inference latency and throughput.
A strong understanding of, or working experience with, distributed systems is required.
Proficiency in Golang, C++, or other languages designed for high-performance scalable servers is necessary.
Benefits:
Cohere offers an open and inclusive culture and work environment.
Employees work closely with a team on the cutting edge of AI research.
A weekly lunch stipend, in-office lunches, and snacks are provided.
Full health and dental benefits are included, along with a separate budget for mental health care.
Employees based in Canada, the US, and the UK receive a 100% parental leave top-up for 6 months.
Personal enrichment benefits are available for arts and culture, fitness and well-being, quality time, and workspace improvement.
The position is remote-flexible: Cohere has offices in Toronto, New York, San Francisco, and London, and provides a co-working stipend.