Description:

Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems.
The company values hard work, speed, and customer satisfaction, with a focus on building great products through diverse perspectives.
The Senior ML Systems Engineer will help build, maintain, and evolve the training framework for frontier-scale language models.
This role involves designing and maintaining core components for fast, reliable, and scalable model training, as well as building tooling that connects research ideas to thousands of GPUs.
Responsibilities include owning the training framework for large-scale LLM training, designing distributed training abstractions, improving training throughput and stability, developing monitoring and debugging tools, collaborating with infrastructure teams, resolving performance bottlenecks, and ensuring reproducible large-scale runs.

Requirements:

Candidates should have strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops is required.
Experience with multi-node cluster orchestration tools such as Slurm, Ray, or Kubernetes is necessary.
Comfort in debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines is essential.
Experience with containerized environments like Docker or Singularity/Apptainer is expected.
A track record of building tools that enhance developer velocity for ML teams is important.
Candidates should demonstrate excellent judgment about trade-offs between performance and complexity, and between research velocity and maintainability.
Strong collaboration skills are required, as the role involves working closely with infrastructure, research, and deployment teams.

Employees will work on challenging and consequential ML systems problems.
They will collaborate with a world-class team that operates at scale.
The role carries end-to-end ownership of critical components of the training stack.
Employees will have the chance to shape the next generation of infrastructure for frontier-scale models.
The role includes building tools and systems that directly accelerate research and improve model quality.
Additional perks include an open and inclusive culture, a weekly lunch stipend, full health and dental benefits, a 100% parental leave top-up for up to 6 months, personal enrichment benefits, remote-flexible work options, and 6 weeks of vacation (30 working days).
