Remote Staff Software Engineer, GPU Infrastructure (HPC)

Posted 5 months ago

Description:

  • Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems.
  • The internal infrastructure team builds world-class infrastructure and tools for training, evaluating, and serving Cohere's foundational models.
  • As a Staff Software Engineer, you will work closely with AI researchers to support their AI workload needs, focusing on stability, scalability, and observability.
  • You will be responsible for building and operating superclusters across multiple clouds, directly accelerating the development of industry-leading AI models.
  • All infrastructure roles require participation in a 24x7 on-call rotation; on-call time is compensated.
  • Key responsibilities include building and scaling ML-optimized HPC infrastructure, optimizing for AI/ML training, troubleshooting complex issues, enabling researchers with self-service tools, driving innovation in ML infrastructure, championing best practices, and providing mentorship and collaboration.

Requirements:

  • Candidates should have deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and HPC environments.
  • Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads is essential.
  • Strong programming skills in Python (for ML tooling) and Go (for systems engineering) are required; open-source contributions are preferred.
  • Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads is necessary.
  • A track record of collaborating with AI researchers or ML engineers to solve infrastructure challenges is important.
  • Candidates should demonstrate self-directed problem-solving abilities, including identifying bottlenecks, proposing solutions, and driving impact in a fast-paced environment.

Benefits:

  • Cohere offers an open and inclusive culture and work environment.
  • Employees work closely with a team on the cutting edge of AI research.
  • A weekly lunch stipend, in-office lunches, and snacks are provided.
  • Full health and dental benefits are included, along with a separate budget for mental health care.
  • Employees receive a 100% parental leave top-up for up to 6 months.
  • Personal enrichment benefits are available for arts and culture, fitness and well-being, quality time, and workspace improvement.
  • The position is remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, as well as a co-working stipend.
  • Employees enjoy 6 weeks of vacation (30 working days).
