Remote Machine Learning Engineer (Distributed Training)

at CloudWalk


Description:

  • CloudWalk is a fintech company focused on reimagining financial services through AI, blockchain, and thoughtful design.
  • The company is seeking a Machine Learning Engineer to manage and enhance its distributed training pipeline for large language models.
  • The role involves working within a GPU cluster to assist researchers in training and scaling foundation models using frameworks such as Hugging Face Transformers, Accelerate, DeepSpeed, and FSDP.
  • Responsibilities include owning the architecture and maintenance of the distributed training pipeline, training LLMs, designing and debugging multi-node/multi-GPU training runs, optimizing training performance, managing experiment tracking, and building reusable training templates.
  • The position focuses on building and scaling systems that make training runs efficient, resilient, and reproducible; it is not a research role.

Requirements:

  • Expertise in distributed training is required, including hands-on experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups.
  • A strong background in PyTorch is necessary, including the ability to write custom training loops, schedulers, or callbacks.
  • Familiarity with the Hugging Face stack, including Transformers, Datasets, and Accelerate, is essential.
  • Candidates must have infrastructure literacy: an understanding of how GPUs, containers, and job schedulers interact, and the ability to debug cluster issues and memory bottlenecks.
  • A resilience mindset is important: candidates should write code that checkpoints, resumes, and logs correctly under failure conditions.
  • The ideal candidate is a collaborative builder who is willing to improve others' scripts and make training more efficient.

Benefits:

  • The position offers the opportunity to work with cutting-edge technology in a dynamic fintech environment.
  • Employees will have the chance to collaborate closely with researchers and other teams to drive innovation in machine learning.
  • The role provides a platform for professional growth in distributed training and large language model development.
  • CloudWalk promotes a culture of resilience and collaboration, fostering an environment where team members can learn from each other and improve their skills.