CloudWalk is a fintech company focused on reimagining financial services through AI, blockchain, and thoughtful design.
The company is seeking a Machine Learning Engineer to manage and enhance its distributed training pipeline for large language models.
The role involves working within a GPU cluster to assist researchers in training and scaling foundation models using frameworks such as Hugging Face Transformers, Accelerate, DeepSpeed, and FSDP.
Responsibilities include owning the architecture and maintenance of the distributed training pipeline, training LLMs, designing and debugging multi-node/multi-GPU training runs, optimizing training performance, managing experiment tracking, and building reusable training templates.
The position emphasizes building and scaling systems to facilitate efficient, resilient, and reproducible training runs, rather than conducting research.
Requirements:
Expertise in distributed training is required, including hands-on experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups.
A strong background in PyTorch is necessary, including the ability to write custom training loops, schedulers, or callbacks.
Familiarity with the Hugging Face stack, including Transformers, Datasets, and Accelerate, is essential.
Candidates need infrastructure literacy: an understanding of how GPUs, containers, and job schedulers interact, and the ability to debug cluster issues and memory bottlenecks.
A resilience mindset is important, with the ability to write code that can checkpoint, resume, and log correctly under failure conditions.
The ideal candidate is a collaborative builder who is willing to improve others' scripts and enhance training efficiency.
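To make the resilience requirement above concrete, here is a minimal sketch of a checkpoint-and-resume loop in Python. It is an illustration under stated assumptions, not CloudWalk's pipeline: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state format, and the `fail_at` parameter are all hypothetical, and the arithmetic stands in for a real training step. An actual run would more likely persist model and optimizer state with something like `torch.save` or Accelerate's `save_state`.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically: temp file in the same directory, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem, so a crash never
        # leaves a half-written checkpoint at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, save_every: int = 10, fail_at=None) -> dict:
    """Run (or resume) a toy training loop, checkpointing every `save_every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            # Hypothetical hook that simulates a node failure mid-run.
            raise RuntimeError("simulated node failure")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

Calling `train` again with the same path after a failure picks up from the last saved step rather than from zero. Writing to a temporary file and renaming it is one common way to keep the checkpoint on disk always either the old or the new version.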
Benefits:
The position offers the opportunity to work with cutting-edge technology in a dynamic fintech environment.
Employees will have the chance to collaborate closely with researchers and other teams to drive innovation in machine learning.
The role provides a platform for professional growth in distributed training and large language model development.
CloudWalk promotes a culture of resilience and collaboration, fostering an environment where team members can learn from each other and improve their skills.