Remote MLOps Engineer

at CloudWalk

Description:

  • CloudWalk is a fintech company focused on reimagining the future of financial services through AI, blockchain, and thoughtful design.
  • The company is seeking an MLOps Engineer to build ML infrastructure that scales dynamically from dozens to thousands of GPUs.
  • The role involves working closely with researchers and engineers to design systems for training, evaluating, and monitoring machine learning models at scale.
  • Responsibilities include building and maintaining ML pipelines for data processing, training, evaluation, and model deployment.
  • The engineer will orchestrate batch and training jobs in Kubernetes, handling retries, failures, and resource constraints.
  • The position requires designing systems that can scale on demand from small GPU jobs to thousands of GPUs.
  • Collaboration with researchers to productionize experiments into reproducible workflows is essential.
  • The engineer will implement model serving endpoints and integrate with internal tooling.
  • Setting up monitoring, logging, and KPI tracking for ML pipelines and compute jobs is part of the job.
  • Automating CI/CD and infrastructure provisioning for ML workloads is required.
  • The role includes managing experiment tracking, model versioning, and metadata with tools like MLflow or W&B.
  • The engineer is also expected to support model serving infrastructure that internal UIs or tools may use in the future.

Requirements:

  • Strong experience with Kubernetes, specifically job orchestration, training workload management, GPU scheduling, job retries, and Helm-based deployments.
  • Proficiency in Python for writing scripts and services to automate processes.
  • Familiarity with ML workflows, including data preprocessing, training, evaluation, and deployment pipelines.
  • Ability to expose models via FastAPI, TorchServe, or equivalent serving stacks.
  • Strong command of Linux and debugging compute-heavy jobs.
  • Experience with ML metadata systems such as MLflow, W&B, or Neptune.
  • Ability to work alongside AI assistants and agents.
  • Strong communication skills in both English and Portuguese.

Benefits:

  • The company promotes a welcoming work environment that values diversity and inclusion.
  • Employees are encouraged to be authentic, regardless of gender, ethnicity, race, religion, sexuality, mobility, disability, or education.

Recruiting process:

  • The recruiting process includes an online assessment, a technical project essay, a technical interview, and a cultural interview.
  • Candidates should be prepared for an online quiz as part of the application process.