CloudWalk is a fintech company focused on reimagining the future of financial services through AI, blockchain, and thoughtful design.
The company is seeking an MLOps Engineer to build ML infrastructure that scales dynamically from dozens to thousands of GPUs.
The role involves working closely with researchers and engineers to design systems for training, evaluating, and monitoring machine learning models at scale.
Responsibilities include building and maintaining ML pipelines for data processing, training, evaluation, and model deployment.
The engineer will orchestrate batch and training jobs in Kubernetes, handling retries, failures, and resource constraints.
The position requires designing systems that can scale dynamically from small GPU jobs to thousands of GPUs on demand.
Collaboration with researchers to productionize experiments into reproducible workflows is essential.
The engineer will implement model serving endpoints and integrate with internal tooling.
Setting up monitoring, logging, and KPI tracking for ML pipelines and compute jobs is part of the job.
Automating CI/CD and infrastructure provisioning for ML workloads is required.
The role includes managing experiment tracking, model versioning, and metadata with tools such as MLflow or Weights & Biases (W&B).
The engineer is also expected to support model serving infrastructure that internal UIs or tools may use in the future.
Requirements:
Strong experience with Kubernetes, specifically orchestrating jobs and managing training workloads, including GPU scheduling, job retries, and Helm-based deployments.
Proficiency in Python for writing scripts and services to automate processes.
Familiarity with ML workflows, including data preprocessing, training, evaluation, and deployment pipelines.
Ability to expose models via FastAPI, TorchServe, or equivalent serving stacks.
Strong command of Linux and debugging compute-heavy jobs.
Experience with ML metadata systems such as MLflow, W&B, or Neptune.
Ability to work alongside AI assistants and agents.
Strong communication skills in both English and Portuguese.
Benefits:
The company promotes a welcoming work environment that values diversity and inclusion.
Employees are encouraged to be authentic, regardless of gender, ethnicity, race, religion, sexuality, mobility, disability, or education.
Recruiting process:
The recruiting process includes an online assessment, a technical project essay, a technical interview, and a cultural interview.
Candidates should be prepared for an online quiz as part of the application process.