Remote DevOps Lead - Bare-Metal & GPU Infrastructure (Linux)

at Runware

Description:

  • Runware is seeking a full-time remote DevOps Lead for Bare-Metal & GPU Infrastructure (Linux).
  • The successful candidate will ensure 99.999% service availability and optimize infrastructure usage and scaling while deploying code across hundreds of Linux GPU servers in multiple data-center locations.
  • Responsibilities include designing and automating high-availability architectures, building zero-touch CI/CD pipelines, managing bare-metal lifecycle processes, implementing Kubernetes on bare metal, ensuring observability at scale, leading incident command, scripting server bring-up, maintaining security and compliance, and mentoring a small SRE/DevOps team.
  • In the first 12 months, the candidate will aim to cut deployment latency, maintain minimal user-visible downtime, automate server bring-up, reduce incidents, and deliver auditable change pipelines.
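
To put the 99.999% availability target in concrete terms, the arithmetic can be sketched as below. This is only an illustration: the function name `downtime_budget_seconds` and the choice of a 365-day year and a 30-day month are assumptions made here, not something the posting defines, and the posting does not say over which window availability is measured.

```python
def downtime_budget_seconds(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the seconds of downtime allowed in a period at a given availability.

    Hypothetical helper for illustration; the measurement window is an assumption.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    # The allowed downtime is the unavailable fraction of the period.
    return period_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for label, days in (("365-day year", 365.0), ("30-day month", 30.0)):
        budget = downtime_budget_seconds(99.999, days)
        print(f"99.999% over one {label}: about {budget:.1f} s of downtime")
```

Under these assumptions, the budget works out to roughly 315 seconds (a little over five minutes) per year, or about 26 seconds per 30-day month.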

Requirements:

  • Candidates must have 5+ years of Linux SRE/DevOps experience with fleets of 100+ bare-metal nodes, including at least 2 years in a technical lead role.
  • Deep knowledge of NVIDIA/AMD GPU servers, high-speed interconnects, and NVMe/RDMA storage is required.
  • A proven record of sustaining ≥ 99.999% uptime in latency-sensitive environments is essential.
  • Expertise in Kubernetes on bare metal, advanced CNI, and custom schedulers is necessary.
  • Strong programming skills in Go or Python, along with Bash, are required.
  • Mastery of Infrastructure-as-Code tools such as Terraform, Ansible, and Packer, as well as GitOps workflows, is needed.
  • Experience with monitoring and alerting stacks and with chaos testing is important, as is clear architectural thinking.
  • Candidates should possess strong documentation skills and the ability to communicate calmly under pressure.

Benefits:

  • The position offers generous paid time off, including vacation, sick days, and public holidays.
  • Employees will receive meaningful stock options to share in the company's success.
  • The role is remote-first: employees can work from home anywhere the company is able to employ them.
  • Flexible working hours are provided, enabling employees to own their schedules outside of core collaboration times.
  • Family leave is available, including paid maternity, paternity, and caregiver time.
  • The company organizes retreats twice a year in inspiring locations for team gatherings and celebrations.