Runware is seeking a full-time remote DevOps Lead for Bare-Metal & GPU Infrastructure (Linux).
The successful candidate will ensure 99.999% service availability and optimize infrastructure usage and scaling while deploying code across hundreds of Linux GPU servers in multiple data-center locations.
Responsibilities include designing and automating high-availability architectures, building zero-touch CI/CD pipelines, managing bare-metal lifecycle processes, implementing Kubernetes on bare metal, ensuring observability at scale, leading incident command, scripting server bring-up, maintaining security and compliance, and mentoring a small SRE/DevOps team.
In the first 12 months, the candidate will aim to cut deployment latency, maintain minimal user-visible downtime, automate server bring-up, reduce incidents, and deliver auditable change pipelines.
Requirements:
Candidates must have 5+ years of experience in Linux SRE/DevOps managing fleets of 100+ bare-metal nodes, and at least 2 years in a technical lead role.
Deep knowledge of NVIDIA/AMD GPU servers, high-speed interconnects, and NVMe/RDMA storage is required.
A proven record of sustaining ≥ 99.999% uptime in latency-sensitive environments is essential.
Expertise in Kubernetes on bare metal, advanced CNI, and custom schedulers is necessary.
Strong programming skills in Go or Python, along with Bash, are required.
Mastery of Infrastructure-as-Code tools such as Terraform, Ansible, and Packer, as well as GitOps workflows, is needed.
Experience with monitoring and alerting stacks and with chaos testing, along with clear architectural thinking, is important.
Candidates should possess strong documentation skills and the ability to communicate calmly under pressure.
Benefits:
The position offers generous paid time off, including vacation, sick days, and public holidays.
Employees will receive meaningful stock options to share in the company's success.
The role supports a remote-first setup, allowing employees to work from home anywhere the company can employ them.
Flexible working hours are provided, enabling employees to own their schedules outside of core collaboration times.
Family leave is available, including paid maternity, paternity, and caregiver time.
The company organizes retreats twice a year in inspiring locations for team gatherings and celebrations.