The role is for a Principal Network Engineer at SF Compute, focusing on designing and managing networks for large-scale GPU clusters globally.
Responsibilities include network architecture design, automated fabric provisioning and validation, performance monitoring and optimization, and interconnecting high-performance distributed storage systems.
The ideal candidate will design a 400GbE spine-leaf network from scratch and automate its deployment across a 20k-node cluster.
In-person work is preferred at the San Francisco office, but remote applicants will also be considered.
Requirements:
Candidates must have previous experience with technologies related to HPC and GPU networks, including RoCEv2, InfiniBand, eBGP, EVPN/VXLAN, QoS, and ACLs.
Experience in architecting resilient high-performance networks, including fat-tree and multi-layer spine-leaf topologies, is required.
Prior experience with network automation using tools such as Ansible, Bash, and Python is necessary.
Candidates should be comfortable operating, and have opinions about, networking hardware and platforms from Arista, Cisco, Dell, and Juniper, as well as OCP and SONiC.
Comfort with configuring Next-Generation Firewalls (NGFWs) is essential.
A strong appreciation for good documentation is required.
Benefits:
Team members receive a generous equity grant along with a competitive salary.
The company sponsors visas and work permits for eligible candidates.
The company matches 401(k) contributions up to 4%.
Comprehensive medical, dental, and vision insurance is provided, covering 100% of premiums for employees and their dependents.
Employees enjoy unlimited paid time off and over 10 observed holidays.
Paid parental leave is offered for biological, adoptive, and foster parents.
Daily lunch is covered for employees.
There is an unlimited budget for purchasing books for the office.