The role involves building scalable AI infrastructure and managing complex multi-cluster Kubernetes deployments across five distinct environments: NMS, Sandbox, Development, Staging, and Production.
The candidate will design systems for production readiness while ensuring security and operational excellence.
Responsibilities include managing multi-environment Kubernetes architecture, designing redundancy and failover mechanisms for the centralized NMS hub, and developing Pulumi-based infrastructure using Python.
The role requires managing complex cross-environment dependencies, automating resource provisioning, and implementing zero-trust security measures.
The candidate will deploy and configure observability tools such as Prometheus, Grafana, and CloudWatch, and design alerting and incident response procedures.
The position also involves managing a centralized API for all environments and optimizing resource utilization across node groups.
Requirements:
The candidate must have 5+ years of experience in DevOps, SRE, or infrastructure engineering.
Expert-level Kubernetes experience with EKS and multi-cluster management is required.
Strong Python programming skills for infrastructure automation and API development are essential.
Expertise in Infrastructure as Code with Pulumi, Terraform, or similar tools is necessary.
Deep knowledge of AWS networking and services, including VPC, EKS, ECR, S3, CloudWatch, and IAM, is required.
Experience with Linux system administration and containerization with Docker is needed.
Hands-on experience with Prometheus, Grafana, and centralized logging systems is a must.
The candidate should have network security experience, including VPN, firewalls, and certificate management, along with an understanding of zero-trust architecture principles.
Nice-to-have qualifications include experience with machine learning infrastructure, HashiCorp Vault administration, GitOps, service mesh technologies, database administration, and CI/CD pipeline design.
Benefits:
The position offers a competitive base salary and a performance-based bonus tied to goal achievement.
Equity participation is included as part of the compensation package.
Comprehensive benefits are provided, including health, dental, vision, and paid time off.
A flexible work environment is available, with hybrid or remote arrangements considered.
There is an option to start on a contract basis with the potential for full-time hire.