Description:

Take end-to-end responsibility for critical reliability areas as a Senior Site Reliability Engineer in the Platform Squad.
Lead architectural decisions, mentor team members, and continuously raise the reliability standards within the team.

Requirements:

5+ years of hands-on experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
Proven track record of building and operating highly available, high-throughput systems in production.
Deep production-level experience with Kubernetes on major hyperscalers.
Strong experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Loki, ELK) and a clear understanding of SLIs, SLOs, and error budgets.
Solid software development skills in Go (preferred) or Python.
Hands-on experience with Infrastructure as Code (e.g., Pulumi, OpenTofu, Terraform), GitOps (e.g., ArgoCD), and CI/CD pipeline design.
Proven ability to lead complex infrastructure initiatives from design to production.
Experience mentoring engineers and elevating the technical level within a team.
Strong communication skills and fluent English.
Willingness to participate in on-call duties to ensure platform reliability.

Responsibilities:

Co-own the architecture and development of cloud infrastructure on Azure and Kubernetes clusters.
Drive the resilience strategy for global scaling, zero-downtime deployments, and disaster recovery.

Skills