Remote Sr. Site Reliability Engineer at Varo Bank

Description:

Varo's SRE team is responsible for designing, building, and running the large-scale, distributed, fault-tolerant systems that power most of Varo's operations.
The team focuses on AWS and Kubernetes and maintains an open-source-first, results-oriented mindset.
They prioritize automation and observability, aiming to reduce manual and remedial tasks.
Daily activities include scaling production infrastructure, building CI/CD pipelines, and collaborating with developers to enhance processes.
Responsibilities include taking ownership of the availability and resiliency of Varo's cloud-based infrastructure, designing disaster recovery scenarios, and creating self-healing patterns.
The role involves writing and maintaining infrastructure as code using Terraform and Helm charts for Kubernetes, as well as building and maintaining CI/CD pipelines.
The engineer will improve observability and monitoring by creating dashboards, alerts, and log systems, and by implementing advanced tools such as distributed tracing and anomaly detection for better system insight and troubleshooting.
The position requires leading high-profile incidents and facilitating blameless post-mortems.
Collaboration with development teams is essential to implement and improve SLIs and SLOs, promoting service ownership.
The engineer will use monitoring data to drive actionable insights and contribute to incident response strategies.
Automating operational tasks and writing clean, scalable scripts and systems to manage platform infrastructure and applications are key responsibilities.

Requirements:

A minimum of 8 years of experience as a Site Reliability, DevOps, or Software Engineer is required, along with proficiency in one or more high-level programming languages such as Python, Go, Ruby, Java, or JavaScript.
Excellent Linux and troubleshooting skills are essential.
Experience in building and supporting high-availability cloud environments in AWS is necessary.
Proficiency in Infrastructure as Code (IaC) and deployment automation using tools such as Terraform, Helm, GitLab, or equivalent is required.
Experience running Kubernetes in production is mandatory.
Familiarity with Istio is a plus.
Experience with monitoring, logging, and tracing tools such as Prometheus, Grafana, Jaeger/Tempo, ELK/Loki, and OpenTelemetry is required.
The candidate should have experience instrumenting code (Java/Kotlin, Python, Go, etc.) and creating simple instrumentation frameworks for developers.
Participation in an on-call rotation for after-hours production infrastructure incidents is expected.
Experience with the Software Development Life Cycle (SDLC), CI/CD, and related tooling is necessary.
Kafka experience is a plus.

Benefits:

The salary range for this role is $150,000 to $190,000 per year, based on function, level, and geographic location.
Final offer amounts are determined by multiple factors, including candidate experience and expertise, and may vary from the identified range.