Remote Sr. Site Reliability Engineer at Varo Bank

Description:

Varo's SRE team is responsible for designing, building, and running large-scale, distributed, fault-tolerant systems that power most of Varo's operations.
The team focuses on AWS and Kubernetes, maintaining an open-source-first and results-oriented mindset.
Members of the team strive to automate manual tasks and promote a data-driven approach to scaling the platform.
Daily activities include scaling production infrastructure, building CI/CD pipelines, and collaborating with developers to enhance operations.
Responsibilities include taking ownership of the availability and resiliency of Varo's cloud-based infrastructure, designing disaster recovery scenarios, and implementing self-healing patterns.
The role involves writing and maintaining infrastructure as code using Terraform and Helm charts for Kubernetes, as well as building and maintaining CI/CD pipelines.
The engineer will improve observability and monitoring by implementing advanced tools and technologies and by creating monitoring dashboards, alerts, and logging systems.
The position requires leading high-profile incidents and facilitating blameless post-mortems.
Collaboration with development teams to implement and improve SLIs and SLOs is essential, along with using monitoring data to drive actionable insights.
The engineer will automate operational tasks, write clean and scalable scripts, and manage platform infrastructure and applications.

A minimum of 8 years of experience as a Site Reliability, DevOps, or Software Engineer is required, along with proficiency in a high-level programming language such as Python, Go, Ruby, Java, or JavaScript.
Excellent Linux and troubleshooting skills are necessary.
Experience in building and supporting high-availability cloud environments in AWS is essential.
Proficiency in Infrastructure as Code (IaC) and deployment automation using tools like Terraform, Helm, GitLab, or equivalent is required.
Experience running Kubernetes in production is mandatory.
Familiarity with Istio is a plus.
Experience with monitoring, logging, and tracing tools such as Prometheus, Grafana, Jaeger/Tempo, ELK/Loki, and OpenTelemetry is required.
The candidate should have experience instrumenting code in languages like Java/Kotlin, Python, or Go, and creating simple instrumentation frameworks.
Participation in an on-call rotation for after-hours production infrastructure incidents is expected.
Experience with the Software Development Life Cycle (SDLC), CI/CD, and related tooling is necessary.
Kafka experience is a plus.

The salary range for this role is between $150,000 and $190,000 per year, based on function, level, and geographic location.
Final offer amounts are determined by multiple factors, including candidate experience and expertise, and may vary from the identified range.