Remote (1016) Staff Site Reliability Engineer

Description:

As a Staff Site Reliability Engineer, you will own and optimize OpenTelemetry pipelines, enabling scalable and efficient observability.
You will build tools that empower teams, support incident response, and drive best practices.
Your work ensures a reliable, secure infrastructure and actionable alerting across the organization.
Daily tasks include designing, implementing, and maintaining observability pipelines across logs, metrics, and traces.
You will optimize ingestion strategies to balance cost, performance, and usability.
You will build self-service automation and tooling that enables development teams to instrument and leverage observability without manual intervention.
You will design processes, playbooks, checklists, and automations for incident management.
You will work with members of various teams to understand their monitoring, alerting, and SLO/SLA requirements.
You will influence architectural decisions during initial design stages to ensure resiliency and scale.
You will leverage Infrastructure-as-Code (IaC) to manage monitoring tools and observability configurations.
You will take full ownership of client infrastructure reliability, ensuring adherence to key availability and security KPIs.

A Bachelor's Degree in Computer Science, Engineering, or a related field is required.
You must have 8+ years of experience working as an SRE or in a similar role focused on observability.
You should have 5+ years of experience working with cloud services, specifically AWS.
5+ years of experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar) is necessary.
You need 4+ years of experience with open-source monitoring and logging tools such as Grafana, Prometheus, Elastic/OpenSearch, Loki, and Tempo.
4+ years of experience working with Kubernetes, including its core components and monitoring best practices, is required.
Strong scripting abilities in Python, Go, or similar languages for automating observability tasks are essential.
Experience managing SLIs and SLOs and working with distributed tracing is necessary.
You should have experience with automated alerting workflows and exposure to OpenTelemetry pipelines.
An advanced level of English is required for effective communication with US clients.

A competitive USD salary is offered, valuing your skills and contributions.
The position allows for 100% remote work, with opportunities to connect with teammates at coworking spaces across LATAM.
Paid time off is provided according to your country’s regulations, allowing you to rest and recharge while receiving your full salary.
National holidays are celebrated, giving you time off to embrace important events and traditions with loved ones.
Sick leave is available to focus on your health without stress.
A refundable annual credit is provided to spend on perks that enhance your work-life balance.
Team-building activities such as coffee breaks, tech talks, and after-work gatherings are organized to foster community.
An extra day off during your birthday week is offered to celebrate with friends and family.

Hiring company

Nearsure

About the job

Posted on

June 19, 2025

Job type

Full-time

Job title

Site Reliability Engineer

Experience level

Staff
