Remote Site Reliability Engineering (SRE) Engineer
Description:
dLocal enables the biggest companies in the world to collect payments in 40 countries in emerging markets.
Global brands rely on dLocal to increase conversion rates and simplify payment expansion effortlessly.
As a Site Reliability Engineering (SRE) Engineer, you will focus on the design and implementation of systems that are highly resilient, scalable, and reliable.
You will work on mission-critical applications with big customers like Netflix, Amazon, Nike, and Facebook.
Your responsibilities will include developing quality gates based on production-level service level objectives (SLOs) to detect issues earlier in the development cycle.
You will automate build testing and validation using service-level indicators (SLIs) and SLOs.
You will influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development.
You will design processes, playbooks, and checklists for other engineers to follow during and after incidents.
You will write postmortems and perform technical after-action reviews to understand root causes and propose system improvements.
You will interact with members from almost all teams across the business to understand their monitoring, alerting, and SLO/SLA requirements.
You will automate the provisioning of monitoring tools and rules using tools such as Terraform and Ansible or Chef.
You will design baseline requirements for new and existing services to ensure consistent and accurate monitoring of dLocal infrastructure and code.
You will monitor both the technical and security health of dLocal infrastructure and systems.
You will optimize the signal-to-noise ratio for alerting to ensure only actionable alerts are received.
Requirements:
You must have over 3 years of experience as an SRE Engineer or in a very similar role.
Experience with monitoring tools such as New Relic, Datadog, and Nagios is required.
You should have experience working with tools such as Jira, PagerDuty, and Confluence, and integrating them into automated workflows through their APIs.
Familiarity with CI/CD tools such as GitHub Actions, Jenkins, Spinnaker, ArgoCD, or similar is necessary.
Knowledge of security best practices and infosec tooling is essential, as you will be writing systems to monitor for breaches and vulnerabilities.
Strong communication skills are required.
You should possess problem-solving skills and be detail-oriented.
A highly analytical mindset is necessary.
The ability to collaborate across multi-functional teams is important.
Cloud experience, particularly with AWS, is highly advantageous as most systems will integrate with AWS at some level.
Experience with Infrastructure as Code (IaC) using a tool like Terraform is highly advantageous.
Configuration as Code (CaC) experience with a tool like Ansible, Chef, or Salt is highly advantageous.
Database knowledge, particularly in terms of performance and SQL syntax, is highly advantageous.
Benefits:
You will be part of a flexible, dynamic, remote-first culture with travel, health, and learning benefits.
Working with a global team of 900+ teammates from 25+ different nationalities will provide an international career development opportunity.
You will have the chance to impact millions of people's daily lives through your work.
dLocal fosters a culture of building and facing challenges, which will allow you to thrive in your role.