Remote Contractor: Senior-Level Site Reliability Engineering Services (Brazil or Argentina) at Newsela

Description:

The position is for a contractor based in Brazil or Argentina to provide Senior-Level Site Reliability Engineering Services.
The contractor will be on an on-call rotation to respond to incidents impacting Newsela.com availability and provide support for developers during incidents.
Responsibilities include maintaining and extending infrastructure using Terraform, GitHub Actions CI/CD, Prefect, and AWS services.
The contractor will build monitoring systems that alert on symptoms rather than outages using tools like Datadog, Sentry, and CloudWatch.
They will seek to automate repeatable manual actions to reduce toil and improve operational processes such as deployments, releases, and migrations with fault tolerance in mind.
The role involves designing, building, and maintaining core cloud infrastructure on AWS and GCP to support thousands of concurrent users.
The contractor will debug production issues across services and levels of the stack and provide infrastructure and architectural planning support as an embedded team member.
They will plan the growth of Newsela’s infrastructure and influence the product roadmap to enhance the resiliency and reliability of the Newsela product.
Proactive efficiency and capacity planning will be required to set clear requirements and reduce system resource usage.
The contractor will identify non-scaling parts of the system, provide immediate solutions, and drive long-term resolutions.
They will identify Service Level Indicators (SLIs) to align the team with availability and latency objectives and maintain awareness of stage group plans and priorities.

Requirements:

A minimum of 5 years of experience in site reliability engineering is required.
Advanced knowledge of Terraform syntax and CI/CD configuration, pipelines, and jobs is necessary.
Experience managing DAG tooling and data pipelines, such as Airflow, Dagster, or Prefect, is essential.
Candidates must have advanced knowledge and experience in maintaining data pipeline infrastructure and large-scale data migrations.
Proficiency in cloud infrastructure services, specifically AWS and GCP, is required.
Familiarity with container orchestration technologies, including ECS, Kubernetes, and Docker, is necessary.
Experience with service catalog metrics and alert recording rules using tools like Datadog, New Relic, Sentry, and CloudWatch is required.
Candidates should have experience with log shipping pipelines and incident debugging visualizations.
Familiarity with Linux operating system configuration, package management, and Bash/CLI scripting is essential.
Knowledge of block and object storage configuration and debugging is required.
The ability to identify significant projects that improve reliability, cost savings, or revenue is necessary.
Candidates must be able to identify architectural changes from reliability, performance, and availability perspectives using a data-driven approach.
Experience leading initiatives, from problem definition through design and planning, using epics and blueprints is required.
Deep domain knowledge and the ability to share that knowledge through documentation and presentations are essential.
Candidates should be able to perform blameless Root Cause Analyses (RCAs) on incidents and outages.

Benefits:

Please note that given the nature of the contract, this role will not be eligible to participate in company-sponsored benefits.