Remote Site Reliability Engineer (SRE) at Air Apps

Description:

As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the reliability, availability, and scalability of our systems.
You will work at the intersection of software development and operations, implementing automation, monitoring, and performance optimization strategies to minimize downtime and improve system resilience.
Your responsibilities will include designing and implementing scalable, reliable, and fault-tolerant systems across cloud environments.
You will develop and maintain observability tools, including monitoring, logging, and alerting (e.g., Prometheus, Grafana, Datadog, ELK).
You will automate infrastructure provisioning, deployment, and incident response using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
You will optimize system performance, scalability, and incident response workflows to improve uptime.
You will work closely with development and DevOps teams to improve system design for reliability.
You will conduct root cause analysis (RCA) and implement preventative measures to minimize failures.
You will ensure high availability by designing and maintaining load balancing, failover, and disaster recovery strategies.
You will improve CI/CD pipelines to increase deployment speed while maintaining stability.
You will optimize cloud cost and resource utilization for AWS, Azure, or Google Cloud Platform (GCP).
You will participate in on-call rotations to address system failures quickly and minimize downtime.

Requirements:

You should have at least 4 years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
A strong knowledge of cloud platforms (AWS, Azure, or GCP) and cloud-native architectures is required.
Experience with observability and monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic) is necessary.
Proficiency in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Pulumi is essential.
Hands-on experience with containerization and orchestration (Docker, Kubernetes, Helm) is required.
Strong Linux system administration and networking fundamentals are necessary.
Experience with incident management, debugging, and root cause analysis is expected.
Proficiency in scripting (Bash, Python, or Go) for automation and system monitoring is required.
Knowledge of load balancing, failover strategies, and distributed systems is essential.
An understanding of security best practices, access control, and compliance requirements is necessary.
Strong communication skills and the ability to collaborate with cross-functional teams are required.

Benefits:

We offer a remote-first approach with flexible working hours.
You will receive Apple hardware for work.
Flexible Paid Time Off (PTO) is provided to support work-life balance.
An annual bonus is part of the compensation package.
Top-tier health insurance is offered for peace of mind.
A public transportation pass will be provided to support your commute.
The Coverflex benefits package includes meal allowances, well-being, and more.
You will have the opportunity to attend the Air Conference 2025 in Las Vegas to meet the team, collaborate, and grow together!