Description:

As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the reliability, availability, and scalability of our systems.
You will work at the intersection of software development and operations, implementing automation, monitoring, and performance optimization strategies to minimize downtime and improve system resilience.
Your responsibilities will include designing and implementing scalable, reliable, and fault-tolerant systems across cloud environments.
You will develop and maintain observability tools, including monitoring, logging, and alerting (e.g., Prometheus, Grafana, Datadog, ELK).
You will automate infrastructure provisioning, deployment, and incident response using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
You will optimize system performance, scalability, and incident response workflows to improve uptime.
Collaborating closely with development and DevOps teams to improve system design for reliability is essential.
Conducting root cause analysis (RCA) and implementing preventative measures to minimize failures will be required.
You will ensure high availability by designing and maintaining load balancing, failover, and disaster recovery strategies.
You will improve CI/CD pipelines to increase deployment speed while maintaining stability.
You will optimize cloud cost and resource utilization for AWS, Azure, or Google Cloud Platform (GCP).
Participating in on-call rotations to quickly address system failures and minimize downtime is expected.

Requirements:

You should have 4+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
A strong knowledge of cloud platforms (AWS, Azure, or GCP) and cloud-native architectures is required.
Experience with observability and monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic) is necessary.
Proficiency in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Pulumi is essential.
Hands-on experience with containerization and orchestration (Docker, Kubernetes, Helm) is required.
Strong Linux system administration and networking fundamentals are necessary.
You should have experience with incident management, debugging, and root cause analysis.
Proficiency in scripting (Bash, Python, or Go) for automation and system monitoring is required.
Knowledge of load balancing, failover strategies, and distributed systems is essential.
An understanding of security best practices, access control, and compliance requirements is necessary.
Strong communication skills and the ability to collaborate with cross-functional teams are required.

Benefits:

We offer a remote-first approach with flexible working hours.
You will receive Apple hardware for work.
An annual bonus is part of the compensation package.
Medical insurance, including vision and dental, is provided.
Disability insurance, both short-term and long-term, is included.
A 401(k) plan with up to a 4% contribution is available.
You will receive an Air Stipend of $3,120 per year, paid in 12 monthly installments, for home office, learning, wellness, and similar expenses.
There is an opportunity to attend the Air Conference 2025 in Las Vegas to meet the team, collaborate, and grow together.

Remote Site Reliability Engineer (SRE)
