Remote Site Reliability Engineer (SRE)

This job post is closed and the position is probably filled. Please do not apply. It was closed automatically after the apply link was detected as broken.

Description:

  • As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the reliability, availability, and scalability of our systems.
  • You will work at the intersection of software development and operations, implementing automation, monitoring, and performance optimization strategies to minimize downtime and improve system resilience.
  • Your responsibilities will include designing and implementing scalable, reliable, and fault-tolerant systems across cloud environments.
  • You will develop and maintain observability tools, including monitoring, logging, and alerting (e.g., Prometheus, Grafana, Datadog, ELK).
  • You will automate infrastructure provisioning, deployment, and incident response using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • You will optimize system performance, scalability, and incident response workflows to improve uptime.
  • You will work closely with development and DevOps teams to improve system design for reliability.
  • You will conduct root cause analysis (RCA) and implement preventative measures to minimize failures.
  • You will ensure high availability by designing and maintaining load balancing, failover, and disaster recovery strategies.
  • You will improve CI/CD pipelines to increase deployment speed while maintaining stability.
  • You will optimize cloud cost and resource utilization for AWS, Azure, or Google Cloud Platform (GCP).
  • You will participate in on-call rotations to address system failures quickly and minimize downtime.

Requirements:

  • You should have 4+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
  • A strong knowledge of cloud platforms (AWS, Azure, or GCP) and cloud-native architectures is required.
  • Experience with observability and monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic) is necessary.
  • Proficiency in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Pulumi is essential.
  • Hands-on experience with containerization and orchestration (Docker, Kubernetes, Helm) is required.
  • Strong Linux system administration and networking fundamentals are necessary.
  • Experience with incident management, debugging, and root cause analysis is expected.
  • Proficiency in scripting (Bash, Python, or Go) for automation and system monitoring is required.
  • Knowledge of load balancing, failover strategies, and distributed systems is essential.
  • An understanding of security best practices, access control, and compliance requirements is necessary.
  • Strong communication skills and the ability to collaborate with cross-functional teams are required.

Benefits:

  • We offer a remote-first approach with flexible working hours.
  • You will receive Apple hardware for work.
  • Flexible Paid Time Off (PTO) is provided to support work-life balance.
  • An annual bonus is part of the compensation package.
  • Top-tier health insurance is offered for peace of mind.
  • A public transportation pass will be provided to support your commute needs.
  • The Coverflex benefits package includes meal allowances, well-being benefits, and more.
  • You will have the opportunity to attend the Air Conference 2025 in Las Vegas to meet the team, collaborate, and grow together!