Remote Staff Site Reliability Engineer

Please let Agiloft know you found this job on RemoteYeah. This helps us grow 🌱.

Description:

  • As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems.
  • You will work closely with different functional teams to create a stable, efficient, and scalable environment, leading complex projects requiring collaboration with multiple stakeholders.
  • Your responsibilities will include defining and enforcing SRE best practices and standards.
  • You will architect and implement highly reliable and scalable systems.
  • You will lead complex post-incident reviews and implement systemic improvements.
  • Collaboration with product and engineering teams to set reliability targets will be part of your role.
  • You will manage high-impact incidents and coordinate incident response.
  • Contributing to budget planning and resource allocation is expected.
  • You will lead efforts to establish disaster recovery strategies.
  • Providing technical leadership and mentorship to the SRE team is essential.
  • You will continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.
  • Participation in on-call rotation is required.
  • Other duties may be assigned as needed.

Requirements:

  • You must have 8-10 years of experience in a similar or related role.
  • A Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience) is required.
  • In-depth knowledge of Cloud Ops technologies, including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC) tools, is necessary.
  • Advanced knowledge of Linux operating systems and troubleshooting OS issues is essential.
  • Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools) is required.
  • You should have an in-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery).
  • A strong understanding of incident management, capacity planning, disaster recovery, and observability practices (in tools such as OpenTelemetry and Jaeger) is needed.
  • Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices) is important.
  • Strong analytical and problem-solving skills are required.
  • Knowledge of Linux systems and common system administration tasks is necessary.
  • A strong understanding of programming/scripting languages (such as Python), including scripting skills in multiple languages to automate SRE operations, is essential.
  • Excellent communication and teamwork skills are required.
  • A willingness to learn and adapt in a fast-paced, dynamic environment is necessary.

Benefits:

  • Agiloft offers a working environment that supports healthy work/life balance, including floating holidays and a quarterly, no-questions-asked wellness day.
  • The company is committed to ensuring a diverse and inclusive workplace, allowing individuals to bring their authentic selves to work.
  • Employees are encouraged to participate in multiple Employee Resource Groups (ERGs).
  • Agiloft values employee experience, believing that an excellent employee experience leads to an excellent customer experience.
  • There is no application deadline for this opportunity, allowing for ongoing review of applications.