Remote Staff Site Reliability Engineer at Agiloft

Description:

As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems.
You will work closely with different functional teams to create a stable, efficient, and scalable environment, leading complex projects requiring collaboration with multiple stakeholders.
Your responsibilities will include defining and enforcing SRE best practices and standards.
You will architect and implement highly reliable and scalable systems.
You will lead complex post-incident reviews and implement systemic improvements.
Collaboration with product and engineering teams to set reliability targets will be part of your role.
You will manage high-impact incidents and coordinate incident response.
Contributing to budget planning and resource allocation is expected.
You will lead efforts to establish disaster recovery strategies.
Providing technical leadership and mentorship to the SRE team is essential.
You will continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.
Participation in the on-call rotation is required.
Other duties may be assigned as needed.

Requirements:

You must have 8-10 years of experience in a similar or related role.
A Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience) is required.
In-depth knowledge of Cloud Ops technologies, including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC) tools, is necessary.
Advanced knowledge of Linux operating systems and of troubleshooting OS issues is essential.
Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools) is required.
You should have an in-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery).
A strong understanding of incident management, capacity planning, disaster recovery, and observability practices (using tools such as OpenTelemetry and Jaeger) is needed.
Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices) is important.
Strong analytical and problem-solving skills are required.
Knowledge of Linux systems and common system administration tasks is necessary.
Strong proficiency in a programming or scripting language (such as Python), along with scripting skills in additional languages for automating SRE operations, is essential.
Excellent communication and teamwork skills are required.
A willingness to learn and adapt in a fast-paced, dynamic environment is necessary.

Benefits:

Agiloft offers a working environment that supports a healthy work/life balance, including floating holidays and a quarterly, no-questions-asked wellness day.
The company is committed to ensuring a diverse and inclusive workplace, allowing individuals to bring their authentic selves to work.
Employees are encouraged to participate in multiple Employee Resource Groups (ERGs).
Agiloft values employee experience, believing that an excellent employee experience leads to an excellent customer experience.
There is no application deadline for this opportunity; applications are reviewed on an ongoing basis.