Remote Senior Site Reliability Engineer (AWS, AI/ML, & APM)


Description:

Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join their SRE team.
The role involves ensuring the reliability, scalability, and performance of services.
Responsibilities include providing on-call production support, working on customer and internal engineering tickets, and managing the SRE backlog.
The SRE will continuously monitor the health and performance of services, respond to alerts and incidents, and develop automation scripts to streamline operations.
The SRE will help troubleshoot incidents, perform root cause analysis, and implement long-term fixes.
The SRE will participate in system improvements, collaborate with software engineers, and create documentation for processes and troubleshooting guides.
The role includes capacity planning to anticipate future needs and ensure the infrastructure can handle growth.
The SRE must implement and follow security best practices to protect systems and data.

Candidates must have 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
Experience supporting AI/ML infrastructure, including model deployment and integration with services like AWS Bedrock, is highly desirable.
Expertise in Linux/Unix systems and cloud platforms such as AWS, Azure, or Google Cloud is required.
Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++) is required.
Familiarity with AI/ML operations, including model lifecycle management and inference performance tuning, is expected.
Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging and monitoring is required.
Knowledge of configuration management tools (Ansible, Chef, Puppet) is necessary.
Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks, is a plus.
Relevant certifications such as AWS Certified DevOps Engineer or AWS Certified Machine Learning – Specialty are advantageous.

Granicus offers a competitive benefits package that allows employees to tailor benefits to their needs.
Benefits include flexible time off, medical (with an option that is paid 100% by Granicus), dental, and vision insurance.
Employees can participate in a 401(k) plan with matching contributions.
Paid parental leave is provided, along with employer-paid short- and long-term disability insurance, group term life insurance, and AD&D insurance.
Group legal coverage is also included, among other benefits.