Remote Senior Site Reliability Engineer (AWS, AI/ML, & APM)

at Granicus

Posted 4 days ago | 4 applied

Description:

  • Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join their SRE team.
  • The role involves ensuring the reliability, scalability, and performance of services.
  • Responsibilities include providing on-call production support, working on customer and internal engineering tickets, and managing the SRE backlog.
  • The SRE will continuously monitor the health and performance of services, respond to alerts and incidents, and develop automation scripts to streamline operations.
  • The position requires assisting in troubleshooting incidents, performing root cause analysis, and implementing long-term fixes.
  • The SRE will participate in system improvements, collaborate with software engineers, and create documentation for processes and troubleshooting guides.
  • Capacity planning activities will be part of the role to anticipate future needs and ensure infrastructure can handle growth.
  • Security best practices must be implemented and adhered to in order to protect systems and data.

Requirements:

  • Candidates must have 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
  • Experience supporting AI/ML infrastructure, including model deployment and integration with services like AWS Bedrock, is highly desirable.
  • Expertise in Linux/Unix systems and cloud platforms such as AWS, Azure, or Google Cloud is required.
  • Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++) is necessary.
  • Familiarity with AI/ML operations, including model lifecycle management and inference performance tuning, is expected.
  • Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging and monitoring is required.
  • Knowledge of configuration management tools (Ansible, Chef, Puppet) is necessary.
  • Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks, is a plus.
  • Relevant certifications such as AWS Certified DevOps Engineer or AWS Certified Machine Learning – Specialty are advantageous.

Benefits:

  • Granicus offers a competitive benefits package that allows employees to tailor benefits to their needs.
  • Benefits include flexible time off, medical (with an option that is paid 100% by Granicus), dental, and vision insurance.
  • Employees can participate in a 401(k) plan with matching contributions.
  • Paid parental leave is provided, along with employer-paid short- and long-term disability insurance, group term life insurance, and AD&D insurance.
  • Group legal coverage is also included, among other benefits.