Remote Site Reliability Engineer

at Weekday AI

Posted 10 hours ago

Description:

  • This role is for one of Weekday's clients.
  • The position is full-time and requires a minimum of 5 years of experience.
  • The Site Reliability Engineer will help build and maintain highly reliable, scalable, and secure infrastructure and applications.
  • The role will focus on automating operations, improving system performance, and ensuring overall service health by applying modern SRE practices.
  • Key responsibilities include designing, implementing, and managing Kubernetes-based infrastructure.
  • The engineer will utilize AWS services such as IAM, EC2, EKS, S3, and CloudWatch to build and support scalable cloud environments.
  • Developing and maintaining automation scripts and tools using Shell scripting or Python is essential.
  • The engineer will proactively identify, analyze, and troubleshoot complex application, network, and system-level issues.
  • The engineer will optimize system performance and reliability, drawing on deep expertise in Linux debugging and performance tuning.
  • Building automation for system self-healing and recovery mechanisms is part of the role.
  • Developing monitoring and alerting solutions for high-performance and low-latency applications is necessary.
  • Collaboration with development and operations teams to implement effective CI/CD pipelines is expected.
  • The engineer will apply SRE principles including service monitoring, alerting, error budget tracking, capacity planning, fault tolerance, automation, and toil reduction.
  • Continuously seeking opportunities to improve system reliability and engineering processes is a key aspect of the job.

Requirements:

  • Proven experience working with Kubernetes in production environments is required.
  • A strong command of AWS cloud services, with hands-on experience in infrastructure provisioning and management, is necessary.
  • Proficiency in scripting or programming, preferably in Shell or Python, is essential.
  • In-depth Linux knowledge, including tools for diagnostics and performance optimization, is required.
  • Familiarity with modern observability tools for monitoring, logging, and alerting is necessary.
  • Strong troubleshooting and problem-solving skills are essential for this role.
  • An understanding of SRE concepts and best practices, and experience applying them, is required.

Benefits:

  • The position offers the opportunity to work with cutting-edge technologies in a dynamic environment.
  • Employees will have the chance to enhance their skills in Site Reliability Engineering and cloud infrastructure.
  • The role provides a platform for collaboration with talented development and operations teams.
  • There are opportunities for continuous learning and professional growth within the organization.