The position is full-time and requires a minimum of 5 years of experience.
The Site Reliability Engineer will help build and maintain highly reliable, scalable, and secure infrastructure and applications.
The role will focus on automating operations, improving system performance, and ensuring overall service health by applying modern SRE practices.
Key responsibilities include designing, implementing, and managing Kubernetes-based infrastructure.
The engineer will utilize AWS services such as IAM, EC2, EKS, S3, and CloudWatch to build and support scalable cloud environments.
The engineer will develop and maintain automation scripts and tools in Shell or Python.
The engineer will proactively identify, analyze, and troubleshoot complex application, network, and system-level issues.
The engineer will optimize system performance and reliability, drawing on deep expertise in Linux debugging and performance tuning.
The role includes building automation for system self-healing and recovery.
The engineer will develop monitoring and alerting solutions for high-performance, low-latency applications.
The engineer will collaborate with development and operations teams to implement effective CI/CD pipelines.
The engineer will apply SRE principles including service monitoring, alerting, error budget tracking, capacity planning, fault tolerance, automation, and toil reduction.
The engineer is expected to continuously look for ways to improve system reliability and engineering processes.
Requirements:
Proven experience working with Kubernetes in production environments is required.
A strong command of AWS cloud services, with hands-on experience in infrastructure provisioning and management, is necessary.
Proficiency in scripting or programming, preferably in Shell or Python, is essential.
In-depth Linux knowledge, including tools for diagnostics and performance optimization, is required.
Familiarity with modern observability tools for monitoring, logging, and alerting is necessary.
Strong troubleshooting and problem-solving skills are essential for this role.
An understanding of SRE concepts and best practices, along with experience applying them, is required.
Benefits:
The position offers the opportunity to work with cutting-edge technologies in a dynamic environment.
Employees will have the chance to enhance their skills in Site Reliability Engineering and cloud infrastructure.
The role provides a platform for collaboration with talented development and operations teams.
There are opportunities for continuous learning and professional growth within the organization.