This job post is closed and the position is probably filled. Please do not apply.
🤖 Automatically closed by a robot after the apply link was detected as broken.
Description:
The Site Reliability Engineer II role focuses on owning the availability of Sumo Logic’s planet-scale observability and security products.
The position is remote and can be performed from anywhere in India.
The engineer will work alongside a global SRE team to execute projects in a reliability roadmap specific to their product area.
Responsibilities include optimizing operations, increasing efficiency in cloud resource usage, hardening security posture, and enhancing feature velocity for developers.
The engineer will support engineering teams by maintaining and executing a reliability roadmap that identifies opportunities for improvement in reliability, maintainability, security, efficiency, and velocity.
Collaboration with the developer infrastructure, Global SRE, and product area engineering teams is essential to refine the reliability roadmap.
Participation in defining, evolving, and managing Service Level Objectives (SLOs) for several teams is required.
The role includes participating in on-call rotations to understand operational workloads and improve the on-call experience.
The engineer will complete projects to optimize the on-call experience and improve the lifecycle of microservices and architectural components.
Writing code and automation to reduce operational workload and improve security posture is a key responsibility.
The engineer will work closely with developer infrastructure teams to expedite the adoption of tools that advance the reliability roadmap.
Scaling systems sustainably through automation and facilitating blame-free root cause analysis meetings for incidents are part of the role.
The engineer will drive root cause identification and issue resolution with teams in a fast-paced iterative environment.
Requirements:
Candidates must have cloud-native application development experience leveraging best practices and design patterns.
Strong debugging and troubleshooting skills across the entire technology stack are required.
A deep understanding of AWS Networking, Compute, Storage, and managed services is essential.
Competency with modern CI/CD and infrastructure tooling, such as Kubernetes, Terraform, Ansible, and Jenkins, is necessary.
Experience with full lifecycle support of services, from creation to production support, is required.
Candidates should be well versed in Infrastructure as Code practices using technologies such as Terraform or AWS CloudFormation.
The ability to write production-ready code in at least one of Java, Scala, or Go is required.
Experience with Linux systems and proficiency with the command line are necessary.
Candidates must understand and apply modern approaches to cloud-native software security.
Experience with agile frameworks, such as Scrum and Kanban, is required.
Flexibility and willingness to step into new roles and responsibilities are essential.
A willingness to learn and use Sumo Logic products for solving reliability and security issues is necessary.
A Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or another scientific or technical discipline is required.
A minimum of 2 years of industry experience is necessary.
Benefits:
Employees will have the opportunity to work in a remote environment, providing flexibility in their work-life balance.
The role offers the chance to work with cutting-edge technology in a fast-paced and innovative company.
Employees will gain experience in optimizing operations and improving the reliability of large-scale systems.
The position allows for collaboration with global teams, enhancing professional growth and networking opportunities.
Employees will have access to continuous learning and development opportunities, particularly in cloud-native technologies and security practices.
The company promotes a culture of blame-free root cause analysis, fostering a supportive work environment.