This job post is closed and the position is probably filled. Please do not apply.
🤖 Automatically closed by a robot after the apply link was detected as broken.
Description:
Lead the implementation and refinement of Site Reliability Engineering (SRE) practices, including SLOs, error budgets, and blameless postmortems
Design and implement automation to enhance system reliability and efficiency
Architect scalable hybrid cloud solutions for Web3 infrastructure
Manage error budgets and prioritize between reliability and new features based on data-driven decisions
Ensure high availability, performance, and reliability under varying load conditions
Collaborate with the Platform engineering team to embed reliability into services
Align SRE strategies with the technical vision of Nethermind’s Infrastructure Leadership department
Implement observability best practices and comprehensive monitoring systems
Develop and maintain service level indicators (SLIs) and service level objectives (SLOs) in collaboration with product owners
Mentor team members in SRE practices and promote continuous learning
Lead capacity planning efforts using quantitative analysis to address future scaling challenges
Contribute to long-term technical roadmaps balancing reliability concerns with product innovation
Requirements:
5+ years of experience in Site Reliability Engineering or DevOps
Expertise in cloud platforms such as AWS and GCP
Proficiency in Kubernetes
Demonstrated experience in designing and implementing scalable, efficient, and resilient systems
Deep understanding of Linux/Unix systems and networking protocols
Strong programming skills in Python or Go
Background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
Familiarity with CI/CD tools (e.g., GitHub Actions, ArgoCD)
Excellent communication skills, with the ability to convey complex technical concepts clearly
Ability to produce technical documentation, runbooks, presentations, and postmortem reports
Experience in mentoring and upskilling team members
Benefits:
Opportunity to lead and mentor a team of Site Reliability Engineers
Work on cutting-edge projects in the blockchain space with a globally distributed team
Collaborate with renowned companies in the industry
Chance to contribute to open-source projects and demonstrate thought leadership in SRE
Exposure to MLOps, big data technologies, and blockchain infrastructure
Experience with chaos engineering principles and traffic management technologies
Potential for career growth and development in a cross-functional environment