Remote Sr. Site Reliability Engineer at AuthZed

Description:

We are seeking a Senior Site Reliability Engineer to join our tech startup in the infrastructure and authorization space.
As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, availability, and performance of our systems.
You will be responsible for designing, implementing, and maintaining scalable infrastructure solutions to support our growing customer base.
This is an exciting opportunity to work in a fast-paced environment and contribute to the success of a company bringing a Google-inspired authorization system to companies around the globe.
These solutions must be highly available and serve our projects, products, and customers.
You will monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
You will automate infrastructure deployment and configuration management processes.
You will continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
You will troubleshoot and resolve complex infrastructure and application issues in production and test environments.
You will collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
You will participate in the on-call rotation and respond to production incidents in a timely manner.
You will document system configurations, troubleshooting procedures, and operational guidelines.

Requirements:

Proven experience as a Site Reliability Engineer or in a similar role is required.
A strong understanding of networking, operating systems, and cloud infrastructure is necessary.
Experience with site reliability engineering, system design, and distributed computing is essential.
You should have experience with several programming languages and runtimes, including Node.js, Java, Python, Ruby, and Go.
Experience with containerization technologies such as Docker and Kubernetes is required.
Knowledge of infrastructure-as-code tools like Terraform and Pulumi is necessary.
Familiarity with monitoring and logging tools, such as Prometheus, Grafana, and the ELK stack, is important.
Experience with the low-level implementation details of relational databases is preferred; experience with distributed SQL databases such as Google Cloud Spanner or CockroachDB is a bonus.
Experience working with Git and GitHub is required.
Experience with continuous integration and deployment systems is necessary.
Strong problem-solving and troubleshooting skills are essential.
Excellent communication and collaboration abilities are required.

Benefits:

This position offers the opportunity to work remotely from the U.S. or EU.
You will be part of a dynamic and innovative tech startup environment.
You will have the chance to contribute to the development of a cutting-edge authorization system.
The role provides opportunities for professional growth and development in the field of Site Reliability Engineering.
You will work with a talented team of professionals in a fast-paced and collaborative setting.