Remote SRE (Site Reliability Engineer) at Solvd

Description:

Solvd Inc. is a premier software engineering company with 8 offices globally and over 800 international employees.
The company has over 12 years of experience and helps clients create software that improves operations and opens new markets.
Solvd Inc. serves a roster of digital-native enterprise clients, including major brands in retail and social media.
The company is seeking a Site Reliability Engineer to join its growing team.
Responsibilities include collaborating with product, engineering, and operations teams to enhance the reliability, scalability, and performance of infrastructure and services.
The role involves overseeing the end-to-end management of production systems to ensure high availability and rapid recovery from failures.
The engineer will develop and maintain SRE best practices through automation, monitoring, and alerting to minimize system downtime.
Responsibilities also include creating and managing infrastructure-as-code (IaC) layers, scripts, deployment frameworks, and tools that keep environments running efficiently.
The engineer will work closely with the software engineering team to design and implement monitoring and alerting systems.
Incident response and root cause analysis for critical issues will be part of the role, focusing on eliminating causes of outages or poor performance.
The engineer will be responsible for the performance and scalability of AWS environments, ensuring they meet service level objectives (SLOs).
Providing expertise during client meetings to address reliability and scalability questions is also expected.
The role includes maintaining comprehensive documentation on system architecture, processes, and runbooks.
Engaging in capacity planning, disaster recovery exercises, and postmortem reviews for continual improvement in system resilience is required.

Requirements:

A Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience, is required.
At least 5 years of professional experience in a Site Reliability Engineering (SRE), DevOps, or similar role is necessary.
Strong expertise in Amazon Web Services (AWS) is required, with AWS certifications being a plus.
Proficiency in infrastructure-as-code (IaC) tools like CloudFormation or Terraform is essential.
The candidate must be skilled in at least one programming language such as Python, Java, or Go, with experience in scripting for automation and systems management.
Expertise in automating cloud-native technologies and provisioning infrastructure across large environments is required.
Proven experience in building CI/CD pipelines and automating deployment processes with tools like Jenkins, GitLab, or AWS CodePipeline is necessary.
Hands-on experience with containerization technologies, such as Docker and Kubernetes, is required for managing microservices-based architectures.
A deep understanding of Linux systems, networking, and security best practices is essential.
The candidate must demonstrate the ability to work with monitoring tools (e.g., Prometheus, Grafana) and troubleshoot live systems.
Excellent communication skills are required, with the ability to collaborate with cross-functional teams and explain complex concepts to clients and non-technical stakeholders.

Benefits:

Working with a premier software engineering company that has a global presence and a diverse team.
Opportunities to collaborate with top brands in retail and social media.
The chance to enhance skills in Site Reliability Engineering and cloud technologies.
A dynamic work environment that encourages continual improvement and innovation.
The opportunity to engage in capacity planning and disaster recovery exercises.
A role that offers the chance to work with cutting-edge technologies and methodologies in software engineering.