Remote Senior Site Reliability Engineer at Lightspeed Commerce

Description:

The Senior Site Reliability Engineer will be part of the Services SRE Team, responsible for the observability, scalability, and reliability of the Services Platform.
The role involves performing updates to multi-region, multi-tenant, multi-platform global cloud infrastructure in critical production environments.
The engineer will champion corporate initiatives to enhance the scalability, reliability, and observability of the Services platform.
Responsibilities include acting as a subject matter expert and incident lead during the incident response process.
The position requires initiating and contributing to continuous improvement of software delivery processes in a multidisciplinary team.
The engineer will design and architect operational solutions aimed at increasing efficiency, performance, and standardization of operational tasks.
The role emphasizes reliability and assisting teams in delivering reliable software.
The engineer must follow and advocate for best practices such as Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies.
The engineer will provide timely assistance and remediation solutions during critical situations and production incidents, with on-call responsibilities.
As a Cloud Efficiency Expert, the engineer will mentor the SRE Team on cloud cost optimizations and understand how different business units utilize cloud services.
The role includes building resource attribution, measurement, monitoring, and quota management frameworks.
The engineer will work with development teams to improve architecture for optimal utilization and performance of cloud technologies.

Requirements:

A passion for scalability, reliability, and observability, and a desire to share that passion with others in a constructive way.
Experience leading projects that require coordination and collaboration with other development teams.
A desire to grow in championing process changes in pursuit of the SRE mandate.
Proven track record in optimizing cloud services, including data pipelines, storage, databases, and caching layers.
Ability to evaluate tradeoffs between different architectures, such as single-tenant vs multi-tenant deployments.
Understanding of different types of SLAs/SLOs and resource contracts, including reserved instances and savings plans.
An analytical mindset, with a focus on metrics to drive technical decisions.
Good understanding of Agile development and continuous delivery best practices, software engineering tools, processes, methods, and testing.
Experience as the primary owner of customer-facing, zero-downtime production environments on major cloud platforms (AWS, GCP, Azure).
Familiarity with CI/CD pipelines (CircleCI, Jenkins, GitHub, ArgoCD, Helm), containers (Docker, Kubernetes), and Infrastructure as Code (Terraform).
Proficiency in programming or scripting languages such as Python, Ruby, Java, or Golang.

Benefits:

The opportunity to join a growing team and contribute to its advancement.
Amazing benefits and perks, including equity for all employees.
Continuous development of skills and business acumen with limitless growth opportunities.
A flexible work culture with lots of autonomy.
Innovation time allocated for exploration and learning at work.
Opportunities to shape the company by joining cultural and technical committees.
Numerous growth opportunities into technical or people management roles.
The chance to be part of a fast-paced, high-growth company.
Opportunities to learn, expand skill sets, and build relationships within a diverse and inclusive environment.
A range of benefits including a Lightspeed equity scheme, flexible paid time off, health insurance, pension plan contributions, and a health and wellness benefit of $500 per year.
Paid leave and assistance for new parents, mental health support, training opportunities, a volunteer day, and a fully stocked kitchen.