Description:

The Discogs Platform team builds and supports performant, cost-effective, reliable infrastructure and developer experience tooling, and establishes organization-wide standards that improve engineering velocity.
The Senior Site Reliability Engineer will contribute to the Platform team’s centralized infrastructure, including maintenance, monitoring, and automation of services ranging from databases to Kubernetes.
This role involves leading incident response and postmortem efforts and working closely with other engineering teams to understand their needs and drive improvements to technologies and processes.
Responsibilities include maintaining the organization’s cloud presence in AWS, automating and deploying infrastructure configurations using Infrastructure as Code (IaC), and mentoring engineering squads on Platform best practices.
The engineer will assist engineering squads with capacity planning, infrastructure budgeting, and production readiness, as well as writing documentation and runbooks for the engineering organization’s knowledge base.
Implementing monitoring and alerting systems with Discogs observability tools and working in a containerized, orchestrated environment are also key tasks.
Participating in the on-call rotation, responding to incidents, and troubleshooting data and other operational issues are required.
The engineer will contribute to efforts on the reliability and design patterns of Kafka, Kafka Connect, and database implementations.

Requirements:

A Bachelor's degree in Computer Science or a similar field, or equivalent relevant work experience, is required.
A minimum of 5 years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles is necessary.
Required skills include Infrastructure as Code (Terraform), CI/CD (GitHub Actions), GitOps (ArgoCD), and Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests).
Proficiency in AWS and cloud development (VPC, EKS, RDS, S3), FinOps and cloud cost optimization, and observability tools (Datadog, Sentry) is essential.
Scripting skills in Shell and Python are required, along with a track record of collaboration and mentorship.
Excellent written communication and documentation skills, a commitment to continuous learning, and a proactive approach to solving large problems are necessary.
Preferred skills include Kafka cluster administration (Strimzi), Kafka Connect (Debezium, JDBC), relational database administration (MySQL, Percona Server, AWS RDS), and Elasticsearch (ECK administration).
Additional preferred skills include Python (SQLAlchemy, FastAPI), GraphQL (schema design, Apollo Federation), REST APIs, HashiCorp Vault, Redis, and Memcached.

Benefits:

Competitive compensation includes a salary and a performance-related bonus program.
A 401(k) plan with employer match is provided.
The company offers 100% company-paid medical and dental insurance benefits for employees and their dependents.
Employees receive 4 weeks of paid vacation, which increases based on tenure.
Birth mothers are entitled to 18 weeks of paid leave, while all employees can take 8 weeks of paid parental leave, including for adoption.
A monthly wellness allowance and an annual professional and personal development allowance are included.
The company provides work-from-home office set-up and expense allowances, along with flexible work location opportunities.
Employer matching toward charitable contributions is also offered.