Description:
The Discogs Platform team is focused on building and supporting performant, cost-effective, and reliable infrastructure and developer experience tooling, and on creating organization-wide standards that improve engineering velocity.
The Senior Site Reliability Engineer will contribute to the Platform team’s centralized infrastructure, including maintenance, monitoring, and automation of services ranging from databases to Kubernetes.
This role involves leading incident response and postmortem efforts and working closely with other engineering teams to understand their needs and drive improvements to technologies and processes.
Responsibilities include maintaining the organization’s cloud presence in AWS, automating and deploying infrastructure configurations using Infrastructure as Code (IaC), and mentoring engineering squads on Platform best practices.
The engineer will assist engineering squads with capacity planning, infrastructure budgeting, and production readiness, as well as writing documentation and runbooks for the engineering organization’s knowledge base.
Implementing monitoring and alerting systems with Discogs’ observability tools and working in a containerized, orchestrated environment are also key tasks.
Participating in the on-call rotation, responding to incidents, and troubleshooting data and other operational issues are required.
The engineer will contribute to the reliability and design patterns of Kafka, Kafka Connect, and database implementations.
Requirements:
A Bachelor's Degree in Computer Science or a similar area of focus, or equivalent relevant work experience, is required.
A minimum of 5 years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles is necessary.
Required skills include Infrastructure as Code (Terraform), CI/CD (GitHub Actions), GitOps (ArgoCD), and Kubernetes (EKS, Kustomize, Karpenter, cluster administration, and application manifests).
Proficiency in AWS and cloud development (VPC, EKS, RDS, S3), FinOps and cloud cost optimization, and observability tools (Datadog, Sentry) is essential.
Scripting skills in Shell and Python are required, along with a track record of collaboration and mentorship.
Excellent written communication and documentation skills, a commitment to continuous learning, and a proactive approach to solving large problems are necessary.