Senior Site Reliability Engineer (Remote)

Please, let Discogs know you found this job on RemoteYeah. This helps us grow 🌱.

Description:

  • The Discogs Platform team focuses on building and supporting performant, cost-effective, reliable infrastructure and developer-experience tooling, and on establishing organization-wide standards that improve engineering velocity.
  • The Senior Site Reliability Engineer will contribute to the Platform team’s centralized infrastructure, including maintenance, monitoring, and automation of services ranging from databases to Kubernetes.
  • This role involves leading incident response and postmortem efforts and working closely with other engineering teams to understand their needs and drive improvements to technologies and processes.
  • Responsibilities include maintaining the organization’s cloud presence in AWS, automating and deploying infrastructure configurations using Infrastructure as Code (IaC), and mentoring engineering squads on Platform best practices.
  • The engineer will assist engineering squads with capacity planning, infrastructure budgeting, and production readiness, as well as writing documentation and runbooks for the engineering organization’s knowledge base.
  • Implementing monitoring and alerting systems with Discogs observability tools and working in a containerized, orchestrated environment are also key tasks.
  • Participation in the on-call rotation, responding to incidents, and troubleshooting data and other operational issues is required.
  • The engineer will help improve the reliability and design patterns of the organization’s Kafka, Kafka Connect, and database implementations.

Requirements:

  • A bachelor's degree in Computer Science or a related field, or equivalent relevant work experience, is required.
  • A minimum of 5 years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles is necessary.
  • Required skills include Infrastructure as Code (Terraform), CI/CD (GitHub Actions), GitOps (ArgoCD), and Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests).
  • Proficiency in AWS and cloud development (VPC, EKS, RDS, S3), FinOps and cloud cost optimization, and observability tools (Datadog, Sentry) is essential.
  • Scripting skills in Shell and Python are required, along with a track record of collaboration and mentorship.
  • Excellent written communication and documentation skills, a commitment to continuous learning, and a proactive approach to solving large problems are necessary.
  • Preferred skills include Kafka cluster administration (Strimzi), Kafka Connect (Debezium, JDBC), relational database administration (MySQL, Percona Server, AWS RDS), and Elasticsearch (ECK administration).
  • Additional preferred skills include Python (SQLAlchemy, FastAPI), GraphQL (schema design, Apollo Federation), REST APIs, HashiCorp Vault, Redis, and Memcached.

Benefits:

  • Competitive compensation includes a salary and a performance-related bonus program.
  • A 401(k) plan with employer match is provided.
  • The company offers 100% company-paid medical and dental insurance benefits for employees and their dependents.
  • Employees receive 4 weeks of paid vacation, which increases based on tenure.
  • Birth mothers are entitled to 18 weeks of paid leave, while all employees can take 8 weeks of paid parental leave, including for adoption.
  • A monthly wellness allowance and an annual professional and personal development allowance are included.
  • The company provides allowances for home-office setup and work-from-home expenses, along with flexible work location opportunities.
  • Employer matching toward charitable contributions is also offered.

About the job:

  • Salary: $130,000 to $140,000 USD per year.