Please let Sift know you found this job on RemoteYeah. This helps us grow 🌱.
Description:
The Core Platform team maintains and optimizes the data, infrastructure, messaging, and services platform that powers Sift’s online systems.
The team ensures these systems are always available, reliable, and performing at their best to meet customer needs.
In the event of an outage or failure, the team follows well-practiced recovery plans to restore services swiftly.
Managing complex, large-scale systems requires continuous monitoring and proactive maintenance to uphold these standards.
Responsibilities include owning the availability, performance, and scalability of Sift’s primary online storage systems and infrastructure.
The role involves designing and building immutable infrastructure and fault-tolerant, multi-AZ/multi-region systems that are resilient and self-healing.
The engineer will design and implement multi-region deployments, such as Bigtable clusters spanning multiple regions, ensuring specific customers are routed to designated regions.
The position requires solving complex problems arising from unique data volume and request rates, which may involve deep dives into data store and messaging internals.
The engineer will optimize local development and testing workflows to be fast, efficient, and seamless.
Responsibilities also include designing and implementing services and libraries for components to interact with data stores, messaging layers, and services platforms.
The role involves developing tools that monitor distributed systems, detect faults, and repair them automatically.
The engineer will provide design support to internal engineering teams on data growth planning, production workload optimization, and optimal use of the data stores, messaging, caching, and service platform.
Participation in on-call support and incident response activities is required, providing 12/7 coverage for one calendar week approximately once every 3-4 weeks.
The technical stack includes GCP, AWS, Airflow, Terraform, Kubernetes, Vault, Jenkins, Kafka, Snowflake, Spark, Java 11, Python 3, Ruby 2.7, and Ruby on Rails.
Requirements:
Candidates must have 8+ years of experience as a Software Engineer focused on infrastructure/platform services or in a Site Reliability Engineering (SRE) role.
Strong programming skills in languages such as Java, Scala, or Python are required.
Experience designing and implementing distributed systems is necessary.
Candidates must have experience building and managing cloud infrastructure on AWS or GCP.
Expertise in building infrastructure as code and automating provisioning processes using tools like CloudFormation or Terraform is essential.
Proficiency in setting up and managing monitoring and alerting systems, both open-source and commercial, is required.
Familiarity with Docker and container orchestration technologies like Kubernetes, GKE, or AWS ECS is necessary.
Strong experience troubleshooting and resolving production system issues, with a focus on building automated solutions to prevent future occurrences, is required.
Proven expertise in automation and a solid understanding of configuration management tools are essential.
Benefits:
The position offers a competitive total compensation package.
A 401(k) plan is provided.
Medical, dental, and vision coverage is included.
Wellness reimbursement is available.
Education reimbursement is offered.
Flexible time off is part of the benefits package.