Please let Sift know you found this job on RemoteYeah.
This helps us grow 🌱.
Description:
The Core Platform team maintains and optimizes the data, infrastructure, messaging, and services platform that powers Sift’s online systems.
The team ensures these systems are always available, reliable, and performing at their best to meet customer needs.
In the event of an outage or failure, the team follows well-practiced recovery plans to restore services swiftly.
Managing complex, large-scale systems requires continuous monitoring and proactive maintenance to uphold these standards.
Responsibilities include owning the availability, performance, and scalability of Sift’s primary online storage systems and infrastructure.
The role involves designing and building immutable infrastructure and fault-tolerant, multi-AZ/multi-region systems that are resilient and self-healing.
The engineer will design and implement multi-region deployments, such as Bigtable clusters spanning multiple regions, ensuring specific customers are routed to designated regions.
The position requires solving complex problems arising from unique data volume and request rates, which may involve deep dives into data store and messaging internals.
The engineer will optimize local development and testing workflows to be fast, efficient, and seamless.
Responsibilities also include designing and implementing services and libraries for components to interact with data stores, messaging layers, and services platforms.
The role involves developing tools for monitoring, detecting faults, and automatically repairing distributed systems.
The engineer will provide design support to internal engineering teams for optimal usage of data stores, data growth planning, production workload optimization, messaging, caching, and service platforms.
Participation in on-call support and incident response activities is required, providing 12/7 coverage for one calendar week approximately once every 3-4 weeks.
Requirements:
Candidates must have 8+ years of experience as a Software Engineer focused on infrastructure/platform services or in a Site Reliability Engineering (SRE) role.
Strong programming skills in languages such as Java, Scala, or Python are required.
Experience designing and implementing distributed systems is essential.
Candidates must have experience building and managing cloud infrastructure on AWS or GCP.
Expertise in building infrastructure as code and automating provisioning processes using tools like CloudFormation or Terraform is necessary.
Proficiency in setting up and managing monitoring and alerting systems, both open-source and commercial, is required.
Familiarity with Docker and container orchestration technologies like Kubernetes, GKE, or AWS ECS is important.
Strong experience troubleshooting and resolving production system issues, with a focus on building automated solutions to prevent future occurrences, is needed.
Proven expertise in automation and a solid understanding of configuration management tools are required.
Benefits:
The position offers a competitive total compensation package.
A 401(k) plan is included as part of the benefits.
Medical, dental, and vision coverage is provided.
Wellness reimbursement is available to employees.
Education reimbursement is offered to support continuous learning.
Flexible time off is part of the benefits package.