Please let Gretel know you found this job on RemoteYeah.
This helps us grow 🌱.
Description:
At Gretel, the mission is to build the world’s first developer platform for synthetic data, addressing the data bottleneck problem for developers, data scientists, and AI/ML researchers.
The Senior or Staff Site Reliability Engineer (SRE) will ensure the safety, security, and reliability of Gretel's cloud infrastructure, including its compute infrastructure, container orchestration platform, deployment pipelines, and observability stack.
Responsibilities include building and maintaining Gretel's observability stack, measuring and monitoring availability, latency, and overall system health.
The role involves scaling systems sustainably through automation and continuously improving them as they evolve.
The SRE will manage and lead incident response, recovery, and blameless postmortems.
The position requires partnering with software engineers to troubleshoot production issues.
The engineer will build tools and frameworks to enhance productivity for Gretel engineers.
The role includes shipping complex ML/AI models in collaboration with Gretel's applied science and engineering teams.
Requirements:
Candidates must have experience with at least one cloud platform, with a strong preference for AWS.
Proficiency in Docker and Kubernetes is required.
The ability to write software and tools in Python or Go is necessary.
Experience with monitoring, alerting, and operations is essential.
Candidates should have experience operating highly available distributed systems in the cloud.
The ability to identify, diagnose, and respond to operational outages is required.
Preferred qualifications include experience with infrastructure as code tools like Terraform or CloudFormation.
Familiarity with build systems such as Bazel is a plus.
Experience shipping applications with complex dependencies, such as PyTorch or TensorFlow, is preferred.
Software engineering skills beyond script writing, including TDD and design patterns, are desirable.
Experience with DevOps or CI/CD pipelines is also preferred.
Benefits:
Compensation for the position will be determined based on interview performance, level of experience, specialization of skills, and market rate.
The salary range for the Senior or Staff Site Reliability Engineer role is between $180,000 and $230,000 USD.
During the offer discussion, the recruiter will review the finalized base salary, bonus (if applicable), benefits, perks, and stock options.
Gretel is committed to creating an inclusive environment and celebrates diversity among its employees.
Accommodations are available for candidates with disabilities during the recruitment process.