Please, let Unitary know you found this job on RemoteYeah. This helps us grow 🌱.
Description:
We are a rapidly growing startup developing solutions that blend human expertise and AI agents to handle manual customer and marketplace operations tasks.
Our unique approach combines the strengths of human expertise with the advantages of AI automation to help businesses solve real-world challenges in trust & safety and beyond.
We are looking for a Site Reliability Engineer to ensure our systems run smoothly and reliably at scale.
Your expertise in monitoring, observability, and system automation will help maintain the high availability and performance our customers depend on.
You will work at the intersection of development and operations, using your technical skills to build robust infrastructure and streamline deployment processes.
Your mission will be to proactively identify and resolve system issues before they impact our customers.
You will collaborate closely with development teams to implement monitoring solutions, create comprehensive alerting systems, and develop the tools needed to maintain system reliability.
Initially, you will focus on enhancing our existing monitoring and alerting infrastructure, then gradually build self-healing systems and self-service capabilities that empower teams to diagnose and resolve issues independently.
You will design and implement comprehensive alerting systems that detect issues early and provide actionable insights to speed up their resolution.
You will collaborate with development teams to ensure our observability stack provides clear visibility into system health and performance.
You will optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams.
You will build self-healing systems using AI tools that automatically resolve common issues before they require human intervention.
You will develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required.
You will ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation.
You will join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience.
Requirements:
We are looking for someone who is excited about building innovative solutions and wants to have a large impact in a smaller company.
You should be a collaborative engineer who excels at working across teams and can translate complex technical concepts into actionable solutions.
You should be comfortable balancing your time between fixing urgent issues and investing in proactive system improvements.
Strong communication skills are crucial, as you'll be working closely with multiple engineers and may need to coordinate during high-stress incident situations.
You should have experience with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance.
Proficiency with metrics platforms such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data is required.
Experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions is necessary.
You should demonstrate strong problem-solving skills and the ability to work autonomously.
Confidence in writing production code in languages such as Go or Python is essential.
You should thrive in a collaborative environment where group output and team achievements weigh more heavily than individual input.
It would be even better if you have worked in a fully remote, international team, have previous startup experience, have built Slack bots or similar automation tools, have experience with CI/CD platforms, have worked with Kubernetes and infrastructure-as-code tools, and are familiar with MLOps practices and tools.
Benefits:
We are committed to creating a positive and inclusive culture built on genuine interest in each other's well-being.
We offer flexible hours and location to accommodate our team members.
A competitive salary and equity package is provided to our employees.
We offer an occupational pension for long-term financial security.
Generous paid parental leave is available to support our employees during family growth.
We provide generous paid sick leave to ensure our team can take care of their health.
An annual budget for your professional development and growth is allocated to encourage continuous learning.
An annual budget for your individual health and wellness is provided to promote well-being.
We organise three team offsites in London or other exciting destinations in Europe to foster team bonding and collaboration.