Remote Site Reliability Engineer

Please, let Unitary know you found this job on RemoteYeah. This helps us grow 🌱.

Description:

  • We are a rapidly growing startup developing solutions that blend human expertise and AI agents to handle manual customer and marketplace operations tasks.
  • Our unique approach combines the strengths of human expertise with the advantages of AI automation to help businesses solve real-world challenges in trust & safety and beyond.
  • We are looking for a Site Reliability Engineer to ensure our systems run smoothly and reliably at scale.
  • Your expertise in monitoring, observability, and system automation will help maintain the high availability and performance our customers depend on.
  • You will work at the intersection of development and operations, using your technical skills to build robust infrastructure and streamline deployment processes.
  • Your mission will be to proactively identify and resolve system issues before they impact our customers.
  • You will collaborate closely with development teams to implement monitoring solutions, create comprehensive alerting systems, and develop the tools needed to maintain system reliability.
  • Initially, you will focus on enhancing our existing monitoring and alerting infrastructure, then gradually build self-healing systems and self-service capabilities that empower teams to diagnose and resolve issues independently.
  • You will design and implement comprehensive alerting systems that detect issues early and provide actionable insights to speed up their resolution.
  • You will collaborate with development teams to ensure our observability stack provides clear visibility into system health and performance.
  • You will optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams.
  • You will build self-healing systems using AI tools that automatically resolve common issues before they require human intervention.
  • You will develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required.
  • You will ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation.
  • You will join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience.

Requirements:

  • We are looking for someone who is excited about building innovative solutions and wants to have a large impact in a smaller company.
  • You should be a collaborative engineer who excels at working across teams and can translate complex technical concepts into actionable solutions.
  • You should be comfortable balancing your time between fixing urgent issues and investing in proactive system improvements.
  • Strong communication skills are crucial, as you'll be working closely with multiple engineers and may need to coordinate during high-stress incident situations.
  • You should have experience with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance.
  • Proficiency with metrics platforms such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data is required.
  • Experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions is necessary.
  • You should demonstrate strong problem-solving skills and the ability to work autonomously.
  • Confidence in writing production code in languages such as Go or Python is essential.
  • You should thrive in a collaborative environment where group output and team achievements carry more weight than individual contributions.
  • It would be even better if you have worked in a fully remote, international team or at a startup, have built Slack bots or similar automation tools, have experience with CI/CD platforms, Kubernetes, and infrastructure-as-code tools, and are familiar with MLOps practices and tools.

Benefits:

  • We are committed to creating a positive and inclusive culture built on genuine interest in each other's well-being.
  • We offer flexible hours and location to accommodate our team members.
  • A competitive salary and equity package is provided to our employees.
  • We offer an occupational pension for long-term financial security.
  • Generous paid parental leave is available to support our employees during family growth.
  • We provide generous paid sick leave to ensure our team can take care of their health.
  • An annual budget for your professional development and growth is allocated to encourage continuous learning.
  • An annual budget for your individual health and wellness is provided to promote well-being.
  • We organise three team offsites in London or other exciting destinations in Europe to foster team bonding and collaboration.