Remote Platform Site Reliability Engineer

at Nexthink

Posted 1 day ago

Description:

  • Nexthink is seeking a strong Platform Engineer with SRE operations experience to enhance its infrastructure and improve the deployment, monitoring, and scaling of its systems.
  • The role is crucial for ensuring a seamless, reliable, and scalable experience for customers 24/7 as a SaaS provider.
  • Responsibilities include designing, building, and maintaining the infrastructure for a multi-tenant SaaS platform with a focus on reliability, security, and scalability.
  • The engineer will implement and manage cloud-native systems on AWS using top-tier tools and automation.
  • Operating and enhancing Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery is essential.
  • The role involves establishing and enforcing SLOs, SLAs, and error budgets while proactively addressing availability and performance issues.
  • Developing infrastructure as code using Terraform or similar tools for repeatable and auditable provisioning is required.
  • The engineer will develop platform tooling for automation, monitoring, and provisioning.
  • A solid understanding of the network stack, cloud topologies, and storage solutions is necessary.
  • Monitoring system health and application performance using tools like Datadog, Prometheus, and Grafana is part of the job.
  • The engineer will improve incident response practices and reduce mean time to detect (MTTD) and mean time to recover (MTTR).
  • Troubleshooting incidents with minimal intervention from other functions is expected.
  • Participation in a shared on-call rotation to respond to incidents and troubleshoot outages is required.
  • Collaboration with software engineers to embed reliability and observability into services is essential.
  • Developing automated runbooks, health checks, and alerting to support reliable operations is part of the role.
  • Supporting automated testing, canary deployments, and rollback strategies for safe and reliable releases is necessary.
  • Contributing to security best practices, compliance automation, and cost optimization is expected.

Requirements:

  • A minimum of a BS in Computer Science or Engineering is required.
  • At least 5 years of experience in an SRE/platform engineering role supporting SaaS platforms is necessary.
  • Strong hands-on experience with public cloud services such as AWS, GCP, or Azure is required.
  • Proficiency with Kubernetes, container-based deployment, and related ecosystems is essential.
  • Strong programming or scripting skills in languages such as Python, Go, or Bash are required.
  • Experience with CI/CD pipelines, including tools like GitHub Actions, GitLab CI, or ArgoCD, is necessary.
  • Familiarity with observability stacks such as Prometheus, ELK/EFK, or Datadog is required.
  • Comfort with being part of a rotating on-call schedule, including handling critical incidents, is necessary.
  • Strong system-level troubleshooting skills and a proactive mindset toward incident prevention are essential.
  • A deep understanding of Linux systems, networking, and common troubleshooting practices is required.
  • Experience supporting multi-tenant microservices architectures is necessary.
  • Familiarity with service mesh technologies, such as Istio, is preferred.
  • Knowledge of zero-downtime deployment strategies, including blue/green and canary releases, is required.
  • Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA is preferred, with FedRAMP experience being a plus.
  • Experience with chaos engineering or resilience testing practices is beneficial.

Benefits:

  • Employees enjoy flexible hours and unlimited vacation, including 15 days of holidays, 11 company-paid holidays, and 3 extra days for volunteering.
  • The company offers a hybrid work model that balances office and remote work, with structured onboarding to foster connections and team integration.
  • Free access to professional training platforms is provided to explore interests and enhance skills.
  • Up to 16 weeks of paid leave for birthing parents/primary caregivers and 6 weeks for secondary caregivers are available.
  • A 401(k) plan with up to 4% company matching contributions is offered to help employees grow their retirement savings.
  • Bonuses are available for referring successful hires after three months of continuous employment.
  • Comprehensive benefits include 100% covered health, dental, and vision insurance, as well as access to life insurance, long-term disability, and accidental death/personal loss coverage.