Remote Senior Site Reliability Engineer

at Cordial

Posted 4 days ago 1 applied

Description:

  • Cordial is seeking a motivated and talented Site Reliability Engineer to help monitor, develop, and scale the Cordial platform.
  • The goal is to provide clients with a delightful experience and ensure that expected jobs and background processes run without issues.
  • The engineer will work with DevOps and Product teams to optimize performance, squash bugs, and reveal blind spots through comprehensive monitoring.
  • Responsibilities include administering, monitoring, and troubleshooting application and network components in a cloud-based environment, specifically AWS.
  • The role involves designing, authoring, deploying, and monitoring manifests for Kubernetes clusters, Helm charts/repos, and service mesh configurations.
  • The engineer will actively contribute to platform infrastructure design and implementation discussions.
  • The engineer will apply software engineering skills to trace and debug code and to identify root causes of production data corruption and performance issues.
  • The position requires providing production support for Product Development teams and participating in an on-call rotation.
  • The engineer will develop and deploy monitoring and alerting architecture and implement monitoring/logging solutions.
  • Troubleshooting complex issues in a timely manner is essential to maintain the performance and stability of the Production Application environment.
  • The role also includes helping to build out SLOs and documenting and monitoring SLAs.

Requirements:

  • Candidates must have 5+ years of experience in UNIX/Linux Systems and Network Administration, including DNS, IPsec, VPN, Load Balancing, and process tracing.
  • Experience with AWS, specifically EC2 and EKS, is required.
  • Candidates should have experience deploying and/or maintaining Kubernetes/EKS clusters.
  • Hands-on experience writing and maintaining custom Helm charts is necessary.
  • Experience working with one or more service meshes such as AWS App Mesh, Istio, or Linkerd is required.
  • Familiarity with monitoring, logging, and alerting tools is essential.
  • Previous experience as a Site Reliability Engineer (SRE) and/or in a DevOps role is necessary.
  • Development experience in PHP is required.
  • Extensive experience with Docker/containers and Kubernetes is necessary.
  • Experience with HashiCorp products such as Consul and Vault is required.
  • Candidates must be comfortable working in a globally distributed team across time zones.
  • Strong teamwork and communication skills are essential.
  • A genuine desire to learn new technologies and grow is required.
  • Fluency in verbal and written English is necessary.
  • Experience with large-scale distributed systems is required.
  • Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation is necessary.
  • Understanding of observability principles, including distributed tracing, and of tools such as Prometheus, Grafana, and the ELK stack is required.
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo CD is necessary.
  • A strong grasp of networking fundamentals and security best practices in a cloud environment is required.

Benefits:

  • The position offers a salary range of $140,000 to $180,000 annually, which may be adjusted based on experience and location.
  • In addition to the base salary, the compensation package includes equity and bonuses.
  • A robust benefits plan is provided, including medical, dental, vision, and life insurance.
  • The company offers a 401(k) match and flexible time off.
  • Additional perks include monthly wellness and cell phone stipends, childcare support, and yearly reimbursements for continued education.
  • Cordial emphasizes maintaining a healthy work/life balance and has a strong dedication to diversity, equity, and inclusion efforts.
  • The company fosters an overall respectful and open culture.
