Remote Site Reliability Engineer - Remote - Canada

This job is closed

This job post is closed and the position is probably filled. Please do not apply. It was closed automatically after the apply link was detected as broken.

Description:

  • As a Site Reliability Engineer (SRE), you will play a key role in designing, implementing, and maintaining scalable infrastructure while ensuring system reliability and efficiency.
  • Your focus will be on automation, performance optimization, and cloud resource management.
  • You will collaborate with cross-functional teams to streamline CI/CD pipelines, enhance monitoring solutions, and support a highly available infrastructure.
  • This position requires a proactive approach to troubleshooting and continuous improvement, ensuring seamless integration of new services while leveraging the latest SRE best practices.
  • You will design, build, and maintain highly scalable cloud infrastructure using Terraform and Terragrunt for automated resource provisioning.
  • You will manage and optimize AWS cloud environments, ensuring security, cost efficiency, and high availability.
  • You will oversee data streaming platforms using Confluent Cloud and Kafka, ensuring reliable data pipelines.
  • You will deploy and manage Redis instances for caching and real-time data processing.
  • You will implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and OpsGenie.
  • You will enable feature flag management and controlled rollouts with LaunchDarkly.
  • You will manage Kubernetes clusters, utilizing Helm, ArgoCD, Istio, and Kustomize for continuous deployment and infrastructure-as-code practices.
  • You will collaborate with development teams to integrate new services into the infrastructure seamlessly.
  • You will troubleshoot complex system issues to maintain high availability and performance.
  • You will continuously improve automation tools, processes, and methodologies to enhance system scalability.

Requirements:

  • You must have 4+ years of experience in Site Reliability Engineering or a similar role with a strong focus on cloud infrastructure.
  • You should have expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
  • You need deep knowledge of AWS cloud services and best practices for scalable and secure architectures.
  • You must have hands-on experience with Confluent Cloud and Kafka for distributed data streaming.
  • Strong experience with Redis for caching and RDS for data storage is required.
  • You should be proficient with OpenSearch/Elasticsearch/ChaosSearch for search and analytics.
  • Advanced knowledge of monitoring tools like Prometheus, Grafana, Alertmanager, and OpsGenie is necessary.
  • Experience with LaunchDarkly for feature flag management is required.
  • You must have extensive experience managing Kubernetes clusters, including Helm for package management, ArgoCD for deployments, and Istio for service mesh configurations.
  • Familiarity with Kustomize for Kubernetes resource configuration is needed.
  • Strong problem-solving skills and the ability to troubleshoot complex systems in production environments are essential.
  • Excellent communication and collaboration skills within agile teams are required.

Benefits:

  • You will receive a competitive salary based on experience and qualifications.
  • The position offers fully remote work flexibility, with a collaborative team environment.
  • Comprehensive healthcare coverage, including medical, dental, and vision plans, is provided.
  • A retirement savings plan with company matching is available.
  • Flexible paid time off (PTO) is offered to support work-life balance.
  • Professional development opportunities, including training and certifications, are provided.
  • You will have access to cutting-edge technology and opportunities to work on innovative projects.