Description:

As a Site Reliability Engineer (SRE), you will play a key role in designing, implementing, and maintaining scalable infrastructure while ensuring system reliability and efficiency.
Your focus will be on automation, performance optimization, and cloud resource management.
You will collaborate with cross-functional teams to streamline CI/CD pipelines, enhance monitoring solutions, and support a highly available infrastructure.
This position requires a proactive approach to troubleshooting and continuous improvement, ensuring seamless integration of new services while leveraging the latest SRE best practices.
You will design, build, and maintain highly scalable cloud infrastructure using Terraform and Terragrunt for automated resource provisioning.
You will manage and optimize AWS cloud environments, ensuring security, cost efficiency, and high availability.
You will oversee data streaming platforms using Confluent Cloud and Kafka, ensuring reliable data pipelines.
You will deploy and manage Redis instances for caching and real-time data processing.
You will implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and Opsgenie.
You will enable feature flag management and controlled rollouts with LaunchDarkly.
You will manage Kubernetes clusters, using Helm, Argo CD, Istio, and Kustomize for continuous deployment and infrastructure-as-code practices.
You will collaborate with development teams to integrate new services into the infrastructure seamlessly.
You will troubleshoot complex system issues to maintain high availability and performance.
You will continuously improve automation tools, processes, and methodologies to enhance system scalability.

Requirements:

You must have 4+ years of experience in Site Reliability Engineering or a similar role with a strong focus on cloud infrastructure.
You should have expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
You need deep knowledge of AWS cloud services and best practices for scalable and secure architectures.
You must have hands-on experience with Confluent Cloud and Kafka for distributed data streaming.
Strong experience with Redis for caching and RDS for data storage is required.
Proficiency with OpenSearch, Elasticsearch, or ChaosSearch for search and analytics is necessary.
You should have advanced knowledge of monitoring tools like Prometheus, Grafana, Alertmanager, and Opsgenie.
Experience with LaunchDarkly for feature flag management is essential.
Extensive experience managing Kubernetes clusters, including Helm for package management, Argo CD for deployments, and Istio for service mesh configurations, is required.
Familiarity with Kustomize for Kubernetes resource configuration is necessary.
You must possess strong problem-solving skills and the ability to troubleshoot complex systems in production environments.
Excellent communication and collaboration skills within agile teams are required.

Benefits:

You will receive a competitive salary based on experience and qualifications.
The position is fully remote and offers a collaborative team environment.
Comprehensive healthcare coverage, including medical, dental, and vision plans, is provided.
A retirement savings plan with company matching is available.
Flexible paid time off (PTO) is offered to support work-life balance.
Professional development opportunities, including training and certifications, are provided.
You will have access to cutting-edge technology and opportunities to work on innovative projects.

Remote Site Reliability Engineer - (Remote - Canada)
