Remote Site Reliability Engineer - SRE at StarCompliance

Description:

Maintain and improve platform reliability, availability, and performance using Azure as the core cloud platform and industry-leading tools.
Collaborate with cross-functional teams to design, implement, and maintain resilient systems, automating operations to minimize downtime.
Proactively identify and resolve potential issues to prevent customer impact.
Contribute to continuous improvement of infrastructure and processes.
Analyze reliability challenges and develop automated solutions for incident resolution.
Work with development teams to enhance operational features that reduce MTTD and MTTR and enable auto-recovery.
Establish SLIs, SLOs, error budgets, and policies for operational performance.
Identify and reduce toil, conduct postmortems, and implement continuous improvements in production operations.
Provide advanced technical support for cross-product issues and incidents.
Use SRE tooling to fulfill the SRE mission, conduct chaos testing, and adopt new tools and technologies that improve platform efficiency.
Drive the reliability and supportability of the cloud service, including change management, customer escalations, remediation plans, playbooks, and automation.
Monitor system health, scale systems sustainably through automation, and improve services throughout their lifecycle.

Requirements:

4+ years of experience in reliability engineering.
2+ years of recent experience with Azure systems.
Advanced knowledge of the New Relic ecosystem.
Working knowledge of monitoring and APM tools like Azure App Insights, Grafana, and Selenium.
Familiarity with networking, including troubleshooting latency, connectivity, and performance issues.
Experience with infrastructure as code (IaC) using Terraform and configuration as code (CaC) using Ansible.
Hands-on experience with SRE practices, chaos engineering experiments, and containerization.
Proficiency in Linux and Windows administration, troubleshooting, and support.
Experience with databases such as SQL Server, MongoDB, and PostgreSQL.
Knowledge of C#, .NET, PowerShell, Python, or Golang.
Experience with high availability and distributed systems.
Proficiency in Azure DevOps and strong debugging skills across integrated platforms.

Benefits:

Opportunity to work remotely from anywhere in the United States.
Full-time position with a focus on maintaining and improving platform reliability.
Collaborative work environment with cross-functional teams.
Utilization of industry-leading tools and technologies.
Continuous learning and development opportunities in a dynamic environment.
Competitive salary and benefits package.
Equal opportunity employer with a commitment to diversity and inclusion.