Remote Site Reliability Engineer (SRE) (The Uptime Guardian) at Unreal Gigs

Description:

The Site Reliability Engineer (SRE) at our client will focus on building and maintaining highly reliable, scalable infrastructure supporting products and services.
Responsibilities include system monitoring, incident management, automation, infrastructure as code, high availability, performance optimization, disaster recovery, collaboration with development teams, on-call responsibilities, and capacity planning.
The role spans software engineering and operations and calls for strong problem-solving across the tech stack.

Requirements:

System Reliability and Automation Expertise: Experience in building and maintaining reliable systems, automating infrastructure using tools like Terraform, Ansible, or Puppet, and optimizing systems for uptime and performance.
Monitoring and Incident Management: Proficiency in setting up and managing monitoring, logging, and alerting systems such as Prometheus, Grafana, or the ELK Stack, with experience in incident management and problem resolution.
Cloud Infrastructure Management: Hands-on experience managing cloud infrastructure on platforms such as AWS, GCP, or Azure, deploying and maintaining scalable systems in the cloud.
Performance Optimization: Expertise in optimizing systems for low latency, high throughput, and minimal downtime, including an understanding of load balancing, caching strategies, and database performance optimization.
Security and Compliance: Understanding of security best practices, encryption, and compliance requirements such as SOC 2 or GDPR, ensuring secure systems while maintaining reliability.
Educational Requirements: Bachelor’s degree in Computer Science, Systems Engineering, or a related field, or equivalent experience in site reliability engineering, systems administration, or DevOps. Certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or SRE Practitioner are a plus.
Experience Requirements: 3+ years of experience in site reliability engineering or a similar role, focusing on system automation, performance optimization, and cloud infrastructure management. Hands-on experience with Docker and Kubernetes, and with managing large-scale distributed systems.

Benefits:

Health and Wellness: Comprehensive medical, dental, and vision insurance plans with low co-pays and premiums.
Paid Time Off: Competitive vacation, sick leave, and 20 paid holidays per year.
Work-Life Balance: Flexible work schedules and telecommuting options.
Professional Development: Opportunities for training, certification reimbursement, and career advancement programs.
Wellness Programs: Access to gym memberships, health screenings, and mental health resources.
Life and Disability Insurance: Coverage for life insurance and short-term/long-term disability.
Employee Assistance Program (EAP): Confidential counseling and support services for personal and professional challenges.
Tuition Reimbursement: Financial assistance for continuing education and professional development.
Community Engagement: Opportunities for community service and volunteer activities.
Recognition Programs: Employee recognition programs to celebrate achievements and milestones.