Site Reliability Engineer (Internal Engineering) (Remote) at KnowBe4

Description:

The Internal Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of internal systems and infrastructure.
This role involves monitoring, automation, incident management, and maintaining self-hosted platforms to support smooth development operations.
The Internal SRE works closely with cross-functional teams to manage GitLab CI/CD workflows and cloud infrastructure on AWS.
The position emphasizes proactive problem-solving, automation, and collaboration to continuously improve system stability and efficiency.
Responsibilities include managing and maintaining GitLab environments to ensure high availability and security.
The SRE will design and implement CI/CD pipelines to automate software delivery.
The SRE monitors and troubleshoots system performance issues using observability tools such as Prometheus, Grafana, or Datadog.
Collaboration with development teams to align infrastructure efforts with project needs and timelines is essential.
The role involves building and maintaining infrastructure as code (IaC) solutions using tools like Terraform and Ansible.
Managing AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, and VPC, is part of the job.
Participation in incident response, including root cause analysis and post-incident reviews, is expected.
Automating manual tasks to improve operational efficiency and reduce technical debt is a key responsibility.

A Bachelor’s degree in Computer Science, Information Technology, or a related field is required, although equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
Experience managing and securing self-hosted GitLab environments is necessary.
Expertise in designing and maintaining automated pipelines for continuous delivery is required.
Strong knowledge of AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, VPC, and Lambda, is essential.
Proficiency in Terraform, Ansible, or similar tools for Infrastructure-as-Code is required.
Experience with Prometheus, Grafana, Datadog, or other observability platforms is necessary.
Proficiency in Python, Bash, or other scripting languages to automate tasks is required.
The ability to lead incident response efforts and conduct root cause analysis is essential.
Strong interpersonal skills to work effectively across teams and with stakeholders are required.

KnowBe4 has been recognized for four consecutive years as a best place to work for women, for millennials, and in technology.
The company has been certified as a "Great Place To Work" in eight countries.
Employees enjoy a welcoming workplace that encourages them to be themselves.
The company promotes continuous professional development and radical transparency.
There are opportunities for team engagement through activities like team lunches, trivia competitions, and local outings.