Ververica is seeking a Site Reliability Engineer (SRE) contractor to design, provision, and maintain the infrastructure for its Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure.
The role involves collaborating with software engineering teams to enhance feature delivery, optimize performance, and address security vulnerabilities.
Key responsibilities include building and maintaining infrastructure, managing Infrastructure as Code (IaC) using Terraform, implementing observability tooling, ensuring system reliability, improving infrastructure architecture, enhancing CI/CD pipelines, monitoring security vulnerabilities, contributing to product development, participating in on-call rotations, and maintaining documentation.
Requirements:
A Bachelor's degree in Computer Science, Information Technology, or a related field is required.
Candidates must have a minimum of 2 years of hands-on experience with Kubernetes clusters, Helm charts, controllers, and operators.
Proficiency in designing and maintaining Terraform code that follows best practices is essential.
Strong knowledge of observability tools and practices, including metrics, logging, and alerting systems, is required.
Experience implementing SRE principles such as SLIs, SLOs, and error budgets is necessary.
A solid understanding of Linux systems and networking in cloud environments is required.
Hands-on experience managing multiple Kubernetes clusters is essential.
Familiarity with distributed systems or streaming data platforms is preferred.
Knowledge of cloud-native security best practices is required.
Benefits:
The position offers the opportunity to work with cutting-edge technology in real-time data processing and analytics.
Contractors will collaborate with a team of experts in the field, which supports their professional development.
The role includes the flexibility of working across multiple cloud platforms, providing diverse experience.
Participation in on-call rotations provides hands-on experience managing incidents on live infrastructure that runs 24/7.
The position supports continuous learning and improvement through architectural enhancements and best practices in reliability.