Remote Contract - Site Reliability Engineer (m/f/d)

at Ververica GmbH

Description:

  • Ververica is seeking a Site Reliability Engineer (SRE) contractor to design, provision, and maintain the infrastructure for its Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure.
  • The role involves collaborating with software engineering teams to enhance feature delivery, optimize performance, and address security vulnerabilities.
  • Key responsibilities include building and maintaining infrastructure, managing Infrastructure as Code (IaC) using Terraform, implementing observability tooling, ensuring system reliability, improving infrastructure architecture, enhancing CI/CD pipelines, monitoring security vulnerabilities, contributing to product development, participating in on-call rotations, and maintaining documentation.

Requirements:

  • A Bachelor's degree in Computer Science, Information Technology, or a related field is required.
  • Candidates must have a minimum of 2 years of hands-on experience with Kubernetes clusters, Helm charts, controllers, and operators.
  • Proficiency in designing and maintaining Terraform code according to best practices is essential.
  • Strong knowledge of observability tools and practices, including metrics, logging, and alerting systems, is required.
  • Experience implementing SRE principles such as SLIs, SLOs, and error budgets is necessary.
  • A solid understanding of Linux systems and networking in cloud environments is required.
  • Hands-on experience managing multiple Kubernetes clusters is essential.
  • Familiarity with distributed systems or streaming data platforms is preferred.
  • Knowledge of cloud-native security best practices is required.

Benefits:

  • The position offers the opportunity to work with cutting-edge technology in real-time data processing and analytics.
  • Contractors will have the chance to collaborate with a team of experts in the field, enhancing their professional development.
  • The role includes the flexibility of working across multiple cloud platforms, providing diverse experience.
  • Participation in on-call rotations provides hands-on experience managing incidents in live, 24/7 infrastructure.
  • The position supports continuous learning and improvement through architectural enhancements and best practices in reliability.