Remote Senior Staff Machine Learning Engineer - DevOps/Site Reliability Engineer

at ServiceNow

Posted 12 hours ago, 1 applied

Description:

  • The position requires passing a ServiceNow background screening, which includes a credit check, criminal/misdemeanor check, and a drug test. Employment is contingent upon passing this screening.
  • The role is within PLATO (Platform Engineering and AI Technology Organization) at ServiceNow, which focuses on building intelligent software to enhance customer work experiences.
  • As a Senior Staff Machine Learning Engineer - Site Reliability Engineer, you will contribute to the design, development, and implementation of infrastructure, platform, deployment, and observability features for AI workloads.
  • You will collaborate with researchers, AI engineers, and infrastructure teams to ensure efficient performance, scalability, and reliability of GPU clusters.
  • The role involves continuous improvement of the SRE practice by transforming operational use cases into software tooling requirements.
  • You will execute deployment and support activities for AI/ML developers and build high-quality, clean, scalable, and reusable code by enforcing best practices in software engineering.
  • You will work with product owners to understand detailed requirements and take ownership of your code from design to delivery.
  • Experience operating large language models (LLMs) on NVIDIA GPUs is required.
  • You will also mentor colleagues and promote knowledge-sharing within the team.

Requirements:

  • You must have experience in integrating AI into work processes, decision-making, or problem-solving, including using AI-powered tools and automating workflows.
  • Proficiency in prompt engineering and developing LLM-based features is required.
  • You should have experience with training and fine-tuning large language models using methods such as distillation, supervised fine-tuning, and policy optimization.
  • Familiarity with AI productivity tools like Cursor and Windsurf is necessary.
  • A minimum of 8 years of experience in infrastructure and platform operations, deployments, SRE, and DevOps is required, with a focus on improving platform health.
  • You should have at least 6 years of experience operating highly available distributed workloads on Kubernetes using a DevOps approach.
  • A minimum of 6 years of development experience with programming languages such as Python, GoLang, or Java is required.
  • Experience with DevOps tooling, including Helm, Ansible, Kubernetes, Prometheus, Splunk, and GitLab CI, is necessary.
  • Strong working experience with distributed systems built on Linux and J2EE is required.
  • You should have experience with software-defined networking, infrastructure as code, and configuration management.
  • Experience in building software for compliance and security in regulated environments is necessary.
  • You must have the ability to drive outcomes in projects with significant technical risk.

Benefits:

  • The position offers a base pay range of $197,800 - $346,200, plus equity (when applicable), variable/incentive compensation, and benefits.
  • Health plans are provided, including flexible spending accounts.
  • A 401(k) Plan with company match is available.
  • Employees can participate in an Employee Stock Purchase Plan (ESPP), and matching donations are offered.
  • A flexible time away plan and family leave programs are offered to support work-life balance.
  • Compensation is based on geographic location and may vary with qualifications, skill level, and competencies.