The position requires passing a ServiceNow background screening, which includes a credit check, criminal/misdemeanor check, and a drug test. Employment is contingent upon passing this screening.
The role is within PLATO (Platform Engineering and AI Technology Organization) at ServiceNow, which focuses on building intelligent software to enhance customer work experiences.
As a Senior Staff Machine Learning Engineer - Site Reliability Engineer, you will contribute to the design, development, and implementation of infrastructure, platform, deployment, and observability features for AI workloads.
You will collaborate with researchers, AI engineers, and infrastructure teams to ensure efficient performance, scalability, and reliability of GPU clusters.
The role involves continuous improvement of the SRE practice by transforming operational use cases into software tooling requirements.
You will carry out deployment and support activities for AI/ML developers and write high-quality, clean, scalable, and reusable code that follows software engineering best practices.
You will work with product owners to understand detailed requirements and take ownership of your code from design to delivery.
Experience operating large language models (LLMs) on NVIDIA GPUs is required.
You will also mentor colleagues and promote knowledge-sharing within the team.
Requirements:
You must have experience in integrating AI into work processes, decision-making, or problem-solving, including using AI-powered tools and automating workflows.
Proficiency in prompt engineering and developing LLM-based features is required.
You should have experience with training and fine-tuning large language models using methods such as distillation, supervised fine-tuning, and policy optimization.
Familiarity with AI productivity tools like Cursor and Windsurf is necessary.
A minimum of 8 years of experience in infrastructure and platform operations, deployments, SRE, and DevOps is required, with a focus on improving platform health.
You should have at least 6 years of experience operating highly-available distributed workloads on Kubernetes using a DevOps approach.
A minimum of 6 years of development experience with programming languages such as Python, Go, or Java is required.
Experience with DevOps tooling, including Helm, Ansible, Kubernetes, Prometheus, Splunk, and GitLab CI, is necessary.
Strong working experience with distributed systems built on Linux and J2EE is required.
You should have experience with software-defined networking, infrastructure as code, and configuration management.
Experience in building software for compliance and security in regulated environments is necessary.
You must have the ability to drive outcomes in projects with significant technical risk.
Benefits:
The position offers a base pay range of $197,800 - $346,200, plus equity (when applicable), variable/incentive compensation, and benefits.
Health plans are provided, including flexible spending accounts.
A 401(k) Plan with company match is available.
Employees can participate in an Employee Stock Purchase Plan (ESPP) and a matching donations program.
A flexible time away plan and family leave programs are offered to support work-life balance.
Compensation is based on the geographic location of the role and may vary with qualifications, skill level, and competencies.