Remote LLM Data Engineer | United States | Fully Remote
Description:
We are seeking an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform.
The ideal candidate will be well-versed in the latest Large Language Model (LLM) technologies and have a strong background in data engineering, focusing on Retrieval-Augmented Generation (RAG) and knowledge-base techniques.
This role sits in the AI Center of Excellence (COE) within DX Tech & Digital and reports to the Director, AI Solutions & Development, who oversees the COE.
You will work on highly visible strategic projects, collaborating with cross-functional teams to define requirements and deliver high-quality AI solutions.
Candidates should also have a passion for Generative AI and LLMs and a proven track record of delivering innovative AI applications.
Responsibilities:
Design, implement, and maintain an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes.
Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform.
Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization of both text and non-text data (see the chunk-and-embed sketch after this list).
Benchmark and implement various vector stores, embedding techniques, and retrieval methods.
Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types such as pure vector search and hybrid search (a hybrid-search sketch follows this list).
Implement and maintain auto-tagging systems and data preparation processes for LLMs (see the tagging sketch below).
Develop tools for text and image data crawling, cleaning, and refinement (see the crawler sketch below).
Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models.
Work with data lakehouse architectures to optimize data storage and processing.
Integrate and optimize workflows using Snowflake and various vector store technologies (see the Snowflake sketch below).
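To make the chunk/ingest/vectorize item concrete, here is a minimal sketch of that stage, assuming a fixed-window character chunker and an in-memory index; the hash-based `embed` function is a stand-in for a real embedding model, and all names are illustrative:

```python
import hashlib
import numpy as np

def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash tokens into a fixed-size bag-of-words vector.
    A production pipeline would call a real embedding model here."""
    vec = np.zeros(dim)
    for tok in chunk.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Ingest: chunk each document, embed each chunk, and store it in an
# in-memory "vector store" (a list standing in for a real index).
docs = {"doc-1": "Retrieval-Augmented Generation grounds LLM answers in retrieved context. " * 20}
index: list[tuple[str, np.ndarray]] = []
for doc_id, body in docs.items():
    for n, chunk in enumerate(chunk_text(body)):
        index.append((f"{doc_id}#{n}", embed(chunk)))
print(f"indexed {len(index)} chunks")
```

The overlap keeps boundary text represented in two adjacent chunks, a common way to avoid losing context at chunk edges.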
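For the hybrid-search part of the flexible-pipeline item, a rough sketch that blends cosine similarity over embeddings with plain keyword overlap; the `alpha` weight and the toy corpus are assumptions, and a production system would typically use BM25 and a real vector store instead:

```python
import numpy as np

def vector_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Cosine similarity between two already-normalized vectors."""
    return float(np.dot(query_vec, doc_vec))

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_search(query, query_vec, corpus, alpha=0.6, top_n=3):
    """Blend vector and keyword scores; alpha weights the vector side.
    corpus is a list of (doc_id, text, normalized_vector) triples."""
    scored = [
        (alpha * vector_score(query_vec, vec) + (1 - alpha) * keyword_score(query, text), doc_id)
        for doc_id, text, vec in corpus
    ]
    return sorted(scored, reverse=True)[:top_n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def norm(v): return v / np.linalg.norm(v)
    corpus = [("d1", "vector search with embeddings", norm(rng.normal(size=8))),
              ("d2", "keyword search over text", norm(rng.normal(size=8)))]
    print(hybrid_search("vector search", norm(rng.normal(size=8)), corpus))
```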
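The auto-tagging item could begin as simply as the keyword-driven tagger below; the `TAG_KEYWORDS` taxonomy is hypothetical, and a production system might replace the intersection test with an LLM classifying each chunk against a controlled vocabulary:

```python
# Hypothetical tag taxonomy for illustration only.
TAG_KEYWORDS = {
    "rag": {"retrieval", "rag", "knowledge", "index"},
    "training": {"fine-tuning", "sft", "rlhf", "reward"},
    "infra": {"pipeline", "snowflake", "lakehouse", "warehouse"},
}

def auto_tag(text: str) -> list[str]:
    """Attach every tag whose keyword set intersects the text's tokens."""
    tokens = set(text.lower().split())
    return [tag for tag, kws in TAG_KEYWORDS.items() if tokens & kws]

print(auto_tag("An SFT and RLHF pipeline with a retrieval index"))
# -> ['rag', 'training', 'infra']
```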
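For the crawling-and-cleaning tooling, a stdlib-only sketch that honors robots.txt before fetching, one of the ethical considerations raised in the requirements; the user-agent string is made up, and the regex tag-stripping is a deliberately crude cleaning pass:

```python
import re
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

USER_AGENT = "halo-data-bot"  # hypothetical crawler name

def allowed_by_robots(url: str) -> bool:
    """Check robots.txt before fetching -- a basic ethical/legal guardrail
    for any crawling workflow (rate limiting would be the next one)."""
    root = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(f"{root.scheme}://{root.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch_and_clean(url: str) -> str:
    """Fetch a page and strip tags/whitespace as a crude cleaning pass."""
    if not allowed_by_robots(url):
        raise PermissionError(f"robots.txt disallows {url}")
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)      # drop markup tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```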
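Finally, for the Snowflake item, a minimal read-side sketch using the snowflake-connector-python package; every connection value and the `source_documents` table are placeholders, and credentials would come from a secrets manager in practice:

```python
import snowflake.connector  # pip install snowflake-connector-python

# All connection values and the table name below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",   # hypothetical
    user="my_user",         # hypothetical
    password="...",         # use a secrets manager in practice
    warehouse="COMPUTE_WH",
    database="RAW",
    schema="DOCS",
)
try:
    cur = conn.cursor()
    # Pull a batch of raw documents to feed the chunk/embed stage above.
    cur.execute("SELECT doc_id, body FROM source_documents LIMIT 1000")
    batch = cur.fetchall()
finally:
    conn.close()
```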
Requirements:
A Master's degree in Computer Science, Data Science, or a related field is required.
3-5 years of work experience in data engineering, preferably in AI/ML contexts.
Proficiency in Python and in working with JSON, HTTP, and related technologies is essential.
A strong understanding of LLM architectures, training processes, and data requirements is necessary.
Experience with RAG systems, knowledge base construction, and vector databases is required.
Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts is important.
Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated) is needed.
Knowledge of data crawling techniques and associated ethical considerations is required.
Strong problem-solving skills and the ability to work in a fast-paced, innovative environment are essential.
Familiarity with Snowflake and its integration in AI/ML pipelines is necessary.
Experience with various vector store technologies and their applications in AI is required.
An understanding of data lakehouse concepts and architectures is important.
Excellent communication and collaboration skills are essential.
The ability to translate business needs into technical solutions is required.
A passion for innovation and a commitment to ethical AI development are necessary.
Experience building LLM pipelines using frameworks like LangChain, LlamaIndex, Semantic Kernel, or OpenAI functions is preferred.
Familiarity with LLM sampling parameters such as temperature, top-k, and repeat penalty, as well as with data science metrics and methodologies for evaluating LLM outputs, is important (a toy sampler illustrating these parameters follows this list).
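As a toy illustration of the sampling parameters named above, the sketch below applies a repeat penalty, temperature scaling, and top-k filtering to made-up logits; it is a teaching aid, not any particular model's decoding implementation:

```python
import numpy as np

def sample_next(logits, history, temperature=0.8, top_k=3, repeat_penalty=1.2, rng=None):
    """Toy next-token sampler showing the three parameters in action."""
    rng = rng or np.random.default_rng()
    logits = np.array(logits, dtype=float)
    # Repeat penalty: push down tokens that already appeared in the history.
    for t in set(history):
        logits[t] /= repeat_penalty if logits[t] > 0 else 1 / repeat_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    cutoff = np.sort(logits)[-top_k]
    logits[logits < cutoff] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

vocab = ["the", "cat", "sat", "mat"]
print(vocab[sample_next([2.0, 1.5, 0.5, 0.1], history=[0])])
```

Each knob trades diversity against repetition: temperature and top-k shape how adventurous sampling is, while the repeat penalty discourages the model from echoing tokens it has already emitted.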
Benefits:
US employees receive a comprehensive benefits package.