Remote LLM Data Engineer | United States | Fully Remote

Please, let Halo Media know you found this job on RemoteYeah. This helps us grow 🌱.

Description:

  • We are seeking an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform.
  • The ideal candidate will be well-versed in the latest Large Language Model (LLM) technologies and have a strong background in data engineering, focusing on Retrieval-Augmented Generation (RAG) and knowledge-base techniques.
  • This role sits in the AI COE within DX Tech & Digital and reports to the Director, AI Solutions & Development, who oversees the AI COE.
  • You will work on highly visible strategic projects, collaborating with cross-functional teams to define requirements and deliver high-quality AI solutions.
  • The ideal candidate will have a passion for Generative AI and LLMs, with a proven track record of delivering innovative AI applications.
  • Responsibilities include designing, implementing, and maintaining an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes.
  • You will identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform.
  • Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data.
  • Benchmark and implement various vector stores, embedding techniques, and retrieval methods.
  • Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types (e.g., vector search, hybrid search).
  • Implement and maintain auto-tagging systems and data preparation processes for LLMs.
  • Develop tools for text and image data crawling, cleaning, and refinement.
  • Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models.
  • Work with data lakehouse architectures to optimize data storage and processing.
  • Integrate and optimize workflows using Snowflake and various vector store technologies.

Requirements:

  • A Master's degree in Computer Science, Data Science, or a related field is required.
  • 3-5 years of work experience in data engineering, preferably in AI/ML contexts.
  • Proficiency in Python, JSON, HTTP, and related tools is essential.
  • A strong understanding of LLM architectures, training processes, and data requirements is necessary.
  • Experience with RAG systems, knowledge base construction, and vector databases is required.
  • Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts is important.
  • Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated) is needed.
  • Knowledge of data crawling techniques and associated ethical considerations is required.
  • Strong problem-solving skills and the ability to work in a fast-paced, innovative environment are essential.
  • Familiarity with Snowflake and its integration in AI/ML pipelines is necessary.
  • Experience with various vector store technologies and their applications in AI is required.
  • An understanding of data lakehouse concepts and architectures is important.
  • Excellent communication, collaboration, and problem-solving skills are essential.
  • The ability to translate business needs into technical solutions is required.
  • A passion for innovation and a commitment to ethical AI development are necessary.
  • Experience building LLM pipelines using frameworks like LangChain, LlamaIndex, Semantic Kernel, or OpenAI functions is preferred.
  • Familiarity with LLM parameters such as temperature, top-k, and repeat penalty, as well as with metrics and methodologies for evaluating LLM output, is important.

Benefits:

  • US employees benefit from a comprehensive benefits package.