As a Pre-Training Data Engineer at Cohere, you will be responsible for developing the data infrastructure that supports advanced language models.
Your role includes end-to-end management of training data, which involves ingestion, cleaning, filtering, and optimization.
You will work with diverse data sources such as web data, code data, multilingual corpora, and synthetic data to ensure their quality, diversity, and reliability.
You will design and implement scalable and robust pipelines for data processing and conduct data ablations to evaluate data quality.
Experimenting with data mixtures to enhance model performance will also be part of your responsibilities.
Your work will bridge the gap between raw data and cutting-edge AI models, contributing to improvements in training metrics like throughput and accelerator utilization.
This position is remote-friendly, with no restrictions on location, and you will collaborate with cross-functional teams to meet the data needs of language models.
Requirements:
Strong software engineering skills are required, with proficiency in Python and experience in building data pipelines.
Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools is necessary.
Experience working with large-scale datasets, including web data, code data, and multilingual corpora, is essential.
Knowledge of data quality assessment techniques and experience experimenting with data mixtures are required.
A passion for bridging research and engineering to solve complex data-related challenges in AI model training is important.
Published papers at top-tier venues such as NeurIPS, ICML, or ICLR are a bonus.
Benefits:
Cohere offers an open and inclusive culture and work environment.
Employees work closely with a team on the cutting edge of AI research.
A weekly lunch stipend, in-office lunches, and snacks are provided.
Full health and dental benefits are included, along with a separate budget for mental health care.
Employees based in Canada, the US, and the UK receive a 100% parental leave top-up for 6 months.
Personal enrichment benefits are available for arts and culture, fitness and well-being, quality time, and workspace improvement.
The position is remote-flexible, with offices in Toronto, New York, San Francisco, and London, along with a co-working stipend.