about the company.
Internet
about the team.
Data
...
about the job.
- Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
- High-Quality Dataset Creation & Curation:
- Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora (a minimal cleaning sketch follows this list).
- Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
- Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
- Data Lifecycle Management & Optimization:
- Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle.
- Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
- Data Infrastructure & Orchestration:
- Build and maintain scalable data warehouses and data lakes designed specifically for LLM data, in both on-premises and public cloud environments.
- Implement and operate data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation (see the orchestration sketch after this list).
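To make the cleaning and preprocessing duties above concrete, here is a minimal sketch of exact deduplication plus PII masking over a text corpus, using only the Python standard library. The regexes, placeholder tokens, and function names are illustrative assumptions, not a production recipe; real pipelines would add near-duplicate detection (e.g., MinHash) and scale-out execution.

```python
# A minimal sketch of exact deduplication plus PII masking for a text corpus.
# Regexes and placeholder tokens are illustrative assumptions only.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask_pii(text: str) -> str:
    # Replace common PII patterns with placeholder tokens.
    return PHONE.sub("<PHONE>", EMAIL.sub("<EMAIL>", text))

def dedup_and_mask(lines):
    # Exact dedup via a content hash of each normalized line;
    # near-dedup would slot in alongside this step.
    seen = set()
    for line in lines:
        key = hashlib.sha256(line.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield mask_pii(line)
```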
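And here is a minimal sketch of how such steps might be wired together with one of the orchestrators named above, assuming Apache Airflow 2.x and its TaskFlow API. The DAG id, paths, and task bodies are hypothetical placeholders; a Prefect or Dagster equivalent would follow the same shape.

```python
# A minimal sketch of an orchestrated LLM dataset-prep workflow.
# Assumes Apache Airflow 2.x (TaskFlow API); DAG id, paths, and task
# bodies are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False, tags=["llm-data"])
def llm_dataset_prep():
    @task
    def ingest(source_uri: str) -> str:
        # Pull raw text shards into staging storage; return the staging path.
        return "/staging/raw"

    @task
    def deduplicate(path: str) -> str:
        # Drop exact/near duplicates (e.g., the hashing sketch above).
        return "/staging/dedup"

    @task
    def mask_pii(path: str) -> str:
        # Detect and mask personally identifiable information.
        return "/staging/masked"

    @task
    def tokenize_and_pack(path: str) -> str:
        # Tokenize text and pack fixed-length sequences for pre-training.
        return "/curated/tokens"

    tokenize_and_pack(mask_pii(deduplicate(ingest("s3://raw-corpus"))))

llm_dataset_prep()
```

With TaskFlow, each task's return value is passed to the next task via XCom, so the nested calls on the last line of the DAG define the dependency graph that Airflow schedules.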
skills and experience required.
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
- Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
- Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
- Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective (see the streaming-loader sketch after this list).
- Experience designing and implementing data annotation workflows and pipelines.
- Strong proficiency in Python and extensive experience with its data ecosystem.
- Proficiency in SQL and a solid understanding of data warehousing concepts, data modeling, and schema design.
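Since the role touches deep learning frameworks from a data loading perspective, here is a minimal PyTorch sketch of streaming a sharded text corpus without loading it into memory. The shard layout and the whitespace "tokenizer" are placeholder assumptions, not a real tokenization scheme.

```python
# A minimal sketch of streaming a sharded text corpus into PyTorch without
# loading it into memory. Shard paths and the whitespace "tokenizer" are
# placeholder assumptions.
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedTextDataset(IterableDataset):
    def __init__(self, shard_paths, tokenize):
        self.shard_paths = shard_paths  # e.g., a list of .txt / .jsonl shards
        self.tokenize = tokenize        # callable mapping a line of text to tokens

    def __iter__(self):
        info = get_worker_info()
        # Give each DataLoader worker a disjoint subset of shards.
        paths = (self.shard_paths if info is None
                 else self.shard_paths[info.id::info.num_workers])
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield self.tokenize(line.rstrip("\n"))

# batch_size=None disables automatic batching; collation and sequence
# packing would normally happen in a downstream step.
loader = DataLoader(ShardedTextDataset(["shard-000.txt"], str.split),
                    batch_size=None, num_workers=2)
```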