about the company. Internet
about the team. Data
about the job.
Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
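By way of illustration, the following is a minimal ETL sketch in Python for one stage of such a pipeline; the file names and the id/text record schema are assumptions for the example, not details from this posting.

    import json
    from pathlib import Path

    def extract(path: Path):
        # Ingest: stream raw records from a JSONL dump (one document per line).
        with path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    def transform(record: dict) -> dict:
        # Transform: normalize whitespace and keep only the fields training needs.
        text = " ".join(record.get("text", "").split())
        return {"id": record.get("id"), "text": text}

    def load(records, out_path: Path) -> None:
        # Load: write cleaned records to a new JSONL file for downstream steps.
        with out_path.open("w", encoding="utf-8") as f:
            for r in records:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        # Assumed input/output locations; a real pipeline would parameterize these.
        load((transform(r) for r in extract(Path("raw_corpus.jsonl"))),
             Path("clean_corpus.jsonl"))

Streaming with generators keeps memory flat, which is what lets the same shape of code scale from samples to full corpora.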
High-Quality Dataset Creation & Curation:
Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora (see the sketch after this group).
Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
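To make the cleaning bullet above concrete, here is a minimal sketch of exact-match deduplication plus regex-based PII masking. The regexes and placeholder tokens are illustrative assumptions; production systems usually add fuzzy deduplication (e.g., MinHash) and dedicated PII detectors.

    import hashlib
    import re

    # Illustrative patterns only; real PII masking covers many more entity types.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

    def mask_pii(text: str) -> str:
        # Replace emails and phone-like spans with placeholder tokens.
        text = EMAIL_RE.sub("[EMAIL]", text)
        return PHONE_RE.sub("[PHONE]", text)

    def dedup_and_mask(docs):
        # Exact deduplication: drop documents whose normalized text repeats.
        seen = set()
        for doc in docs:
            key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield mask_pii(doc)

    if __name__ == "__main__":
        corpus = ["Contact me at a@b.com", "contact me at a@b.com ", "Clean text."]
        print(list(dedup_and_mask(corpus)))  # two docs survive; the email is masked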
Data Job Management:
Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle (see the sketch below).
Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
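For the versioning item above, one widely used option is DVC. A minimal sketch, assuming the repository already tracks data/corpus.jsonl with DVC (dvc add data/corpus.jsonl, then commit) and tags the revision v1.0; both names are assumptions for the example:

    import dvc.api

    # Read one specific, reproducible revision of a DVC-tracked dataset.
    with dvc.api.open("data/corpus.jsonl", rev="v1.0", encoding="utf-8") as f:
        for line in f:
            ...  # feed the exact versioned data into a training or eval job

Pinning rev is what makes a run reproducible: the dataset a job saw is recoverable from the tag alone.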
Data Infrastructure & Orchestration:
Build and maintain scalable data warehouses and data lakes designed specifically for LLM data in both on-premises and public cloud environments.
Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
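As one example of such orchestration, here is a minimal Airflow DAG (Airflow 2.x API) chaining three dataset-preparation steps; the DAG id, schedule, and step bodies are assumptions for illustration:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical step functions; each would call into real pipeline code.
    def ingest(): ...
    def clean(): ...
    def tokenize(): ...

    with DAG(
        dag_id="llm_dataset_prep",     # assumed name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ):
        t1 = PythonOperator(task_id="ingest", python_callable=ingest)
        t2 = PythonOperator(task_id="clean", python_callable=clean)
        t3 = PythonOperator(task_id="tokenize", python_callable=tokenize)
        t1 >> t2 >> t3  # run ingest, then clean, then tokenize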
...
skills and experience required.
Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective (see the sketch after this list).
Experience designing and implementing data annotation workflows and pipelines.
Strong proficiency in Python and extensive experience with its data ecosystem.
Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.
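On the data-loading point above, a minimal sketch of streaming a JSONL corpus into PyTorch without loading it into memory; the file name and "text" field are assumptions, and multi-worker sharding is omitted for brevity:

    import json
    from torch.utils.data import DataLoader, IterableDataset

    class JsonlTextDataset(IterableDataset):
        # Streams records one by one, so corpus size is not bounded by RAM.
        def __init__(self, path: str):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)["text"]

    if __name__ == "__main__":
        loader = DataLoader(JsonlTextDataset("clean_corpus.jsonl"), batch_size=8)
        for batch in loader:
            ...  # batch is a list of strings, ready for tokenization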
about the company. Internet
about the team. Data
about the job.
1. Work with stakeholders to understand their business needs and identify opportunities for analytical insight that informs their decision-making.
2. Understand business performance and report on it to the team and senior management.
3. Own data visualization product management, including product requirement management and solution design, coordinating with the product team on developing and releasing solutions.
4. Develop and publish visualization dashboards, ensuring recurring reports/dashboards are delivered on time and error-free; take ownership of the relevant business domains.
5. Encourage and train department members to adopt best practices in data usage, analytical techniques, and interpretation.
skills and experience required.
1. Bachelor's degree (with strong academic results) in Computer Science, Engineering, Mathematics, Statistics, Data Science, or a related field with 8+ years of work experience, or an equivalent Master's degree with 5+ years of experience in data visualization or data product management.
2. Strong written and spoken English with fluent communication; overseas living experience is a plus.
3. Proficient in SQL/MySQL/BQ (Google BigQuery).
4. Experience with Tableau, Pow