about the company. Internet
about the team. Data
about the job.
Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
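By way of illustration, the following is a minimal ETL sketch in Python for one stage of such a pipeline; the file names and the id/text record schema are assumptions for the example, not details from this posting.

    import json
    from pathlib import Path

    def extract(path: Path):
        # Ingest: stream raw records from a JSONL dump (one document per line).
        with path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    def transform(record: dict) -> dict:
        # Transform: normalize whitespace and keep only the fields training needs.
        text = " ".join(record.get("text", "").split())
        return {"id": record.get("id"), "text": text}

    def load(records, out_path: Path) -> None:
        # Load: write cleaned records to a new JSONL file for downstream steps.
        with out_path.open("w", encoding="utf-8") as f:
            for r in records:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        # Assumed input/output locations; a real pipeline would parameterize these.
        load((transform(r) for r in extract(Path("raw_corpus.jsonl"))),
             Path("clean_corpus.jsonl"))

Streaming with generators keeps memory flat, which is what lets the same shape of code scale from samples to full corpora.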
High-Quality Dataset Creation & Curation:
Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora (see the sketch after this group).
Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
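To make the cleaning bullet above concrete, here is a minimal sketch of exact-match deduplication plus regex-based PII masking. The regexes and placeholder tokens are illustrative assumptions; production systems usually add fuzzy deduplication (e.g., MinHash) and dedicated PII detectors.

    import hashlib
    import re

    # Illustrative patterns only; real PII masking covers many more entity types.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

    def mask_pii(text: str) -> str:
        # Replace emails and phone-like spans with placeholder tokens.
        text = EMAIL_RE.sub("[EMAIL]", text)
        return PHONE_RE.sub("[PHONE]", text)

    def dedup_and_mask(docs):
        # Exact deduplication: drop documents whose normalized text repeats.
        seen = set()
        for doc in docs:
            key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield mask_pii(doc)

    if __name__ == "__main__":
        corpus = ["Contact me at a@b.com", "contact me at a@b.com ", "Clean text."]
        print(list(dedup_and_mask(corpus)))  # two docs survive; the email is masked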
Data Job Management:
Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle (see the sketch below).
Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
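For the versioning item above, one widely used option is DVC. A minimal sketch, assuming the repository already tracks data/corpus.jsonl with DVC (dvc add data/corpus.jsonl, then commit) and tags the revision v1.0; both names are assumptions for the example:

    import dvc.api

    # Read one specific, reproducible revision of a DVC-tracked dataset.
    with dvc.api.open("data/corpus.jsonl", rev="v1.0", encoding="utf-8") as f:
        for line in f:
            ...  # feed the exact versioned data into a training or eval job

Pinning rev is what makes a run reproducible: the dataset a job saw is recoverable from the tag alone.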
Data Infrastructure & Orchestration:
Build and maintain scalable data warehouses and data lakes designed specifically for LLM data in both on-premises and public cloud environments.
Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
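As one example of such orchestration, here is a minimal Airflow DAG (Airflow 2.x API) chaining three dataset-preparation steps; the DAG id, schedule, and step bodies are assumptions for illustration:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical step functions; each would call into real pipeline code.
    def ingest(): ...
    def clean(): ...
    def tokenize(): ...

    with DAG(
        dag_id="llm_dataset_prep",     # assumed name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ):
        t1 = PythonOperator(task_id="ingest", python_callable=ingest)
        t2 = PythonOperator(task_id="clean", python_callable=clean)
        t3 = PythonOperator(task_id="tokenize", python_callable=tokenize)
        t1 >> t2 >> t3  # run ingest, then clean, then tokenize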
...
skills and experience required.
Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective (see the sketch after this list).
Experience designing and implementing data annotation workflows and pipelines.
Strong proficiency in Python and extensive experience with its data ecosystem.
Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.
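On the data-loading point above, a minimal sketch of streaming a JSONL corpus into PyTorch without loading it into memory; the file name and "text" field are assumptions, and multi-worker sharding is omitted for brevity:

    import json
    from torch.utils.data import DataLoader, IterableDataset

    class JsonlTextDataset(IterableDataset):
        # Streams records one by one, so corpus size is not bounded by RAM.
        def __init__(self, path: str):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)["text"]

    if __name__ == "__main__":
        loader = DataLoader(JsonlTextDataset("clean_corpus.jsonl"), batch_size=8)
        for batch in loader:
            ...  # batch is a list of strings, ready for tokenization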
about the company. Internet
about the team. Data
about the job.
1. Work with stakeholders to understand their business needs and identify opportunities for analytical insight that informs their decision-making.
2. Understand business performance and report on it to the team and senior management.
3. Own data visualization product management, including product requirement management and solution design, coordinating with the product team on developing and releasing solutions.
4. Develop and publish visualization dashboards, ensuring recurring reports/dashboards are delivered on time and error-free; take ownership of the relevant business domains.
5. Encourage and train department members to adopt best practices in data usage, analytical techniques, and interpretation.
skills and experience required.
1. Bachelor's degree (with strong academic results) in Computer Science, Engineering, Mathematics, Statistics, Data Science, or a related field with 8+ years of work experience, or an equivalent Master's degree with 5+ years of experience in data visualization or data product management.
2. Strong written and spoken English with fluent communication; overseas living experience is a plus.
3. Proficient in SQL/MySQL/BQ (Google BigQuery).
4. Experience with Tableau, Pow