about the company. Internet
about the team. Data
about the job.
Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
High-Quality Dataset Creation & Curation:
Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora (see the sketch after this list).
Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
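To make the curation work above concrete, here is a minimal Python sketch of two of the named steps: exact deduplication by content hash and regex-based PII masking. The regexes and toy corpus are illustrative assumptions, not a production-grade PII detector.

```python
import hashlib
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def dedup_and_clean(docs):
    """Yield cleaned documents, dropping exact duplicates by SHA-256 of the text."""
    seen = set()
    for doc in docs:
        cleaned = mask_pii(doc.strip())
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield cleaned

if __name__ == "__main__":
    corpus = [
        "Contact me at jane@example.com or +1 555 123 4567.",
        "Contact me at jane@example.com or +1 555 123 4567.",  # exact duplicate
        "A clean, unique document.",
    ]
    for line in dedup_and_clean(corpus):
        print(line)
```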
Data Job Management:
Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle (see the fingerprinting sketch after this list).
Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
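As a hedged illustration of the versioning and lineage point above, the following Python sketch derives a dataset version id by content-hashing a directory and appends a simple lineage record. The file layout and the lineage.json manifest are assumptions; dedicated tools (DVC, LakeFS) handle this far more robustly.

```python
import hashlib
import json
from pathlib import Path

def fingerprint_dataset(data_dir: str) -> str:
    """Hash relative paths and file bytes in a stable order to get a version id."""
    digest = hashlib.sha256()
    root = Path(data_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()[:16]

def record_lineage(data_dir: str, parent_versions, manifest: str = "lineage.json") -> str:
    """Append a record linking this dataset version to the versions it was built from."""
    version = fingerprint_dataset(data_dir)
    records = json.loads(Path(manifest).read_text()) if Path(manifest).exists() else []
    records.append({"version": version, "parents": list(parent_versions)})
    Path(manifest).write_text(json.dumps(records, indent=2))
    return version
```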
Data Infrastructure & Orchestration:
Build and maintain scalable data warehouses and data lakes specifically designed for LLM data on both on-premise and public cloud environments.
Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate complex data workflows for LLM dataset preparation; a minimal DAG sketch follows.
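A minimal sketch of what such an orchestrated workflow might look like, assuming Airflow 2.x with the TaskFlow API (2.4+ for the schedule argument). The task bodies are placeholders standing in for real ingest/clean/tokenize logic.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def llm_dataset_prep():
    @task
    def ingest() -> str:
        # Pull raw text from the source store; the path is a placeholder.
        return "s3://bucket/raw/corpus"

    @task
    def clean(raw_path: str) -> str:
        # Deduplicate, mask PII, normalize encoding (placeholder).
        return raw_path.replace("/raw/", "/clean/")

    @task
    def tokenize(clean_path: str) -> str:
        # Tokenize and shard for training (placeholder).
        return clean_path.replace("/clean/", "/tokenized/")

    tokenize(clean(ingest()))

llm_dataset_prep()
```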
skills and experience required.
Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective (see the sketch after this list).
Experience designing and implementing data annotation workflows and pipelines.
Strong proficiency in Python and extensive experience with its data ecosystem (e.g., Pandas, NumPy, PySpark).
Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.
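For the data-loading requirement above, a minimal PyTorch sketch of wrapping pre-tokenized text as a Dataset and batching it with a DataLoader. The toy token ids, pad id of 0, and fixed sequence length are assumptions for illustration; a real pipeline would stream pre-tokenized shards.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TokenizedTextDataset(Dataset):
    """Serves pre-tokenized samples as fixed-length LongTensors (pad id 0 assumed)."""

    def __init__(self, token_ids, seq_len: int = 8):
        # Truncate or right-pad every sample to seq_len.
        self.samples = [
            ids[:seq_len] + [0] * max(0, seq_len - len(ids)) for ids in token_ids
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return torch.tensor(self.samples[idx], dtype=torch.long)

if __name__ == "__main__":
    toy_corpus = [[101, 7, 8, 9, 102], [101, 42, 102]]  # toy token ids
    loader = DataLoader(TokenizedTextDataset(toy_corpus), batch_size=2, shuffle=True)
    for batch in loader:
        print(batch.shape)  # torch.Size([2, 8])
```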