about the company.
The company serves almost 100 million customers in Japan and 1 billion globally, offering more than 70 services across e-commerce, payment services, financial services, telecommunications, media, sports, and more.
about the team.
AI & Data Division (AIDD)
about the job.
- Optimize LLM training frameworks (e.g., PyTorch, DeepSpeed, Megatron-LM, FSDP) to maximize GPU utilization and reduce training time.
- Profile and resolve distributed training bottlenecks (e.g., NCCL issues, inefficient CUDA kernels, communication overhead); see the profiling sketch after this list.
- Implement and tune inference optimizations (e.g., quantization, dynamic batching, KV caching) for low-latency, high-throughput LLM serving with engines such as vLLM, TensorRT-LLM, Triton, and SGLang; see the KV-cache sketch below.
- Collaborate with infrastructure teams to improve GPU cluster scheduling, resource allocation, and fault tolerance for large-scale training jobs.
- Develop benchmarking tools to measure and improve training throughput, memory efficiency, and inference latency; see the benchmarking sketch below.
- Research and apply cutting-edge techniques (e.g., mixture-of-experts, speculative decoding) to optimize LLM performance.
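To give a flavor of the profiling work above, here is a minimal sketch of profiling a single training step with torch.profiler; the Linear layer and tensor sizes are placeholder stand-ins for a real LLM training step, not the team's actual stack.

```python
# Minimal sketch: profile one training step to find GPU hotspots.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()   # placeholder for a real LLM
optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Sort by GPU time to surface kernel-level bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```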
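For the KV-caching bullet, below is a minimal single-head, unbatched sketch of caching past keys and values during autoregressive decoding, so each step attends over the cached prefix instead of recomputing it. Production engines such as vLLM use paged caches and fused kernels; this is purely illustrative.

```python
# Minimal sketch: KV cache for autoregressive decoding (single head, no batch).
import torch
import torch.nn.functional as F

d = 64
wq, wk, wv = (torch.randn(d, d) for _ in range(3))  # toy projection weights
k_cache, v_cache = [], []                           # grows by one entry per step

def decode_step(x):
    """x: (d,) hidden state of the newest token."""
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache.append(k)                               # reuse past K/V instead of
    v_cache.append(v)                               # recomputing the full prefix
    K = torch.stack(k_cache)                        # (t, d)
    V = torch.stack(v_cache)                        # (t, d)
    attn = F.softmax(q @ K.T / d**0.5, dim=-1)      # (t,) attention weights
    return attn @ V                                 # (d,) attention output

for _ in range(5):                                  # decode five tokens
    out = decode_step(torch.randn(d))
```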
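And for the benchmarking bullet, a minimal sketch of a tokens-per-second measurement around a training step. The warm-up iterations and torch.cuda.synchronize() calls account for asynchronous CUDA execution; again, the model and sizes are placeholders.

```python
# Minimal sketch: tokens/s benchmark around a placeholder training step.
import time
import torch

model = torch.nn.Linear(4096, 4096).cuda()          # placeholder model
optimizer = torch.optim.AdamW(model.parameters())
batch, seq = 8, 512
x = torch.randn(batch * seq, 4096, device="cuda")

def step():
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

for _ in range(3):                      # warm-up before timing
    step()
torch.cuda.synchronize()                # CUDA is async; sync before starting timer

iters = 10
start = time.perf_counter()
for _ in range(iters):
    step()
torch.cuda.synchronize()                # sync again before stopping timer
elapsed = time.perf_counter() - start
print(f"{batch * seq * iters / elapsed:,.0f} tokens/s")
```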
skills and experience required.
- 3+ years of hands-on experience in GPU-accelerated ML training & inference optimization, preferably for LLMs or large-scale deep learning models.
- Deep expertise in PyTorch, DeepSpeed, FSDP, or Megatron-LM, with experience in distributed training optimizations.
- Strong knowledge of LLM inference optimizations (e.g., quantization, pruning, KV caching, continuous batching).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.