about the company.
The company serves almost 100 million customers in Japan and 1 billion globally, providing more than 70 services across fields such as e-commerce, payment services, financial services, telecommunications, media, and sports.
about the team.
GPUOD
...
about the job.
- Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
- Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput, low-latency model deployment.
- Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
- Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
- Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health (a minimal sketch of this kind of tooling follows this list).
- Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
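For illustration, here is a minimal sketch of the kind of observability glue this work involves: pulling per-GPU utilization from Prometheus. It assumes a Prometheus server scraping NVIDIA's DCGM exporter; the server address, metric name, and label names are hypothetical placeholders, not a description of this team's actual stack.

```python
"""Minimal sketch: query per-GPU utilization from Prometheus.

Assumes Prometheus scrapes NVIDIA's DCGM exporter; the URL, metric
name, and labels below are illustrative placeholders.
"""
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address


def gpu_utilization(metric: str = "DCGM_FI_DEV_GPU_UTIL") -> dict[str, float]:
    """Return current utilization (%) keyed by node/GPU index."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        # Average the exporter's gauge per (node, GPU) pair; the
        # Hostname/gpu labels depend on the exporter configuration.
        params={"query": f"avg by (Hostname, gpu) ({metric})"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {
        f'{r["metric"].get("Hostname", "?")}/gpu{r["metric"].get("gpu", "?")}': float(r["value"][1])
        for r in result
    }


if __name__ == "__main__":
    for gpu, util in sorted(gpu_utilization().items()):
        print(f"{gpu}: {util:.0f}%")
```

A query like this is the raw material for the Grafana dashboards and utilization alerts the role maintains.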
skills and experience required.
- 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
- Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., Kubeflow, Volcano, custom schedulers).
- Strong programming skills in Go or Python for platform development, automation, and tooling.
- Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
- Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
- Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
- Familiarity with the NVIDIA Triton Inference Server or a similar serving framework, and with tuning serving parameters to strike a good trade-off between latency and throughput (see the serving sketch after this list).
- Hands-on experience with GPU clusters, including troubleshooting NVIDIA driver, CUDA, and NCCL issues (see the NCCL sketch after this list).
- Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
- Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
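As a taste of the serving-parameter tuning mentioned above, here is a minimal vLLM sketch showing two engine knobs that trade latency against throughput. The model name and parameter values are illustrative only; real tuning would be driven by latency and throughput measured on the target GPUs.

```python
"""Minimal sketch: vLLM knobs that trade latency against throughput.

Model name and values are illustrative, not recommendations.
"""
from vllm import LLM, SamplingParams

# Two engine-level knobs with a direct latency/throughput effect:
#  - max_num_seqs caps the continuous-batching batch size (larger =
#    higher throughput, worse per-request latency),
#  - gpu_memory_utilization controls how much VRAM the KV cache may
#    claim (larger = more concurrent sequences in flight).
llm = LLM(
    model="facebook/opt-125m",  # small model, illustrative only
    max_num_seqs=64,
    gpu_memory_utilization=0.90,
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain GPU scheduling in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```

Raising max_num_seqs improves batch-level throughput at the cost of per-request latency, which is exactly the trade-off this requirement refers to.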
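And for the NCCL troubleshooting item, a minimal sketch of a common first diagnostic step: enabling NCCL's own logging around a trivial collective in a PyTorch job. The environment variables are standard NCCL knobs; the rendezvous setup assumes a torchrun-style launch and is illustrative only.

```python
"""Minimal sketch: surface NCCL diagnostics in a PyTorch DDP job.

Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK
and the master address; those assumptions are illustrative only.
"""
import os

import torch
import torch.distributed as dist

# Standard NCCL debugging knobs: INFO-level logs scoped to the init
# and network subsystems are a common first step when diagnosing
# hangs or "unhandled system error" failures in multi-node training.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A one-element all_reduce is a cheap smoke test: if it hangs or
    # errors, the NCCL logs usually point at the failing transport.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"all_reduce OK across {dist.get_world_size()} ranks: {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```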