about the company.
The company serves almost 100 million customers in Japan and 1 billion globally, providing more than 70 services across fields such as e-commerce, payment services, financial services, telecommunications, media, and sports.
about the team.
GPUOD
...
about the job.
- Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
- Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput, low-latency model deployment.
- Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
- Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
- Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health (a minimal sketch of this kind of tooling follows this list).
- Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).
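For illustration, here is a minimal sketch of the kind of observability glue this work involves: pulling per-GPU utilization from Prometheus. It assumes a Prometheus server scraping NVIDIA's DCGM exporter; the server address, metric name, and label names are hypothetical placeholders, not a description of this team's actual stack.

```python
"""Minimal sketch: query per-GPU utilization from Prometheus.

Assumes Prometheus scrapes NVIDIA's DCGM exporter; the URL, metric
name, and labels below are illustrative placeholders.
"""
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address


def gpu_utilization(metric: str = "DCGM_FI_DEV_GPU_UTIL") -> dict[str, float]:
    """Return current utilization (%) keyed by node/GPU index."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        # Average the exporter's gauge per (node, GPU) pair; the
        # Hostname/gpu labels depend on the exporter configuration.
        params={"query": f"avg by (Hostname, gpu) ({metric})"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {
        f'{r["metric"].get("Hostname", "?")}/gpu{r["metric"].get("gpu", "?")}': float(r["value"][1])
        for r in result
    }


if __name__ == "__main__":
    for gpu, util in sorted(gpu_utilization().items()):
        print(f"{gpu}: {util:.0f}%")
```

A query like this is the raw material for the Grafana dashboards and utilization alerts the role maintains.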
skills and experience required.
- 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
- Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., Kubeflow, Volcano, custom schedulers).
- Strong programming skills in Go or Python for platform development, automation, and tooling.
- Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
- Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Strong teamwork and communication skills, with a passion for solving infrastructure challenges.
- Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
- Familiarity with the NVIDIA Triton Inference Server or a similar serving framework, and with tuning serving parameters to strike a good trade-off between latency and throughput (see the serving sketch after this list).
- Hands-on experience with GPU clusters, including troubleshooting NVIDIA driver, CUDA, and NCCL issues (see the NCCL sketch after this list).
- Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
- Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).
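As a taste of the serving-parameter tuning mentioned above, here is a minimal vLLM sketch showing two engine knobs that trade latency against throughput. The model name and parameter values are illustrative only; real tuning would be driven by latency and throughput measured on the target GPUs.

```python
"""Minimal sketch: vLLM knobs that trade latency against throughput.

Model name and values are illustrative, not recommendations.
"""
from vllm import LLM, SamplingParams

# Two engine-level knobs with a direct latency/throughput effect:
#  - max_num_seqs caps the continuous-batching batch size (larger =
#    higher throughput, worse per-request latency),
#  - gpu_memory_utilization controls how much VRAM the KV cache may
#    claim (larger = more concurrent sequences in flight).
llm = LLM(
    model="facebook/opt-125m",  # small model, illustrative only
    max_num_seqs=64,
    gpu_memory_utilization=0.90,
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain GPU scheduling in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```

Raising max_num_seqs improves batch-level throughput at the cost of per-request latency, which is exactly the trade-off this requirement refers to.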
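And for the NCCL troubleshooting item, a minimal sketch of a common first diagnostic step: enabling NCCL's own logging around a trivial collective in a PyTorch job. The environment variables are standard NCCL knobs; the rendezvous setup assumes a torchrun-style launch and is illustrative only.

```python
"""Minimal sketch: surface NCCL diagnostics in a PyTorch DDP job.

Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK
and the master address; those assumptions are illustrative only.
"""
import os

import torch
import torch.distributed as dist

# Standard NCCL debugging knobs: INFO-level logs scoped to the init
# and network subsystems are a common first step when diagnosing
# hangs or "unhandled system error" failures in multi-node training.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A one-element all_reduce is a cheap smoke test: if it hangs or
    # errors, the NCCL logs usually point at the failing transport.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"all_reduce OK across {dist.get_world_size()} ranks: {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```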