PolyU Academy for Artificial Intelligence
Engineer
(Ref. 251204004-IE)
Duties
The appointee will work in the Research Institute for Generative AI (RIGAI) (to be established), one of the constituent research units under the PolyU Academy for Artificial Intelligence (PAAI). He/She will be required to:
(a) oversee the daily operation, monitoring and inspection of the Large Language Model (LLM) GPU cluster to ensure stable and reliable execution of training tasks;
(b) handle GPU node failures, InfiniBand (IB) network anomalies, CUDA/NCCL errors and Kubernetes scheduling failures, and perform rapid troubleshooting, recovery and Root Cause Analysis (RCA);
(c) operate and optimise the Kubernetes GPU scheduling system, including node management, resource quotas, queuing policies, image management and task governance;
(d) build and maintain the large model training environment, including CUDA, PyTorch, base images and container runtime environments (Docker/Containerd);
(e) maintain the monitoring and logging pipelines for the training platform, including Prometheus/Grafana, DCGM, Node Exporter and NCCL metric collection;
(f) participate in training-performance troubleshooting, including low GPU utilisation, NCCL communication bottlenecks, IB network congestion and Pod/Container resource limitations;
(g) support the model team's daily tasks, including environment preparation, task troubleshooting, running automated training evaluation tasks and resolving data access anomalies;
(h) provide technical support for platform scaling, migration and version upgrades, and participate in resource utilisation analysis and capacity planning;
(i) write automation scripts (Python/Shell) and daily operational SOPs to improve operational efficiency and platform reliability; and
(j) perform any other duties as assigned by the Director and the Executive Director of PAAI or his/her delegates.
Qualifications
Applicants should:
(a) have a bachelor’s degree or above in Computer Science, Communications, Electronics or a related field;
(b) have at least five years of work experience in the MLOps field;
(c) be familiar with the LLM training process and have a basic understanding of the model training/evaluation pipeline;
(d) be proficient in Linux system administration and basic maintenance of GPU servers;
(e) have knowledge of the Kubernetes operation framework and the principles of GPU workload scheduling;
(f) be familiar with PyTorch, and have knowledge of NCCL communication issues during training and their troubleshooting methods;
(g) have knowledge of the basic principles of IB networking and methods for IB debugging (bandwidth, connectivity, fabric topology);
(h) have knowledge of training environment setup, image building and container runtimes such as Docker and Containerd;
(i) be proficient in Python or Shell and capable of developing automation scripts for operations;
(j) be fluent in both written and spoken English and Chinese;
(k) preferably be familiar with distributed storage systems (JuiceFS/GPFS/HDFS); and
(l) have good communication and collaboration skills, and a strong sense of responsibility.
Preference will be given to those with experience in AI training platform/HPC operations.
Conditions of Service
A highly competitive remuneration package will be offered. Initial appointment will be on a fixed-term gratuity-bearing contract. Re-engagement thereafter is subject to mutual agreement.
Consideration of applications will commence on 11 December 2025 and will continue until the position is filled.
Posting date: 4 December 2025