PolyU Academy for Artificial Intelligence
Engineer
(Ref. 251204004-IE)
Duties
The appointee will work in the Research Institute for Generative AI (RIGAI) (to be established), one of the constituent research units under the PolyU Academy for Artificial Intelligence (PAAI). He/She will be required to:
(a) oversee the daily operation, monitoring and inspection of the Large Language Model (LLM) GPU cluster to ensure stable and reliable execution of training tasks;
(b) handle GPU node failures, InfiniBand (IB) network anomalies, CUDA/NCCL errors and Kubernetes scheduling failures, and perform rapid troubleshooting, recovery and Root Cause Analysis (RCA);
(c) operate and optimise the Kubernetes GPU scheduling system, including node management, resource quotas, queuing policies, image management and task governance;
(d) build and maintain the large model training environment, including CUDA, PyTorch, base images and container runtime environments (Docker/Containerd);
(e) maintain the monitoring and logging pipelines for the training platform, including Prometheus/Grafana, DCGM, Node Exporter and NCCL metric collection;
(f) participate in training-performance troubleshooting, including low GPU utilisation, NCCL communication bottlenecks, IB network congestion and Pod/Container resource limitations;
(g) support the model team's daily tasks, including environment preparation, task troubleshooting, running automated training evaluation tasks and resolving data access anomalies;
(h) provide technical support for platform scaling, migration and version upgrades, and participate in resource utilisation analysis and capacity planning;
(i) write automation scripts (Python/Shell) and daily operational SOPs to improve operational efficiency and platform reliability; and
(j) perform any other duties as assigned by the Director and the Executive Director of PAAI or his/her delegates.
Qualifications
Applicants should:
(a) have a bachelor’s degree or above in Computer Science, Communications, Electronics or a related field;
(b) have at least five years of work experience in the MLOps field;
(c) be familiar with the LLM training process and have a basic understanding of the model training/evaluation pipeline;
(d) be proficient in Linux system administration and basic maintenance of GPU servers;
(e) have knowledge of the Kubernetes operation framework and the principles of GPU workload scheduling;
(f) be familiar with PyTorch, and have knowledge of NCCL communication issues during training and their troubleshooting methods;
(g) have knowledge of the basic principles of IB networking and methods for IB debugging (bandwidth, connectivity, fabric topology);
(h) have knowledge of training environment setup, image building and container runtimes such as Docker and Containerd;
(i) be proficient in Python or Shell and capable of developing automation scripts for operations;
(j) be fluent in both written and spoken English and Chinese;
(k) preferably be familiar with distributed storage systems (JuiceFS/GPFS/HDFS); and
(l) have good communication and collaboration skills, and a strong sense of responsibility.
Preference will be given to those with experience in AI training platform/HPC operations.
Conditions of Service
A highly competitive remuneration package will be offered. Initial appointment will be on a fixed-term gratuity-bearing contract. Re-engagement thereafter is subject to mutual agreement.
Consideration of applications will commence on 11 December 2025 and will continue until the position is filled.
Posting date: 4 December 2025