
PolyU Academy for Artificial Intelligence

Engineer

(Ref. 251204004-IE)

Duties 

The appointee will be required to work in one of the constituent research units, the Research Institute for Generative AI (RIGAI) (to be established), under the PolyU Academy for Artificial Intelligence (PAAI).  The appointee will be required to:

(a)    oversee the daily operation, monitoring and inspection of the Large Language Model (LLM) GPU cluster to ensure the stable and reliable running of training tasks;

(b)    handle GPU node failures, InfiniBand (IB) network anomalies, CUDA/NCCL errors and Kubernetes scheduling failures, and perform rapid troubleshooting, recovery and Root Cause Analysis (RCA);

(c)    operate and optimise the Kubernetes GPU scheduling system, including node management, resource quotas, queuing policies, image management and task governance;

(d)    build and maintain the large model training environment, including CUDA, PyTorch, base images and container runtime environments (Docker/Containerd); 

(e)    maintain the monitoring and logging pipelines for the training platform, including Prometheus/Grafana, DCGM, Node Exporter and NCCL metric collection;

(f)     participate in training-performance troubleshooting, including low GPU utilisation, NCCL communication bottlenecks, IB network congestion and Pod/Container resource limitations;

(g)    support the model team's daily tasks, including environment preparation, task troubleshooting, running automated training evaluation tasks and resolving data access anomalies;

(h)    provide technical support for platform scaling, migration and version upgrades, and participate in resource utilisation analysis and capacity planning;

(i)     write operational automation scripts (Python/Shell) and daily operational SOPs to improve efficiency and platform reliability; and

(j)     perform any other duties as assigned by the Director and the Executive Director of PAAI or his/her delegates.

Qualifications

Applicants should: 

(a)    have a bachelor’s degree or above in Computer Science, Communications, Electronics or another related field;

(b)    have at least five years of work experience in the MLOps field;

(c)    be familiar with the LLM training process and have a basic understanding of the model training/evaluation pipeline;

(d)    be proficient in Linux system administration and basic maintenance of GPU servers;

(e)    have knowledge of the Kubernetes operation framework and the principles of GPU workload scheduling;

(f)     be familiar with PyTorch, and have knowledge of NCCL communication issues during training and their troubleshooting methods;

(g)    have knowledge of the basic principles of IB networking and methods for IB debugging (bandwidth, connectivity, fabric topology);

(h)    have knowledge of the training environment, image building and container runtime environments such as Docker and Containerd;

(i)     be proficient in Python or Shell and capable of developing automation scripts for operations;

(j)     be fluent in both written and spoken English and Chinese;

(k)    preferably be familiar with distributed storage systems (JuiceFS/GPFS/HDFS); and

(l)     have good communication and collaboration skills, and a strong sense of responsibility.

Preference will be given to those with experience in AI training platform/HPC operations.

Conditions of Service

A highly competitive remuneration package will be offered.  Initial appointment will be on a fixed-term gratuity-bearing contract.  Re-engagement thereafter is subject to mutual agreement.  


Consideration of applications will commence on 11 December 2025 and continue until the position is filled.



Posting date: 4 December 2025