PolyU Academy for Artificial Intelligence
Senior Engineer / Engineer
(Ref. 251204002-IE)
Duties
The appointee will work in the Research Institute for Generative AI (RIGAI) (to be established), a constituent research unit under the PolyU Academy for Artificial Intelligence (PAAI), and will be required to:
(a) take responsibility for the design and development of a Large Language Model (LLM) training platform, building unified capabilities for GPU resource pooling, training job scheduling and inference acceleration, together with a Machine Learning Operations (MLOps) platform, to support efficient model training iteration;
(b) lead the construction of the GPU computing cluster centred on Kubernetes and the NVIDIA GPU Operator, including node planning, resource management, scheduling policies and container runtime environment setup (Docker/containerd);
(c) build the software stack for the NVIDIA cluster, including CUDA, NVIDIA drivers, Fabric Manager, PyTorch Distributed and NCCL communication, to ensure high performance and stability for distributed training;
(d) design and implement critical infrastructure components and toolchains for the training platform, including training task orchestration and automated pipelines, a unified base image system (CUDA + PyTorch), data loading and data distribution components, and training artifact and model version management;
(e) collaborate with the LLM team to support the implementation, optimisation and efficiency improvement of distributed training at the framework layer (PyTorch Distributed, Megatron, SGLang) on the platform;
(f) participate in building the monitoring and observability system, covering GPU metrics, NCCL communication, the InfiniBand (IB) network, storage I/O and Pod runtime status, as well as establishing alerting strategies;
(g) write platform construction documentation, development specifications, and automation scripts and tooling (Python/Go/Bash/Terraform) to enhance engineering consistency and delivery quality; and
(h) perform any other duties as assigned by the Director of PAAI or his delegates.
Qualifications
Applicants should:
(a) have a master’s degree or above in Computer Science, Communications, Electronics or a related discipline;
(b) have at least five years of solid experience in the MLOps field at supervisory level;
(c) have a basic understanding of LLM training processes, multimodal models and AI Agents;
(d) be familiar with the overall training, inference and evaluation pipeline;
(e) be proficient in mainstream languages such as Python or Go, with good engineering skills, coding standards and backend development capabilities;
(f) be familiar with LLM-related training frameworks such as PyTorch, PyTorch Distributed, SGLang and Megatron;
(g) have knowledge of Kubernetes and its GPU scheduling ecosystem, including GPU Operator, container runtime, image building and pipeline engineering processes;
(h) be familiar with NVIDIA Hopper GPU architecture, NCCL communication, InfiniBand network, GPU/NVLink topology and performance bottlenecks;
(i) have knowledge of HDFS, JuiceFS, GPFS or similar large-scale data access systems, and an understanding of training data reading bottlenecks;
(j) preferably have experience in foundational infrastructure technologies such as Ray, message queues, backend storage and API services;
(k) preferably have experience in platform engineering, training platform development, MLOps or distributed systems development;
(l) be capable of translating model team requirements into engineered solutions;
(m) have good communication skills; and
(n) have a good command of both written and spoken English and Chinese.
Applicants with less supervisory experience will be considered for the post of Engineer.
Conditions of Service
A highly competitive remuneration package will be offered. Initial appointment will be on a fixed-term gratuity-bearing contract. Re-engagement thereafter is subject to mutual agreement.
Consideration of applications will commence on 11 December 2025 and continue until the position is filled.
Posting date: 4 December 2025