University Centre for AI Computing

Senior Engineer

(Ref. 260514014-IE)

Duties 

The appointee will be required to:

(a)    lead the design, development and implementation of a new Large Language Model (LLM) inference platform on heterogeneous GPU infrastructure to provide a general LLM inference service for the University;

(b)    resolve key technical problems in building the core modules of the LLM inference platform, such as inference gateways, scheduling and traffic routing;

(c)    undertake in-depth optimization and customized development of inference engines (e.g. vLLM, SGLang) to improve inference throughput and latency;

(d)    participate in the development of the PD (Prefill-Decode) separation architecture, optimizing the resource allocation and collaborative scheduling in the Prefill and Decode phases;

(e)    be responsible for the development of the KV Cache storage and management system, including the design and implementation of KV Cache migration, sharing, and eviction strategies;

(f)    participate in the development of the model distribution system to support efficient distribution and loading of LLMs among multiple nodes;

(g)    continuously optimize the performance, stability and cost efficiency of the inference pipeline based on online monitoring and user feedback; and

(h)    perform any other duties as assigned by the Head of Unit or his/her delegates.

Qualifications 

Applicants should have: 

(a)    a master’s degree in Computer Science, Data Science, Business Analytics or a related discipline;

(b)    at least five years of solid experience in LLMs;

(c)    familiarity with the architecture and implementation of open-source inference engines such as vLLM and SGLang, and knowledge of core mechanisms such as Continuous Batching and PagedAttention;

(d)    knowledge of scheduling strategy in LLM inference scenarios, and familiarity with cutting-edge architectures such as KV Cache management and PD separation; and

(e)    expertise in designing and debugging complex systems, and the ability to systematically analyse performance bottlenecks in the inference pipeline.

Preference will be given to those with experience in high-performance system development and familiarity with GPU programming (e.g. CUDA, Triton).

Conditions of Service

A highly competitive remuneration package will be offered.  Initial appointment will be on a fixed-term gratuity-bearing contract.  Re-engagement thereafter is subject to mutual agreement. 


Consideration of applications will commence on 1 July 2026 and continue until the position is filled.



Posting date: 14 May 2026