Adjustments of HKUST SuperPOD resources (1)

30 Apr 2024

Dear HKUST SuperPOD users,

Thank you for your enthusiastic engagement in the pilot use of the HKUST SuperPOD, and for your patience with the challenges of the initial phase.

As new applications continue to arrive and researchers become more familiar with the technical environment, demand for cluster resources has risen steadily. To improve system stability and optimize resource allocation, we have made the following changes to the cluster:

  1. A dedicated priority job queue has been established for projects appointed by the Office of Provost. Initially, this queue will be accessible to appointed projects only. We are exploring whether other users could submit jobs to this queue while its resources are idle, subject to pre-emption. Further information on this arrangement will be shared once the setup is complete.
     
  2. Some tasks within AI workloads, such as data pre-processing, need more CPU resources than GPU resources. To use GPU resources more efficiently and reserve them for GPU-intensive workloads, we are adding two Intel-based CPU machines to the HKUST SuperPOD cluster, dedicated to CPU-oriented tasks. The new job queue, named “cpu”, is now available to all HKUST SuperPOD users. For details of the resource limits of this queue, please refer to https://itsc.hkust.edu.hk/services/academic-teaching-support/high-performance-computing/superpod/partitionandquota
     
  3. The debug job queue has been reconfigured with MIG disabled. Under the revised configuration, each request is allocated 1 x H800 GPU, with a maximum wall time of 2 hours per request. We expect this adjustment to significantly reduce the perceived wait time for users running brief tasks, such as debugging.
     
  4. To enhance the overall stability of the system, we are implementing a scheduled reboot of the login nodes every Sunday at 6:00 AM, starting this week. These reboots are intended to clear lingering issues such as stale or hanging jobs, as well as user sessions that have become inactive. Please note that active interactive sessions running within the debug job queue may be affected during the reboot. We kindly remind users to take this into consideration and plan their activities accordingly.
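
For users who would like to check whether an interactive session might run into the weekly login node reboot, the following Python snippet is a minimal sketch only. It assumes times are given in local Hong Kong time as naive datetime values, and it uses the 2-hour debug queue wall time as the default session length. The function names next_reboot and session_hits_reboot are illustrative and are not part of any SuperPOD tool.

```python
from datetime import datetime, timedelta

# Assumed schedule, taken from the notice: every Sunday at 06:00 local time.
REBOOT_WEEKDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6
REBOOT_HOUR = 6


def next_reboot(now: datetime) -> datetime:
    """Return the first Sunday 06:00 at or after `now`."""
    days_ahead = (REBOOT_WEEKDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=REBOOT_HOUR, minute=0, second=0, microsecond=0
    )
    # If it is already Sunday and past 06:00, move to the following week.
    if candidate < now:
        candidate += timedelta(days=7)
    return candidate


def session_hits_reboot(
    start: datetime, duration: timedelta = timedelta(hours=2)
) -> bool:
    """Return True if a session starting at `start` is still running
    when the next reboot begins."""
    return next_reboot(start) < start + duration


if __name__ == "__main__":
    start = datetime(2024, 5, 5, 5, 0)  # a Sunday, 05:00
    print("Next reboot:", next_reboot(start))
    print("Session affected:", session_hits_reboot(start))
```

A session that ends exactly at 6:00 AM is treated as unaffected in this sketch; users who prefer a safety margin could add a few minutes to the duration.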
     

We appreciate your cooperation as we strive to optimize resource allocation and enhance the overall performance of the HKUST SuperPOD. Should you have any further inquiries or feedback, please do not hesitate to reach out to us.

Regards,
Alan Wong

Head of Teaching Technologies
Information Technology Services Center
Hong Kong University of Science and Technology
Direct Line: (852) 2358-6246