Working with the NVIDIA GPU Cloud (NGC) catalog

The NGC catalog provides access to GPU-accelerated software that speeds up end-to-end workflows with performance-optimized containers, pretrained AI models, and industry-specific SDKs that can be deployed on premises, in the cloud, or at the edge.

Setup for the first NGC login

1. Go to the NGC website: https://catalog.ngc.nvidia.com/

    In the top right corner, click Sign In/Sign Up.

2. Log in to NGC using your HKUST email address.

3. Select The Hong Kong University of Sc... as your organization, as shown below, and you will be redirected to the HKUST SSO sign-in page.

4. After signing in, open the menu in the top right corner and click "Setup".

5. You need to generate an API key to access NGC services.

     Click Generate API Key to generate your API key and copy it to the clipboard.

6. Log in to the SuperPOD login node and run the commands below, replacing <token> with the key copied in the previous step. The here-document delimiter is quoted ('_EOF_') so that the literal username $oauthtoken is written to the file instead of being expanded by the shell.

$ mkdir -p ~/.config/enroot
$ cat > ~/.config/enroot/.credentials <<'_EOF_'
# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password <token>
machine authn.nvidia.com login $oauthtoken password <token>
_EOF_

7. NGC is now set up. You do not need to modify the .credentials file unless you change the NGC API key later.
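
To confirm the setup works, you can import an image from nvcr.io with enroot directly on the login node. The commands below are a minimal sketch, assuming enroot is available in your login shell; the CUDA base image tag is only illustrative, and any NGC image you are entitled to pull will do.

$ chmod 600 ~/.config/enroot/.credentials    # the file contains your API key
$ # enroot reads ~/.config/enroot/.credentials when pulling from nvcr.io
$ enroot import docker://nvcr.io#nvidia/cuda:12.3.2-base-ubuntu22.04
$ ls *.sqsh                                  # a successful import leaves a local squashfs file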

Run a PyTorch job using an NGC container image

1. Set up and log in to NGC as described above, then search for the PyTorch container and copy the image path as shown below:

2. Prepare the Slurm script job.slurm.ngc in your home directory. Because # is the character that starts an #SBATCH comment, it must be escaped when it is also used in --container-image as the separator between the registry and the image name (a variant without the escaping is shown after step 4).

#!/bin/bash
#SBATCH --job-name=pytorch-ngc   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --container-image nvcr.io\#nvidia/pytorch:24.02-py3

python -c 'import torch ; print(torch.__version__)'

3. Run sbatch to submit the job:

$ sbatch --wait -o slurm.out job.slurm.ngc

4. You should see output like the following when the job has completed:

$ cat slurm.out
pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
2.3.0a0+ebedce2
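
As noted in step 2, the # separator only needs to be escaped inside #SBATCH comments; when the image is passed directly on the command line, no escaping is required. The one-liner below is a sketch assuming the cluster also exposes pyxis's --container-image option to srun:

$ srun --nodes=1 --ntasks=1 --gres=gpu:1 --time=00:05:00 \
       --container-image nvcr.io#nvidia/pytorch:24.02-py3 \
       python -c 'import torch; print(torch.__version__)'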