7. SDP HPC

The SDP can run parallel simulation codes on the configured HPC cluster. This capability is intended for generating data for the SD models.

7.1. HPC Environment

7.1.1. OpenHPC Modules

Two HPC environment modules have been deployed on the SDP HPC: the Intel and GNU (gcc) compilers, both with OpenMPI4 as the MPI library. To use the Intel or GNU environment, load the ohpc-intel or ohpc-gnu12 module, respectively. The associated submodules (such as fftw and netcdf) then become visible and can be loaded. An example of using FFTW in the Intel environment is given below.

[xiangliu@localhost ~]$ module overview
-------------------------------- /opt/ohpc/pub/modulefiles ---------------------------------
ohpc-intel (1)   ohpc-gnu12 (1)
[xiangliu@localhost ~]$ module load ohpc-intel
Loading compiler version 2023.1.0
[xiangliu@localhost ~]$ ml ov
------------------------- /opt/ohpc/pub/moduledeps/intel-openmpi4 --------------------------
fftw (1)   netcdf-cxx (1)   netcdf (1)   phdf5 (1)
[xiangliu@localhost ~]$ ml fftw
[xiangliu@localhost ~]$ ml
Currently Loaded Modules:
1) compiler/2023.1.0      2) fftw/3.3.10        3) mkl/2023.1.0
4) intel/2023.1.0         5) openmpi4/4.1.4     6) ohpc-intel
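
The GNU environment works in the same way. The following is a minimal sketch: the submodule names mirror the Intel example above, and the exact versions available under ohpc-gnu12 on the SDP HPC may differ.

[xiangliu@localhost ~]$ module load ohpc-gnu12
[xiangliu@localhost ~]$ ml ov     # the gnu12/OpenMPI4 submodules (fftw, netcdf, ...) should be listed here
[xiangliu@localhost ~]$ ml fftw   # load FFTW built against gcc and OpenMPI4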

7.1.2. Temporary Storage: Scratch

The SDP HPC provides temporary (scratch) storage for the large volumes of data generated by simulation codes. Users must store large simulation data in the scratch directory. The scratch directory can be created automatically with the cdscratch command once the hpc-tools module is loaded. After running this command, a soft link to the user's scratch directory is created at the top level of the user's home directory.

[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ ml
Currently Loaded Modules:
  1) hpc-tools
[xiangliu@localhost ~]$ cdscratch
User scratch created: /data/scratch/xiangliu
Soft link created: /home/xiangliu/scratch
Now you can type “cd ~/scratch” to change to your scratch directory.
[xiangliu@localhost ~]$ ll
lrwxrwxrwx   1 xiangliu xiangliu   18 Jul 26 scratch -> /data/scratch/xiangliu
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ pwd
/home/xiangliu/scratch

Warning

The scratch directory is cleaned periodically. Do not keep important files only in this directory, and back up any simulation data you still need.
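
For example, a minimal way to copy finished results back to permanent storage (the run directory and destination below are placeholders) is:

[xiangliu@localhost ~]$ rsync -av ~/scratch/my_run/ ~/backup/my_run/   # 'my_run' and '~/backup' are placeholder paths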

7.2. SLURM Workload Manager

The SDP HPC uses the SLURM workload manager, and large-scale computing tasks should be submitted to the SDP HPC through SLURM. A detailed guide to using SLURM can be found in the SLURM Documentation; this section gives a brief tutorial based on the OpenHPC Installation Guide.

Hint

The following table provides approximate command equivalences between SLURM and OpenPBS:

Command                     OpenPBS                   SLURM
Submit batch job            qsub [job script]         sbatch [job script]
Request interactive shell   qsub -I /bin/bash         salloc
Delete job                  qdel [job id]             scancel [job id]
Queue status                qstat -q                  sinfo
User's job list             qstat -u user_name        squeue -u user_name
Job status                  qstat -f [job id]         scontrol show job [job id]
Node status                 pbsnodes [node name]      scontrol show node [node id]

7.2.1. Interactive Task Submission

If you want to test a simulation code or run a heavy task from a programming IDE, you can request a temporary allocation with a limited wall time. The example below uses a hello-world program. Switch to the scratch directory (make sure you have followed Temporary Storage: Scratch) and copy the hello program into your scratch directory with the cphello command.

[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ cphello
[xiangliu@localhost scratch]$ ls
hello.c

Then compile the hello program with mpicc, with the ohpc-intel module loaded. If you run the compiled program directly, you will see that it runs on the administration node.

[xiangliu@localhost scratch]$ ml ohpc-intel
[xiangliu@localhost scratch]$ mpicc -O3 hello.c
[xiangliu@localhost scratch]$ ls
hello.c  a.out
[xiangliu@localhost scratch]$ ./a.out
Hello, world (1 procs total)
    --> Process #   0 of   1 is alive. -> admin

Now you can run the hello program interactively after requesting compute resources with the salloc command.

[xiangliu@localhost scratch]$ salloc -n4 -N2 # "-n" and "-N" specify the number of tasks (processes) and the number of nodes requested
salloc: Granted job allocation 41
[xiangliu@localhost scratch]$ prun $SCRATCHPATH/a.out # use the absolute path of the scratch directory; the hpc-tools module must be loaded
 Hello, world (4 procs total)
--> Process #   0 of   4 is alive. -> cpu1
--> Process #   1 of   4 is alive. -> cpu1
--> Process #   2 of   4 is alive. -> cpu2
--> Process #   3 of   4 is alive. -> cpu2
[xiangliu@localhost scratch]$ exit
exit
salloc: Relinquishing job allocation 41
salloc: Job allocation 41 has been revoked.
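
The allocation above uses the default wall time. If you need a specific limit, SLURM's --time option can be added to salloc; the value below is only illustrative, and site limits on the SDP HPC may differ.

[xiangliu@localhost scratch]$ salloc -n4 -N2 --time=00:30:00 # request a 30-minute wall time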

7.2.2. Batch File Submission

You can also submit the task with a batch file. Continuing with the hello program above, use cpslurmjob to copy the batch job file into your scratch directory, then submit the task with the sbatch command (a sketch of a possible job script is shown after the transcript below).

[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ cpslurmjob
[xiangliu@localhost scratch]$ ls
hello.c  job.sh  a.out
[xiangliu@localhost scratch]$ sbatch job.sh
Submitted batch job 69
[xiangliu@localhost scratch]$ ls
hello.c  job.sh  a.out  hellow.69.out  hellow.69.err
[xiangliu@localhost scratch]$ cat hellow.69.out
Hello, world (4 procs total)
    --> Process #   0 of   4 is alive. -> cpu1
    --> Process #   1 of   4 is alive. -> cpu1
    --> Process #   2 of   4 is alive. -> cpu2
    --> Process #   3 of   4 is alive. -> cpu2
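
The contents of the copied job.sh are not reproduced here. A minimal sketch of a SLURM batch script that would produce output files like those above is shown below; the job name, node/task counts, and the use of prun are assumptions based on this example, and the actual job.sh provided by cpslurmjob may differ.

#!/bin/bash
#SBATCH -J hellow            # job name
#SBATCH -o hellow.%j.out     # standard output file (%j expands to the job ID)
#SBATCH -e hellow.%j.err     # standard error file
#SBATCH -N 2                 # number of nodes
#SBATCH -n 4                 # total number of MPI tasks

prun ./a.out                 # launch the MPI program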

7.2.3. SLURM Task Management

Queue information can be viewed with the sinfo command.

[xiangliu@localhost scratch]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST

Task status can be viewed with the squeue command.

[xiangliu@localhost scratch]$ squeue
         JOBID PARTITION     NAME     USER      ST       TIME  NODES NODELIST(REASON)
            70    p      interact     xiangliu  R       0:06      1 c1

A submitted task can be cancelled with the scancel command.

[xiangliu@localhost scratch]$ salloc -n2 -N1
salloc: Granted job allocation 70
[xiangliu@localhost scratch]$ squeue
         JOBID PARTITION     NAME     USER      ST       TIME  NODES NODELIST(REASON)
            70    p      interact     xiangliu  R       0:06      1 c1
[xiangliu@localhost scratch]$ scancel 70
salloc: Job allocation 70 has been revoked.
Hangup

7.3. Simulation Codes

Several simulation codes have been compiled and are ready to use on the SDP. The compiled codes and their sources are located in /home/imas-public/hpc-codes; the folders with the suffix _bin contain the compiled binaries. Before using them, load the compiler module.

[xiangliu@localhost ~]$ module load compiler/GCC
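
As a sketch of how a compiled code could then be launched through SLURM (the _bin folder and executable names below are placeholders, and any input files or extra modules a particular code needs are not shown):

[xiangliu@localhost scratch]$ ml hpc-tools
[xiangliu@localhost scratch]$ module load compiler/GCC
[xiangliu@localhost scratch]$ salloc -n2 -N1
[xiangliu@localhost scratch]$ prun /home/imas-public/hpc-codes/<code>_bin/<executable>   # substitute the actual _bin folder and binary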

7.3.1. CQL3D Code

The documentation can be found in the CQL3D Documentation.

7.3.2. GENRAY Code

The documentation can be found in the GENRAY Documentation.