7. SDP HPC
The SDP can run parallel simulation codes on the configured HPC cluster. This is intended for generating the data needed by the SD models.
7.1. HPC Environment
7.1.1. OpenHPC Modules
Two HPC environment modules have been deployed on the SDP HPC: the Intel and GCC compilers, each with OpenMPI4 as the MPI library. To use the Intel or GNU environment, load the ohpc-intel or ohpc-gnu12 module, respectively. The associated submodules (such as fftw and netcdf) then become visible and can be loaded. An example of using FFTW in the Intel environment is given below.
[xiangliu@localhost ~]$ module overview
-------------------------------- /opt/ohpc/pub/modulefiles ---------------------------------
ohpc-intel (1) ohpc-gnu12 (1)
[xiangliu@localhost ~]$ module load ohpc-intel
Loading compiler version 2023.1.0
[xiangliu@localhost ~]$ ml ov
------------------------- /opt/ohpc/pub/moduledeps/intel-openmpi4 --------------------------
fftw (1) netcdf-cxx (1) netcdf (1) phdf5 (1)
[xiangliu@localhost ~]$ ml fftw
[xiangliu@localhost ~]$ ml
Currently Loaded Modules:
1) compiler/2023.1.0 2) fftw/3.3.10 3) mkl/2023.1.0
4) intel/2023.1.0 5) openmpi4/4.1.4 6) ohpc-intel
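Once the fftw module is loaded, it typically exports paths that can be passed to the compiler. The following is a minimal sketch, assuming a hypothetical source file fft_test.c and the FFTW_INC/FFTW_LIB variables usually set by the OpenHPC fftw module; run "module show fftw" on the SDP to confirm the exact variable names.
# Sketch only: fft_test.c is a hypothetical source file, and FFTW_INC/FFTW_LIB
# are the variables usually exported by the OpenHPC fftw module.
module show fftw                                   # confirm what the module exports
mpicc -O3 fft_test.c -I$FFTW_INC -L$FFTW_LIB -lfftw3 -o fft_test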
7.1.2. Temporary Storage: Scratch
The SDP HPC provides temporary (scratch) storage for the large amounts of data generated by the simulation codes. Users must store large simulation data in the scratch directory. The scratch directory can be created automatically by the cdscratch command once the hpc-tools module is loaded. After running this command, a soft link to the user's scratch directory is created at the top level of the user's home directory.
[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ ml
Currently Loaded Modules:
1) hpc-tools
[xiangliu@localhost ~]$ cdscratch
User scratch created: /data/scratch/xiangliu
Soft link created: /home/xiangliu/scratch
Now you can type “cd ~/scratch” to change to your scratch directory.
[xiangliu@localhost ~]$ ll
lrwxrwxrwx 1 xiangliu xiangliu 18 Jul 26 scratch -> /data/scratch/xiangliu
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ pwd
/home/xiangliu/scratch
Warning
The scratch directory is cleaned periodically. Do not keep important files in this directory, and back up any useful simulation data.
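For example, finished results can be copied from scratch to permanent storage before the clean-up, e.g. with rsync; run_output and ~/backup below are only placeholders for your own directories.
# Placeholder example: copy results out of scratch to a backup location in the home directory.
rsync -av ~/scratch/run_output/ ~/backup/run_output/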
7.2. SLURM Workload Manager
The SDP HPC uses the SLURM workload manager. Large-scale computing tasks should be submitted to the SDP HPC through SLURM. A detailed guide to using SLURM can be found in the SLURM Documentation. This section gives a brief tutorial based on the OpenHPC Installation Guide.
Hint
The following table provides approximate command equivalences between SLURM and OpenPBS:
| Command | OpenPBS | SLURM |
|---|---|---|
| Submit batch job | qsub [job script] | sbatch [job script] |
| Request interactive shell | qsub -I /bin/bash | salloc |
| Delete job | qdel [job id] | scancel [job id] |
| Queue status | qstat -q | sinfo |
| User's job list | qstat -u user_name | squeue -u user_name |
| Job status | qstat -f [job id] | scontrol show job [job id] |
| Node status | pbsnodes [node name] | scontrol show node [node id] |
7.2.1. Interactive Task Submission
If you want to test a simulation code or run a heavy task from a programming IDE, you can request a temporary allocation with a limited duration (wall time). Here, an example is given with a hello-world program. Switch to the scratch directory (make sure you have followed Temporary Storage: Scratch) and copy the hello program into your scratch directory with the cphello command.
[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ cphello
[xiangliu@localhost scratch]$ ls
hello.c
Then you can compile the hello program with mpicc once the ohpc-intel module is loaded. If you run the compiled program directly, you will see that it runs on the administration node.
[xiangliu@localhost scratch]$ ml ohpc-intel
[xiangliu@localhost scratch]$ mpicc -O3 hello.c
[xiangliu@localhost scratch]$ ls
hello.c a.out
[xiangliu@localhost scratch]$ ./a.out
Hello, world (1 procs total)
--> Process # 0 of 1 is alive. -> admin
Now you can run the hello program interactively after requesting computing resources with the salloc command.
[xiangliu@localhost scratch]$ salloc -n4 -N2 # "-n" and "-N" specify the number of tasks and the number of nodes requested
salloc: Granted job allocation 41
[xiangliu@localhost scratch]$ prun $SCRATCHPATH/a.out # use the absolute path of the scratch directory; the hpc-tools module must be loaded
Hello, world (4 procs total)
--> Process # 0 of 4 is alive. -> cpu1
--> Process # 1 of 4 is alive. -> cpu1
--> Process # 2 of 4 is alive. -> cpu2
--> Process # 3 of 4 is alive. -> cpu2
[xiangliu@localhost scratch]$ exit
exit
salloc: Relinquishing job allocation 41
salloc: Job allocation 41 has been revoked.
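The allocation above relies on the partition's default limits; if you want to set the wall time mentioned at the beginning of this subsection explicitly, SLURM's standard -t/--time option can be added to salloc, for example:
# Request 4 tasks on 2 nodes for at most 30 minutes of wall time (example values).
salloc -n4 -N2 -t 00:30:00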
7.2.2. Batch File Submission
One can also submit the task with a batch file. Continuing with the hello program above, use cpslurmjob to copy the batch job file into your scratch directory, then use the sbatch command to submit the task.
[xiangliu@localhost ~]$ ml hpc-tools
[xiangliu@localhost ~]$ cd scratch/
[xiangliu@localhost scratch]$ cpslurmjob
[xiangliu@localhost scratch]$ ls
hello.c job.sh a.out
[xiangliu@localhost scratch]$ sbatch job.sh
Submitted batch job 69
[xiangliu@localhost scratch]$ ls
hello.c job.sh a.out hellow.69.out hellow.69.err
[xiangliu@localhost scratch]$ cat hellow.69.out
Hello, world (4 procs total)
--> Process # 0 of 4 is alive. -> cpu1
--> Process # 1 of 4 is alive. -> cpu1
--> Process # 2 of 4 is alive. -> cpu2
--> Process # 3 of 4 is alive. -> cpu2
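The contents of job.sh are not shown above. A batch script consistent with the session above would look roughly like the sketch below; the job name, output-file pattern, and the use of prun are assumptions inferred from the hellow.69.* files and the rest of this guide, not the exact file shipped by cpslurmjob.
#!/bin/bash
#SBATCH --job-name=hellow        # job name (matches the hellow.69.* files above)
#SBATCH --output=%x.%j.out       # stdout file: <job name>.<job id>.out
#SBATCH --error=%x.%j.err        # stderr file: <job name>.<job id>.err
#SBATCH --ntasks=4               # total number of MPI tasks
#SBATCH --nodes=2                # number of nodes
module load ohpc-intel           # same environment used to compile hello.c
prun ./a.out                     # launch the MPI program with the OpenHPC prun wrapper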
7.2.3. SLURM Task Management
The queue information can be viewed with the sinfo command.
[xiangliu@localhost scratch]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
The task status can be viewed with the squeue command.
[xiangliu@localhost scratch]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
70 p interact xiangliu R 0:06 1 c1
A submitted task can be cancelled with the scancel command.
[xiangliu@localhost scratch]$ salloc -n2 -N1
salloc: Granted job allocation 70
[xiangliu@localhost scratch]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
70 p interact xiangliu R 0:06 1 c1
[xiangliu@localhost scratch]$ scancel 70
salloc: Job allocation 70 has been revoked.
Hangup
7.3. Simulation Codes
Several simulation codes have been compiled and are ready to use on the SDP. The compiled codes and their sources are located at /home/imas-public/hpc-codes. The folders with the suffix _bin contain the compiled codes. Before using them, load the corresponding compiler module.
[xiangliu@localhost ~]$ module load compiler/GCC
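As a sketch (the <code>_bin and <executable> names below are placeholders, not the actual layout under /home/imas-public/hpc-codes), a typical workflow is to load the compiler module, move to scratch, and launch the precompiled binary through SLURM, following the interactive steps of Section 7.2.1:
# Placeholder names: replace <code>_bin and <executable> with the folder and
# binary actually present under /home/imas-public/hpc-codes.
module load compiler/GCC
cd ~/scratch
ls /home/imas-public/hpc-codes                     # locate the <code>_bin folder you need
salloc -n4 -N1                                     # request computing resources
prun /home/imas-public/hpc-codes/<code>_bin/<executable>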
7.3.1. CQL3D Code
The documentation can be found at CQL3D Documentation
7.3.2. GENRAY Code
The documentation can be found at GENRAY Documentation