A high-performance computing (HPC) cluster is a network of servers pooled together to maximize their combined computational capabilities — often for computationally intensive workloads such as simulations and modeling. Users should treat the HPC as an extension of their personal computers that provides the extra computational power needed for research. However, users need to exercise prudence when using any HPC, i.e., wrong input will produce wrong output.
For novice or first-time HPC users, the COARE Team prepared this basic HPC module for guidance on how to jump-start your HPC journey prior to running actual jobs. This module focuses on the overview and practicalities of using the Saliksik HPC.
An important reminder for all users is that to be able to effectively use any HPC, some basic knowledge of the Linux command-line interface (terminal) is needed. Here are some useful online references that users can study, which cover the basics of the Linux terminal:
Of course, the Saliksik HPC also has its share of limitations, such as:
After this module, the users are expected to be able to:
For this module, Windows OS users who opt not to use the built-in PowerShell application should install PuTTY and WinSCP for logging in and file transfers, respectively. The usage of these applications is discussed in the relevant sections below.
NOTE: The following sections include commands that should be entered into the terminal, indicated by the shell symbols $ (for Linux, Unix, and macOS systems) and > (for Windows PowerShell). Unless otherwise indicated, terminal commands for Linux/Unix/macOS systems may also be used in Windows PowerShell. Commands or arguments enclosed in square brackets ([ and ]) are optional, while those in angled brackets (< and >) need to be supplied (e.g., <username>). Commands or arguments separated by a vertical bar (|) indicate different choices (e.g., -n name | -p path means that either the name or the path may be used).
The HPC can only be accessed using passwordless SSH, so SSH key(s) need to be appended to the user's account. Every user is responsible for their own account. Account sharing is strictly prohibited as outlined in the COARE Acceptable Use Policy (AUP). This module assumes that the user hasn't logged in to their account.
An SSH key pair consists of a private and a public key. The public key is the one appended to the user's account in the HPC and is used to verify, on every login, the private key stored on the user's personal computer.
For Windows OS users, follow only one of the two methods (terminal or graphical), as a key generated with one method will not work with the other. For example, an SSH key generated using ssh-keygen in the terminal (as discussed below) cannot be used to log in with the graphical program PuTTY.
NOTE: This section is applicable for Linux, macOS, and Windows PowerShell terminals.
To generate an SSH key, open your computer's terminal and use this command:
$ ssh-keygen [<args>]
Subsequent prompts for input from the user will be displayed:
Simply pressing the Enter keyboard key without any input to the prompts will use the default options indicated in parentheses. Users may opt to put a passphrase (password) on the key pair; however, it is more convenient not to, as the passphrase will be asked for every time the user logs in.
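For example, to generate a 4096-bit RSA key pair non-interactively with a custom comment and no passphrase, the standard ssh-keygen options -t, -b, -f, -C, and -N may be combined like this (a sketch; adjust the values as needed):

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "<user>@<computer-name>" -N ""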
By default, the key pair will be stored in the ${HOME}/.ssh folder, where the private and public keys are called id_rsa and id_rsa.pub, respectively. The $HOME folder will look like /home/username for Linux/Unix, /Users/username for macOS, and C:\Users\username for Windows. For all the said platforms, the shorthand for the $HOME folder is ~ (the tilde symbol).
The private key SHOULD ONLY be accessible to the user for security reasons as this can be used by another person to log in to your account. Here is what a sample RSA-type SSH private key looks like:
And here is its corresponding public key:
The public key will be appended to the user's account so they can log in to the HPC. Make sure it is in the OpenSSH format like above — ssh-rsa <some_long_str_here> <comment> where <comment> is optional and usually takes the form <user>@<computer-name>.
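For illustration, a truncated, hypothetical public key in this format looks like:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...<truncated> username@mycomputer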
WARNING: DO NOT send your private key to anyone else, even to the COARE Team (unless explicitly required by the team).
NOTE: This section is applicable for Windows OS only.
Open the PuTTYgen application, which comes with the PuTTY installation. The default parameters at the bottom, i.e., an RSA key of 2048 bits, are already good to use:
Click on the Generate button, then randomly move the mouse pointer over the blank area to generate the key:
The PuTTYgen interface will look like this after a key has been successfully generated:
Click the Save private key button to save the private key in your computer. The public key, which will be appended to your account, is inside the box indicated by Public key for pasting into the OpenSSH authorized_keys file. Send your public key to the COARE Team to have it appended to your account. If you need the OpenSSH-formatted public key from a previously created private key, click the Load button and locate the private key.
WARNING: Again, DO NOT send your private key to anyone else, even to the COARE Team (unless explicitly required by the Team).
The user may log in to the HPC after the COARE Team has appended the public key to the user's account.
NOTE: This section is applicable for Linux, macOS, and Windows PowerShell terminals.
To log in to the HPC, use this command in your local machine's terminal:
$ ssh [-v] [-i </path/to/ssh/privKey>] <username>@saliksik.asti.dost.gov.ph
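For example, a hypothetical user trainee logging in with verbose output and an explicit key path would run:

$ ssh -v -i ~/.ssh/id_rsa trainee@saliksik.asti.dost.gov.ph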
The -i option specifies the path to your private key, which is ~/.ssh/id_rsa by default. To print more verbose messages, add the -v option; more v's increase the verbosity (i.e., -vv and -vvv), but a single -v should suffice. The front-end (log-in) node has the public domain name saliksik.asti.dost.gov.ph or IP address 202.90.149.55. After successfully logging in, the HPC's welcome page will be displayed:
The SSH parameters can be saved in a configuration file for convenient login every time. This will also come in handy later on when downloading and uploading files.
In Linux and macOS terminals, use vim, nano, or any text editor. If the file doesn't exist, it will automatically be created upon saving:
$ <vim|nano> ~/.ssh/config
For Windows PowerShell, a blank file should be created first (using ni) before editing, because notepad.exe would otherwise save a new file with a .txt filename extension, making the config file unusable:
> ni ~/.ssh/config
> notepad.exe ~/.ssh/config
Here is a sample SSH configuration file:
Host saliksik
User username
Hostname saliksik.asti.dost.gov.ph
IdentityFile ~/.ssh/id_rsa
The column spacing above is optional and only improves readability; a single space on each line will suffice. The value set for Host (in this case, saliksik) can now be used to shorten the previously full SSH login command to:
$ ssh [-v] saliksik
NOTE: This section is applicable for Windows OS only.
To log in using PuTTY, the minimum parameters needed are the username, hostname, and private key generated by PuTTYgen. Under the Session tab (the default tab), in the Host Name or (IP address) box, key in username@saliksik.asti.dost.gov.ph (or username@202.90.149.55):
Then, go to the Connection > SSH > Auth tab, and locate the private key previously created by PuTTYgen in the Private key file for authentication box:
To save the parameters, go back to the Session tab, then put a name (such as saliksik) in the Saved Sessions box, and click Save. It should be added below Default Settings. In the future, to use the saved session settings, click on the name of the saved session, then click the Load button to load the saved parameters.
Finally, click Open at the bottom portion of the window to log in to the HPC. The following security alert might appear:
If logging in using PuTTY for the very first time, then this is normal as the server's host key is not yet recognized by PuTTY. However, if the server's host key has already been previously cached, yet the alert still appeared, then kindly inform the COARE Team, as this may be a security concern.
Users are also encouraged to explore the other settings of PuTTY, such as the terminal size, font size, and colors.
The Saliksik HPC is composed of the following nodes (servers):
The front-end (log-in) node is where users log in to the HPC.
WARNING: DO NOT RUN JOBS HERE. Use DEBUG NODES instead. Violators will be subject to the COARE AUP.
Each user has the following default storage quotas:
- Home folder (/home/username): 100 GB
- Scratch folders (/scratch[1-3]/username, symlinked to /home/username/scratch[1-3]): 5 TB total for all scratch folders

The Saliksik HPC regularly undergoes maintenance and streamlining operations, so these quotas may change in the future with prior notice to users.
The home folder is intended for long-term data storage, while the scratch folders are for heavy input and output (I/O) file operations when running jobs. The scratch folders are also significantly faster than the home folder for read and write operations, so jobs should only be performed using the scratch folders, and users are prohibited from running their jobs in their home folders. Please refer to the COARE AUP for more info.
NOTE: This section is applicable for Linux, macOS, and Windows PowerShell terminals.
Remote file transfers via the terminal are done using scp or rsync. All of the commands listed here should be done on the local computer for both upload and download operations.
From your computer, upload files and/or folders with scp using the following command:
$ scp [-r] [-v] [-i </path/to/ssh/priv/key>] </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</dest/path/in/server>
The scp options -r and -v are for recursive (entire folders) transfers and verbose output, respectively. The -i </path/to/ssh/priv/key> option specifies the private SSH key file to use. If an SSH configuration file was created (for example, Host is set as saliksik), the command can be shortened into:
$ scp [-r] [-v] </source/path/in/local/machine> saliksik:</destination/path/in/server>
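For instance, to recursively upload a hypothetical local folder named data into a scratch folder using the shortened form:

$ scp -r -v ~/data saliksik:/home/username/scratch3/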
Downloading files using scp follows the same principle as above, with a minor modification: the source and destination are switched. To download, use either the long or the shortened (if there is an SSH configuration file) version of the command:
$ scp [-r] [-v] [-i </path/to/priv/ssh/key>] <username>@saliksik.asti.dost.gov.ph:</source/path/in/server> </dest/path/in/local/machine>
$ scp [-r] [-v] <host>:</source/path/in/server> </dest/path/in/local/machine>
More information about scp can be found in its manual pages:
$ man scp
CAUTION: When using rsync, the path naming is critical: a slash (/) at the end of the source path means that the contents of the folder are transferred, rather than the folder itself. For example, say that inside your local machine's home folder there is a folder called folder1 containing a file called file2.

If the folder is recursively transferred into /home/username/dest in the HPC using, for example, the command $ rsync -avhP ~/folder1 saliksik:/home/username/dest, the entire folder is transferred into /home/username/dest, so file2 will be stored as /home/username/dest/folder1/file2.

However, when $ rsync -avhP ~/folder1/ saliksik:/home/username/dest (mind the / after folder1) is used, only the contents of folder1 are transferred, so file2 will be stored as /home/username/dest/file2 in the HPC. That subtle / at the end makes a significant difference and may affect your file and folder transfers.
One key advantage of rsync over scp is that rsync only transfers the differences between the source and destination: when it detects that a file is identical on both ends, it skips the transfer immediately. Interrupted rsync transfers can also be resumed later without starting over from the beginning, unlike with scp, which overwrites the destination file even if the source and destination are exactly the same.
From your computer, upload files with rsync using the following command:
$ rsync [-a] [-v] [-h] [-P] [-e "ssh -i </path/to/ssh/priv/key>"] </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</destination/path/in/server>
The rsync options -a, -v, -h, and -P are for archive mode (-a), verbose output (-v), human-readable output (-h), and keeping partially transferred files while showing progress (-P), respectively. Unlike scp, rsync does not accept the private key via -i (rsync reserves -i for --itemize-changes); instead, specify the key through the -e option, which sets the remote shell command, as shown above. If an SSH configuration file is set (for example, Host is also set as saliksik), the command can be shortened to:
$ rsync [-avhP] </source/path/in/local/machine> saliksik:</destination/path/in/server>
To download files and/or folders, use either the long or shortened (again, if there is an SSH configuration file) version of the command:
$ rsync [-avhP] [-e "ssh -i </path/to/priv/ssh/key>"] <username>@saliksik.asti.dost.gov.ph:</source/path/in/server> </dest/path/in/local/machine>
$ rsync [-avhP] <host>:</source/path/in/server> </dest/path/in/local/machine>
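As a concrete illustration (hypothetical paths), an interrupted upload of a large folder can be resumed simply by rerunning the same command; rsync will skip whatever has already been transferred:

$ rsync -avhP ~/data saliksik:/home/username/scratch3/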
For more information about rsync and its options, refer to its manual pages:
$ man rsync
NOTE: This section is applicable for Windows OS only.
One non-terminal option to transfer files and folders to and from the HPC for Windows users is WinSCP.
To log in to the HPC, enter the following parameters in its interface:
- File protocol: SFTP
- Host name: saliksik.asti.dost.gov.ph (or 202.90.149.55)
- Port number: 22
Then, click the Advanced... button which will bring up the Advanced Site Settings window:
Navigate to the SSH > Authentication tab and locate the private SSH key file generated using PuTTYgen:
Click OK to go back to the login interface. After configuring the login, click the Login button to connect to the HPC. Upon successful login, WinSCP will show the Commander interface where local and remote files are shown on the left and right portions, respectively:
Uploading and downloading files to and from the HPC is as simple as "drag and drop" using WinSCP. Users are encouraged to explore the other settings and features of WinSCP such as displaying hidden files (with dot prefixes, e.g., .bashrc), etc.
Modules allow multiple installed versions of a program to coexist without interfering with each other, effectively keeping each version in a sandboxed environment. In other words, modules allow programs to be used in isolation from others, avoiding possible incompatibilities and inconsistencies. Note, however, that the COARE Team is gradually phasing out modules in favor of Anaconda environments; modules are still used for programs that are not available in the Anaconda repository (anaconda.org).
Modules have the format <module_name>/<version>, for example: anaconda/3-2023.07-2.
Without any argument, this command will list all available versions of all installed modules. When one or more module names are provided, the available versions for the modules are listed:
$ module avail [<module1/version> <module2/version> ...]
For example, running module avail without additional arguments will print a list of modules similar to the following; the list is not exhaustive, as it is constantly being updated:
On the other hand, when using the command module avail gromacs for example, the available versions of the gromacs module are listed:
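The output will look something like the following (hypothetical module path and versions, shown for illustration only):

$ module avail gromacs
----------------- /path/to/modulefiles -----------------
gromacs/2021.4  gromacs/2023.3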
$ module load <module1/version> [<module2/version> ...]  ## load one or more modules
$ module list                                            ## list the currently loaded modules
$ module reload                                          ## reload all currently loaded modules
$ module unload <module1/version> [<module2/version>]    ## unload specific modules
$ module purge                                           ## unload all loaded modules
Anaconda is a package and environment manager written primarily in Python. Its official website is anaconda.org.
Anaconda's default package manager is conda, although in practice, mamba is better to use because it's much more efficient and its warning and error messages are more intuitive. However, it's still a good idea to be able to use them both.
As of writing, the latest Anaconda module is anaconda/3-2023.07-2. With previous Anaconda module versions, running $ conda activate would produce an error saying that the shell had not yet been initialized, requiring changes to the ~/.bashrc script. Loading this module automatically initializes conda and mamba, so no such modification is needed.
$ module load anaconda/3-2023.07-2 # use the latest available
$ conda activate # or mamba activate; activate base env
The default locations for the conda environments and packages are ~/.conda/envs and ~/.conda/pkgs, respectively. The environments folder is the path prefix where environments are created (e.g., an environment named test will be created in ~/.conda/envs/test by default), while the packages folder is where the installers are downloaded and cached. Both folders are stored in the user's home folder by default. However, the home folder is significantly slower than the scratch folders, and its storage quota is significantly smaller. Thus, to maximize job performance later on, the default paths for both folders will be changed to one of the scratch folders. The configuration set here will also be used by mamba.
To do this, use the following commands:
# <scratch> can be "scratch1", "scratch2", or "scratch3"
# (e.g., /scratch3/trainee/conda/envs)
$ conda config --add envs_dirs /<scratch>/<username>/conda/envs
$ conda config --add pkgs_dirs /<scratch>/<username>/conda/pkgs
If these paths don't exist, conda will automatically create them during package download or environment creation. To confirm the configuration, check that the conda configuration file (~/.condarc) has been updated; it should look like this (YAML format):
$ cat ~/.condarc
envs_dirs:
- /scratch3/username/conda/envs
pkgs_dirs:
- /scratch3/username/conda/pkgs
The ~/.condarc file may also be created and/or modified directly without running the above commands. Additional paths may be supplied to the envs_dirs and pkgs_dirs parameters, which is useful in cases where, say, the first path becomes full or the user has no write permission for it, in which case the next path is used, and so on (see the sketch after this paragraph). The paths point directly to the scratch folders (/scratch[1-3]/username) instead of those in the user's home folder (/home/username/scratch[1-3]) because the former are the actual paths of the scratch folders, while the latter are only symlinks.
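For illustration, a ~/.condarc with hypothetical fallback paths might look like this, where conda tries each path in order:

envs_dirs:
  - /scratch3/username/conda/envs
  - /scratch1/username/conda/envs
pkgs_dirs:
  - /scratch3/username/conda/pkgs
  - /scratch1/username/conda/pkgs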
CAUTION: Creating environments may consume significant computational resources, which is not allowed on the front-end node. This operation should be performed on a compute node; therefore, the commands discussed here should be submitted as a SLURM job. Refer to the next section (SLURM) on how to submit a job.
To create an Anaconda environment, simply use the following command template:
$ mamba create [-y] <-n env_name | -p env_path> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] <package2>[=<version2>=<build2>] ...
The -y argument is optional and tells mamba to assume "yes" as the answer to all of its prompts. However, -y is required when creating the environment via SLURM: a batch job cannot answer interactive prompts, so it will fail without it. If the version and build of the package(s) are not defined, the latest available will be installed.
For example, to create an environment named myenv containing the package hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer):
$ mamba create [-y] -n myenv -c bioconda hmmer
Of course, multiple channels and packages may be used, such as hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer) and sqsgenerator from the conda-forge channel (https://anaconda.org/conda-forge/sqsgenerator):
$ mamba create [-y] -n myenv -c bioconda -c conda-forge hmmer sqsgenerator
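Since environment creation must run on a compute node, the command above is typically wrapped in a job script. Here is a minimal sketch, assuming the debug partition and a placeholder group account, with mamba available after loading the Anaconda module as described earlier:

#!/bin/bash
#SBATCH --account=<slurm_group_acct>
#SBATCH --partition=debug
#SBATCH --qos=debug_default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name="create-env"
#SBATCH --output="%x.out"

## Load Anaconda, then create the environment non-interactively (-y).
module purge
module load anaconda/3-2023.07-2
mamba create -y -n myenv -c bioconda -c conda-forge hmmer sqsgenerator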
To list the environments visible to the user, use the following command template:
$ <mamba|conda> env list # or: <mamba|conda> info <-e|--envs>
To activate an environment, use the following command template:
$ <mamba|conda> activate <env_name|env_path>
To remove an environment:
$ <mamba|conda> env remove <-n env_name | -p env_path>
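For example, removing the myenv environment created earlier by name:

$ mamba env remove -n myenv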
CAUTION: Installing packages may consume significant computational resources, which is not allowed on the front-end node. This operation should be performed on a compute node; therefore, the commands discussed here should be submitted as a SLURM job. Refer to the next section (SLURM) on how to submit a job.
To install packages into an existing environment in a single line, use the following command template:
$ mamba install [-y] <-n env_name | -p env_path> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>]
To remove packages:
$ mamba remove [-y] <-n env_name | -p env_path> <package1>[=<version1>=<build1>]
The above commands may also be done by activating the environment first, prior to package installation or removal:
$ module load anaconda/3-2023.07-2
$ conda activate <env_name | env_path>
$ mamba install [-y] <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] ...
$ mamba remove [-y] <package1>[=<version1>=<build1>] ...
A package may have different versions and builds available. For example, the pytorch package in the pytorch channel (https://anaconda.org/pytorch/pytorch) has multiple versions available, and each version has multiple builds:
In the above screenshot, the linux-64 architecture offers multiple builds for version 1.11.0, namely: py3.10_cuda11.1_cudnn8.0.5_0, py3.10_cuda11.3_cudnn8.2.0_0, py3.10_cuda11.5_cudnn8.3.2_0, and py3.7_cpu_0. The other, newer versions have multiple builds as well. The build for each package may be inferred from the name or viewed by clicking the ⓘ (info) icon, for example:
In the above example, pytorch may be installed by simply specifying the version, like so:
$ module load anaconda/3-2023.07-2
$ mamba create [-y] -n myenv -c conda-forge pytorch=1.11.0
However, there may be instances where you need to install the CUDA-enabled (GPU) build but the latest build is CPU-only, so the above command would install the CPU build of pytorch version 1.11.0. To install the CUDA-enabled build, for example, py3.10_cuda11.1_cudnn8.0.5_0, use the command below (hint: this should be submitted to a GPU-capable node such as those in the gpu partition):
$ module load anaconda/3-2023.07-2
$ mamba create [-y] -n myenv -c conda-forge pytorch=1.11.0=py3.10_cuda11.1_cudnn8.0.5_0
This operation may be done interactively, so there is no need to submit it via SLURM. To list the packages installed in an environment, there are two ways:
Activate the environment, then list the packages:
$ <conda|mamba> activate <env_name | env_path>
$ <conda|mamba> list
Or, list the packages directly:
$ <conda|mamba> list <-n env_name | -p env_path>
SLURM is the job and resource manager used in the HPC. Its official online documentation is at https://slurm.schedmd.com/documentation.html.
The compute nodes previously listed are grouped into partitions, and each partition has its default QOS. The default partition is debug. For all QOSes, the maximum number of concurrently running jobs is 30, while the maximum number of submitted jobs is 45.
| Partition | Nodes | QOS | Limits | Remarks |
|---|---|---|---|---|
| debug | saliksik-cpu-[21-22] | debug_default | 86 CPUs, 1 day run time | |
| batch | saliksik-cpu-[01-20,25-36] | batch_default | 86 CPUs, 7 days run time | |
| serial | saliksik-cpu-[23-24] | serial_default | 86 CPUs, 14 days run time | |
| gpu | saliksik-gpu-[01-06] | gpu-p40_default | 12 CPUs, 1 GPU, 3 days run time | To use the GPU, add either the #SBATCH --gres=gpu:p40:1 or #SBATCH --gres=gpu:1 job parameter |
| gpu_a100 | saliksik-gpu-[09-10] | gpu-a100_default | 14 CPUs, 1 GPU, 3 days run time | |
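For example, a job submitted to the gpu partition would include the following parameters (values taken from the table above):

#SBATCH --partition=gpu
#SBATCH --qos=gpu-p40_default
#SBATCH --gres=gpu:1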
These are the job parameters that are required before running any job:
- --account: (string) group account where job quotas are set;
- --partition: (string) the partition to which the job will be submitted;
- --qos: (string) the appropriate QOS in the partition;
- --nodes: (integer) number of nodes to request;
- --ntasks: (integer) total number of CPUs to request;
- --output: (string) job log file.
On the other hand, these are some of the optional job parameters:
- --ntasks-per-node: (integer) number of CPUs per node to request (must not contradict --ntasks if also specified);
- --mem: (string) memory per node (e.g., 1G, 500K, 4GB, etc.);
- --job-name: (string) name for the job; displayed in job monitoring commands (as discussed later);
- --error: (string) job error file; it is recommended not to define this parameter and to use only --output instead;
- --requeue: (no argument) make the job eligible for requeue;
- --mail-type: (string) send an email to the user when the job reaches the specified status, such as NONE, BEGIN, END, FAIL, REQUEUE, ALL, etc. (see the sbatch manual for more info);
- --mail-user: (string) the user's email address.

For other parameters or more info regarding those listed above, see the sbatch manual using the following command, or go to the online manual.
$ man sbatch
A job script is submitted to allocate resources for a job. The previously discussed job parameters and the commands to be used to run the job are placed here.
Here is a sample job script where comments have been included to describe what each block does:
#!/bin/bash
#SBATCH --account=<slurm_group_acct>
#SBATCH --partition=<partition>
#SBATCH --qos=<qos>
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks=<num_cpus>
#SBATCH --job-name="<jobname>"
#SBATCH --output="%x.out" ## <jobname>.out
##SBATCH --mail-type=ALL ## optional
##SBATCH --mail-user=<email_add> ## optional
##SBATCH --requeue ## optional
##SBATCH --ntasks-per-node=1 ## optional
##SBATCH --mem=24G ## optional: mem per node
##SBATCH --error="%x.%j.err" ## optional; better to use --output only
## For more `sbatch` options, use `man sbatch` in the HPC, or go to https://slurm.schedmd.com/sbatch.html.
## Set stack size to unlimited.
ulimit -s unlimited
## Benchmarking.
start_time=$(date +%s.%N)
## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE : ${SLURM_MEM_PER_NODE}"
## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmpdir="yes"
if [[ $custom_tmpdir == "yes" ]]; then
JOB_TMPDIR=/tmp/${USER}/${SLURM_JOB_ID}
mkdir -p ${JOB_TMPDIR}
export TMPDIR=${JOB_TMPDIR}
echo "TMPDIR : $TMPDIR"
fi
## Reset modules.
module purge
module load <module1> [<module2> ...]
## Main job. Run your codes and executables here; `srun` is optional.
[srun] /path/to/exe1 <arg1> ...
[srun] /path/to/exe2 <arg2> ...
## Flush the TMPDIR.
if [[ $custom_tmpdir == "yes" ]]; then
rm -rf $TMPDIR
echo "Cleared the TMPDIR (${TMPDIR})"
fi
## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"
It is recommended to submit the job inside the folder containing the job script. It is also recommended that any and all input and/or output files be within the same folder where the job script is located. This is to avoid changing working directories, which may result in confusion and possible errors in accessing files/folders. For example, if the job folder is at /home/username/scratch3/test-job, where all the necessary input files are stored together with the job script named job.sbatch:
$ cd /home/username/scratch3/test-job
$ sbatch job.sbatch
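If the submission is accepted, sbatch prints the assigned job ID (the number shown here is hypothetical), which is used by the monitoring commands below:

Submitted batch job 123456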
NOTE: In the following commands,
nodelist can be written as a single node (e.g., saliksik-cpu-01) or a combination (e.g., saliksik-cpu-[01-10], saliksik-cpu-[01-10,15], saliksik-gpu-01, etc.). On the other hand, the partition argument can also be a combination (e.g., batch,gpu, etc.).
If no argument is passed, all jobs in the queue will be displayed.
$ squeue [-u <username>] [-p <partition>] [-w <nodelist>]
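For example, to show only your own jobs in the debug partition (assuming the hypothetical username trainee):

$ squeue -u trainee -p debug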
$ scontrol show job <job_id> # or jobid=<job_id>
$ sinfo [-p <partition> | -n <nodelist>]
You may only cancel jobs created under your account.
$ scancel <job_id1> [<job_id2> ...]
Test your knowledge and skills acquired from this module by performing the following tasks.
For your first task, create an Anaconda environment via a SLURM job. The environment should have the following specifications:
- Environment name: mytestenv
- Channel: conda-forge
- Package: openmpi-mpicc version 4.1.6
Into the environment created above, install the following packages via another SLURM job:
- Channels: conda-forge and pytorch
- Packages:
  - GROMACS version 2023.3, build mpi_openmpi_dblprec_hecbbb8f_0
  - pytorch-cuda version 11.8
Create a file (in any of your scratch folders) containing the following sample source code:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n", processor_name, world_rank, world_size);
// Finalize the MPI environment.
MPI_Finalize();
}
This is a simple "hello world" program written in C to test whether the right number of processors is spawned as allocated. For this example, the file is named mpi_hello_world.c, which will be compiled and executed using the mpicc and mpiexec executables, respectively; both were installed during the creation of the mytestenv environment. The following job script, named mpi_hello_world.sbatch, will be used:
#!/bin/bash
#SBATCH --account=<slurm_grp_acct>
#SBATCH --partition=debug
#SBATCH --qos=debug_default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name="mpi_hello_world"
#SBATCH --output="%x.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.com
#SBATCH --requeue
## Set stack size to unlimited.
ulimit -s unlimited
## Benchmarking.
start_time=$(date +%s.%N)
## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE : ${SLURM_MEM_PER_NODE}"
## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmp="no"
if [[ $custom_tmp == "yes" ]]; then
JOB_TMPDIR=/tmp/${USER}/${SLURM_JOB_ID}
mkdir -p ${JOB_TMPDIR}
export TMPDIR=${JOB_TMPDIR}
echo "TMPDIR : ${TMPDIR}"
fi
## Reset modules.
module purge
module load anaconda/3-2023.07-2
## Main job. Run your codes and executables here. `srun` is optional.
conda activate mytestenv
mpicc mpi_hello_world.c -o mpi_hello_world.exe
mpiexec -n ${SLURM_NTASKS} ./mpi_hello_world.exe
## Flush the TMPDIR.
if [[ $custom_tmp == "yes" ]]; then
rm -rf $TMPDIR
echo "Cleared the TMPDIR (${TMPDIR})"
fi
## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"
In the above job script, the source code is compiled using mpicc and the resulting binary file (mpi_hello_world.exe) is executed using mpiexec, where the number of processors is the same as that defined for the #SBATCH --ntasks parameter. It is expected that this job will only spawn a single processor. This may be confirmed by checking the resulting output file named mpi_hello_world.out.
Experiment with modifying the --ntasks parameter to see if the same number of processors are spawned. Another experiment to try is to set inconsistent values where mpiexec uses more processors than allocated, such as --ntasks=5 but mpiexec -n 10, which should be expected to result in an error. You can also try setting the opposite, where the number of processors allocated exceeds the number mpiexec will use to see how it will turn out.
It is also important to note the resulting total run time as the job parameters change. Hence, the job log includes the message Total runtime (sec): <seconds>. For this activity, any difference in run time is irrelevant because no heavy compute workload is being done.
For actual compute jobs, however, this is a crucial step in benchmarking to see which combination of job parameters are optimal. As shown in the figure below, the relationship between run time vs. number of processors used is not linear — compute performance will plateau (have little to no change) past a critical point. In the particular example below, the optimal number of processors is around 8. Therefore, it is essential to run benchmark tests before performing actual production runs.
Congratulations on completing the Basic HPC Usage Module. At this point, you should have learned how to:
Moving forward, users are enjoined to: