You can find the slides for this tutorial here.
This tutorial assumes you have an NYU HPC user account. If you don't have one, you may apply for an account here.
Introduction to the Prince Cluster
A Linux cluster consists of hundreds of computing nodes interconnected by high-speed networks. The Linux operating system runs on each node individually. The resources are shared among many users for their technical or scientific computing purposes. Slurm is a cluster software layer built on top of the interconnected nodes; it orchestrates the nodes' computing activities so that users can view the cluster as a unified, enhanced and scalable computing system. On the NYU HPC clusters, users come from many departments and disciplines, each with their own computing projects, and impose very diverse requirements on hardware, software resources and processing parallelism. Users submit jobs, which compete for computing resources. Slurm is both a resource manager and a job scheduler, designed to allocate resources and schedule jobs. It is open-source software with a large user community, and it has been installed on many of the Top 500 supercomputers.
This tutorial also assumes you are comfortable with the Linux command-line environment. To learn about Linux, please read Tutorial 1.
Prince computing nodes
|Nodes||Cores/Node||CPU Type||Memory Available To Jobs (GB)||GPU Cards/Node||Names|
|68||28||Intel(R) Broadwell @ 2.60GHz||125||-||c[01-17]-[01-04]|
|32||28||Intel(R) Broadwell @ 2.60GHz||250||-||c[18-25]-[01-04]|
|32||20||Intel(R) Haswell @ 2.60GHz||62||-||c[26-27]-[01-16]*|
|176||20||Intel(R) IvyBridge @ 3.00GHz||62||-||c[28-38]-[01-16]|
|48||20||Intel(R) IvyBridge @ 3.00GHz||188||-||c[39-41]-[01-16]|
|4||40||Intel(R) Skylake @ 2.40GHz||187.5||-||c42-[01-04]|
|12||40||Intel(R) Cascade Lake @ 2.50GHz||187||-||c[43-45]-[01-04]|
|4||20||Intel(R) Haswell @ 3.10GHz||500||-||c99-[01-04]|
|2||48||Intel(R) IvyBridge @ 3.00GHz||1510||-||c99-[05-06]|
|2||64||Intel(R) Xeon Phi @ 1.30GHz||186 (+ 16GB MCDRAM)||-||c99-[07-08]|
|1||16||Intel(R) Skylake @ 3.50GHz||1500||-||c99-09|
|9||28||Intel(R) Broadwell @ 2.60GHz||250||4 Tesla K80||gpu-[01-09]|
|4||28||Intel(R) Broadwell @ 2.60GHz||125||4 GeForce GTX 1080||gpu-[10-13]|
|8||20||Intel(R) IvyBridge @ 2.50GHz||125||8 Tesla K80||gpu-[23-30]|
|24||28||Intel(R) Broadwell @ 2.60GHz||250||4 Tesla P40||gpu-[31-54]|
|8||28||Intel(R) Broadwell @ 2.60GHz||250||4 Tesla P100||gpu-[60-67]|
|6||40||Intel(R) Skylake @ 2.40GHz||376||4 Tesla V100||gpu-[68-73]|
|1||40||Intel(R) Skylake @ 2.40GHz||185||2 Tesla V100||gpu-90|
|Space||Environment Variable||Purpose||Flushed?||Quota|
|/archive||$ARCHIVE||Long-term storage||NO||2 TB|
|/home||$HOME||Small files, code||NO||20 GB|
|/beegfs||$BEEGFS||File staging - workflows with many small files||YES. Files unused for 60 days are deleted||2 TB|
|/scratch||$SCRATCH||File staging - frequent writing and reading||YES. Files unused for 60 days are deleted||5 TB|
For more details on the nodes and the file systems' hardware configuration, please see the "Cluster - Prince" page.
The Prince picture
NOTE: Alternatively, instead of logging in to the bastion hosts, you can use VPN to get inside NYU's network and access the HPC clusters directly. Instructions on how to install and use the VPN client are available here.
NOTE: You can't do anything on the bastion hosts except ssh to the HPC clusters (Prince, Dumbo).
Connecting to Prince
Logging in to the Prince cluster and submitting jobs is analogous to the triple jump, an Olympic event that originated in ancient Greece. First, open a terminal on your Mac workstation. If your workstation is outside the NYU network, follow these three steps:
Hop - from your workstation, ssh to a bastion host, gw.hpc.nyu.edu
Step - from the bastion host, ssh to the Prince cluster login node, prince.hpc.nyu.edu
Jump - from a login node, run "sbatch" or "srun" to submit jobs, which will land on the compute node(s)
If you are inside the NYU network, the first step, the 'hop', can be omitted.
See for instance a complete HPC session:
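The login part of the triple jump can be sketched as two small bash helpers. This is only an illustration under stated assumptions: the function names are hypothetical, and `ab1234` is a placeholder NetID that you should replace with your own.

```shell
#!/bin/bash
# Placeholder NetID (assumption); export NETID=<your NetID> before use.
NETID="${NETID:-ab1234}"

# Hop + Step: log in to the bastion host and, from there, open a second
# ssh session to the Prince login node. -t allocates a terminal on the
# bastion host so the inner ssh can prompt for your password.
prince_login() {
    ssh -t "${NETID}@gw.hpc.nyu.edu" ssh "${NETID}@prince.hpc.nyu.edu"
}

# Inside the NYU network (or on VPN) the hop is not needed.
prince_login_direct() {
    ssh "${NETID}@prince.hpc.nyu.edu"
}
```

After sourcing this file, run `prince_login` from outside the NYU network or `prince_login_direct` from inside it. The final "jump" (sbatch or srun) is then typed on the login node.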
For access from a Windows workstation using PuTTY, follow the steps below.
Enter "gw.hpc.nyu.edu" for the host name, and leave the port at its default, "22". If you want, enter a name for the saved session, e.g. "hpcgw", and click "Save" to reuse it next time. Click "Open".
Click "Yes" when a prompt window appears.
Enter your NetID and password. This gets you onto one of our bastion hosts.
On the bastion host, enter the command "ssh prince.hpc.nyu.edu" (or "ssh prince" for the short hostname), answer "yes" to the question, and type your NetID password. If everything goes smoothly, you will land on one of the Prince login nodes.
Note: ssh tunnelling is not required for Slurm tutorial classroom exercises.
Describing Slurm commands
Submit jobs - [sbatch]
Batch jobs are submitted with the sbatch command. As with qsub in Torque, we create a bash script that describes the job: the resources we need (such as memory and CPUs), the software and processing we want to run, and where to send the job's standard output and error. After a job is submitted, Slurm finds suitable resources, schedules and drives the job's execution, and reports the outcome back to the user. The user can then return to look at the output files.
The script is given in /share/apps/Tutorials/slurm/example. Below is an annotated version with detailed explanation of the SBATCH directives used in the script:
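As a rough illustration of what such a script can look like, the sketch below writes a small batch script to the current directory. It is not the script shipped in the tutorial directory: the file name `run-test.sbatch`, the job name and all resource values are assumptions chosen for the example.

```shell
# Write a minimal batch script (hypothetical example, not the tutorial's own).
cat > run-test.sbatch <<'EOF'
#!/bin/bash
# Name shown in the squeue listing.
#SBATCH --job-name=myTest
# One compute node, one task, one CPU core for that task.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
# Wall-time limit (HH:MM:SS) and memory per node.
#SBATCH --time=00:10:00
#SBATCH --mem=2G
# Log files for standard output and error; %j expands to the job ID.
#SBATCH --output=slurm_%j.out
#SBATCH --error=slurm_%j.err

# Everything below runs on the allocated compute node.
echo "Running on $(hostname)"
sleep 30
EOF

echo "Wrote run-test.sbatch"
```

On a Prince login node, `sbatch run-test.sbatch` submits the script and prints a line of the form "Submitted batch job <jobID>".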
Once the job has been submitted successfully, sbatch reports its job ID; in this example the job ID is 11615. Usually we should let the scheduler decide which nodes to run jobs on. If you need to request a specific set of nodes, use the nodelist directive, e.g. '#SBATCH --nodelist=c09-01,c09-02'.
Check cluster status - [sinfo, squeue]
The sinfo command gives information about the cluster status, by default listing all the partitions. Partitions group computing nodes into logical sets, which serve various functions such as interactivity, visualization and batch processing.
Two useful sinfo examples: the first lists the nodes in the idle state in the gpu partition; the second outputs information in a node-oriented format.
The squeue command lists jobs that are running, waiting, completing, and so on. It can also display the jobs owned by a specific user or a job with a specific job ID.
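The queries described above can be wrapped in small bash functions, as in the sketch below. The function names are hypothetical; the sinfo and squeue options are standard Slurm flags, and the partition name `gpu` is taken from the text above.

```shell
# Nodes in the idle state in the gpu partition.
idle_gpu_nodes() {
    sinfo -p gpu -t idle
}

# Node-oriented listing: -l long format, -N one line per node,
# -e do not group nodes with differing configurations.
node_view() {
    sinfo -lNe
}

# Jobs owned by the current user.
my_jobs() {
    squeue -u "$(id -un)"
}
```

Each function simply runs the command shown in its body, so the commands can equally be typed directly on a login node.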
Run 'man sinfo' or 'man squeue' to see the explanations for the results.
Check job status - [squeue, sstat, sacct]
With the job ID in hand, we can track the job through its lifetime. The job first appears in the Slurm queue in the PENDING state. When its required resources become available and its priority gives it a turn to run, the job is allocated resources and moves to the RUNNING state. Once the job runs to the end and completes successfully, it goes to the COMPLETED state; otherwise it ends in the FAILED state. Use squeue -j <jobID> to check a job's status.
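A minimal sketch of how the three commands in this section's title fit together is given below. The helper names `job_state` and `job_usage` are hypothetical; the sketch assumes that job accounting is enabled, so that sacct can report on jobs that have already left the queue.

```shell
# Print the state of a job (PENDING, RUNNING, COMPLETED, FAILED, ...).
job_state() {
    local jobid="$1" state
    # squeue only knows jobs that are still queued or running.
    # -h drops the header line, %T prints the state in its long form.
    state="$(squeue -h -j "$jobid" -o '%T' 2>/dev/null)"
    if [ -z "$state" ]; then
        # Finished jobs drop out of the queue; ask the accounting records.
        # -n drops the header, -X reports the job itself, not its steps.
        state="$(sacct -n -X -j "$jobid" -o State 2>/dev/null)"
    fi
    # Keep only the first word of each line (sacct pads its output).
    echo "$state" | awk '{print $1}'
}

# Show memory and CPU usage of the batch step of a running job.
job_usage() {
    sstat -j "$1.batch" --format=JobID,MaxRSS,AveCPU
}
```

For the job submitted earlier, `job_state 11615` would print its current state and `job_usage 11615` its usage so far.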
Cancel a job - [scancel]
Things can go wrong or turn out in unexpected ways. Should you decide to terminate a job before it finishes, scancel is the tool for that.
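Two common ways of calling scancel are sketched below as bash functions. The function names are hypothetical; the `-u` and `--state` options are standard scancel flags.

```shell
# Cancel a single job by its job ID.
cancel_job() {
    scancel "$1"
}

# Cancel all of your own jobs that are still waiting in the queue,
# leaving running jobs untouched.
cancel_my_pending() {
    scancel -u "$(id -un)" --state=PENDING
}
```

For example, `cancel_job 11615` would cancel the job submitted earlier.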
Checking job history - [sacct]
If you need to check your history of jobs, sacct is the command to use.
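As a sketch, the function below lists your jobs since a given date with a few commonly useful columns. The name `job_history` is hypothetical, and the column selection is only one reasonable choice; 'man sacct' lists all available fields.

```shell
# List your own jobs that ran since the given date (YYYY-MM-DD).
job_history() {
    sacct -u "$(id -un)" --starttime="$1" \
          --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
}
```

For example, `job_history 2019-01-01` shows every job of yours recorded since the start of 2019.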
Look at job results
Job results include the job execution logs (standard output and error) and, of course, any output data files defined when the job was submitted. Log files are created in the working directory, and output data files in the directory you specified. Examine the log files with a text viewer or editor to get a rough idea of how the execution went. Open the output data files to see exactly what results were generated. Run the sacct command to see resource usage statistics. Should you decide that the job needs to be rerun, submit it again with sbatch, using a modified batch script and/or an updated execution configuration. Iteration is one characteristic of a typical data analysis!
Software and Environment Modules
Environment Modules is a tool for managing multiple versions and configurations of software packages, and is used by many HPC centers around the world. With Environment Modules, software packages are installed away from the base system directories, and for each package an associated modulefile describes what must be altered in a user's shell environment - such as the $PATH environment variable - in order to use the software package. The modulefile also describes dependencies and conflicts between this software package and other packages and versions.
To use a given software package, you load the corresponding module. Unloading the module afterwards cleanly undoes the changes that loading the module made to your environment, thus freeing you to use other software packages that might have conflicted with the first one.
Working with software packages on the NYU HPC clusters.
|module avail||Check what software packages are available|
|module whatis module-name||Find out more about a software package|
|module help module-name||A module file may include more detailed help for the software package|
|module show module-name||See exactly what effect loading the module will have|
|module list||Check which modules are currently loaded in your environment|
|module load module-name||Load a module|
|module unload module-name||Unload a module|
|module purge||Remove all loaded modules from your environment|
Running interactive jobs
The majority of jobs on the Prince cluster are submitted with the sbatch command and executed in the background. The steps and workflows of these jobs are predefined by users, and their execution is driven by the scheduler.
There are cases when users need to run applications interactively (interactive jobs). Interactive jobs allow the users to enter commands and data on the command line (or in a graphical interface), providing an experience similar to working on a desktop or laptop. Examples of common interactive tasks are:
Since the login nodes of the Prince cluster are shared between many users, running interactive jobs that require significant computing and IO resources on the login nodes will impact many users.
Interactive jobs on Prince Login nodes
Running compute and IO intensive interactive jobs on the Prince login nodes is not allowed. Jobs may be removed without notice.
Instead of running interactive jobs on Login nodes, users can run interactive jobs on Prince Compute nodes using SLURM's srun utility. Running interactive jobs on compute nodes does not impact many users and in addition provides access to resources that are not available on the login nodes, such as interactive access to GPUs, high memory, exclusive access to all the resources of a compute node, etc. There is no partition on Prince that has been reserved for Interactive jobs.
Through srun, SLURM provides rich command-line options for users to request resources from the cluster for interactive jobs. Please see some examples and short accompanying explanations in the code block below, which should cover many of the use cases.
In srun there is an option "--x11", which enables X forwarding, so programs using a GUI can be used during an interactive session (provided you have X forwarding to your workstation set up). If necessary, please read the wiki pages on how to set up X forwarding for Windows and Linux / Mac workstations. NOTE: X forwarding is not required for Slurm tutorial classroom exercises.
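The sketch below wraps a typical interactive request in one bash function. The function name `interactive_shell` and the resource values in the usage comments are assumptions for illustration; the srun options themselves (`--pty`, `-c`, `--mem`, `-t`, `--gres`, `--x11`) are standard.

```shell
# Start an interactive bash shell on a compute node.
# Any arguments are passed through to srun as resource requests.
interactive_shell() {
    srun "$@" --pty /bin/bash
}

# Example calls (run on a Prince login node):
#   interactive_shell                              # default resources
#   interactive_shell -c 4 --mem=4G -t 2:00:00     # 4 cores, 4 GB, 2 hours
#   interactive_shell --gres=gpu:1                 # one GPU
#   interactive_shell --x11                        # with X forwarding
```

When the shell prompt appears you are on a compute node; typing `exit` ends the session and releases the resources.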
Running R batch Job
Long-running, data-crunching jobs ought to be submitted as batch jobs, so that they run in the background and Slurm drives their execution. Below are an R script, "example.R", and a job script that can be used with the sbatch command to send a job to Slurm:
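The sketch below writes a stand-in for both files. Neither is the tutorial's original: the R code, the file name `run-R.sbatch`, the resource values, and in particular the module name `r` are assumptions. Check `module avail` for the exact R module name on the cluster.

```shell
# A tiny R script (hypothetical stand-in for the tutorial's example.R).
cat > example.R <<'EOF'
x <- rnorm(1000)
print(summary(x))
EOF

# A job script that runs example.R in batch mode.
cat > run-R.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=RTest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem=2G

# Start from a clean environment, then load R.
# "r" is a placeholder module name (assumption).
module purge
module load r

# Run from the directory where sbatch was invoked.
cd "$SLURM_SUBMIT_DIR"
Rscript example.R > example.out 2>&1
EOF

echo "Wrote example.R and run-R.sbatch"
```

Submit it with `sbatch run-R.sbatch`; when the job finishes, the R output is in `example.out` in the submission directory.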
R Interactive session
The following example shows how to work with an interactive R session on a compute node:
Running GPU jobs
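A minimal sketch of a GPU batch script is shown below, assuming GPUs are requested through the standard `--gres=gpu:N` option and that `nvidia-smi` is available on the GPU nodes. The file name and resource values are hypothetical.

```shell
cat > run-gpu.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpuTest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00
#SBATCH --mem=4G
# Request one GPU of any type on the node.
#SBATCH --gres=gpu:1

# Show which GPU(s) Slurm made visible to this job.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
nvidia-smi
EOF

echo "Wrote run-gpu.sbatch"
```

Submit it with `sbatch run-gpu.sbatch`. Without the `--gres` line the job would not be given a GPU, even if it lands on a GPU node.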
Running array jobs
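A job array runs the same script many times, with SLURM_ARRAY_TASK_ID telling each task which piece of work is its own. The sketch below is a hypothetical example: the file name, the array range and the `input-N.dat` naming are assumptions.

```shell
cat > run-array.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=arrayTest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --mem=1G
# Five tasks with indices 1 to 5.
#SBATCH --array=1-5
# %A is the job ID of the whole array, %a the index of this task.
#SBATCH --output=array_%A_%a.out

# Each task selects its own input file from its index.
# The fallback value 1 only matters when the script is run outside Slurm.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
INPUT="input-${TASK_ID}.dat"
echo "Task ${TASK_ID} would process ${INPUT}"
EOF

echo "Wrote run-array.sbatch"
```

A single `sbatch run-array.sbatch` queues all five tasks, and each one writes its own log file.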
SLURM_* environment variables
To get the list of SLURM_* variables, you can run a job that prints them, e.g. srun sh -c 'env | grep SLURM | sort'. The command 'man sbatch' explains what these variables stand for. Below are a few frequently used ones:
SLURM_JOB_ID - the job ID
SLURM_SUBMIT_DIR - the job submission directory
SLURM_SUBMIT_HOST - name of the host from which the job was submitted
SLURM_JOB_NODELIST - names of nodes allocated to the job
SLURM_ARRAY_TASK_ID - job array index
SLURM_JOB_CPUS_PER_NODE - CPU cores on this node allocated to the job
SLURM_NNODES - number of nodes allocated to the job
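The variables listed above can be inspected with a short script such as the sketch below. The file name `show-slurm-env.sh` is hypothetical; the script prints "unset" for any variable that is not defined, which is what you will see when running it outside a Slurm job.

```shell
cat > show-slurm-env.sh <<'EOF'
#!/bin/bash
# Print a few SLURM_* variables, one per line.
for var in SLURM_JOB_ID SLURM_SUBMIT_DIR SLURM_SUBMIT_HOST \
           SLURM_JOB_NODELIST SLURM_ARRAY_TASK_ID \
           SLURM_JOB_CPUS_PER_NODE SLURM_NNODES; do
    # ${!var} is bash indirect expansion: the value of the variable
    # whose name is stored in $var.
    echo "${var}=${!var:-unset}"
done
EOF

echo "Wrote show-slurm-env.sh"
```

Run it inside a job, for example with `srun bash show-slurm-env.sh`, to see the values Slurm assigned.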
Exceeded step memory limit at some point
If you get the correct outputs, please just ignore this warning message - "slurmstepd: error: Exceeded job memory limit at some point". You can also check job exit state to confirm. For reference there is some explanation in the bug report.