Introduction to SLURM and Job Submission¶
Now that you are connected to the cluster, it's time to learn about SLURM (Simple Linux Utility for Resource Management), which is the workload manager used on the cluster. SLURM handles job scheduling, resource allocation, and job monitoring.
Login vs. Compute Nodes¶
In general, cluster users are expected to submit most computations to the job scheduler to be run on the dedicated compute nodes. The login nodes are meant for tasks like editing source/command files and running short test programs that do not use much memory or time and only need one or two CPUs. Some rough guidelines: the test will run for less than five minutes, will use less than 5 GB of memory, and will not use more than two CPUs.
Specifying the computing resources¶
When you submit a job to Slurm, you must specify the resources needed for that job. The following options are used to specify the resources per physical computer (node).
--partition Which group of machines (nodes) the ones for your job should be selected from.
--nodes How many physical machines should be used for the job.
--ntasks-per-node How many copies of the program should be started on each node when using mpirun or srun.
--cpus-per-task How many CPUs (processors, cores) each task should have.
--mem Memory available to all tasks/processes on each node.
--mem-per-cpu How much memory per CPU. This option should generally be avoided because specifying --mem gives the system more flexibility in memory assignment.
--time How long the job may run before Slurm stops it.
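For example, these options can be combined on a single sbatch command line. The sketch below requests one node, one task, 4 CPUs for that task, 8 GB of memory, and a one-hour limit on the general partition used in the examples on this page; the script name my_script.sh is a placeholder for your own job script.
# my_script.sh is a placeholder for your own job script
$ sbatch --partition=general --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --mem=8G --time=1:00:00 my_script.sh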
Tasks versus CPUs per task
Clusters are often used to run software that can communicate among many nodes during a calculation. That is done by starting many copies of the same program, and each copy can then determine what it should do. Each of those copies is called a task. The --ntasks-per-node option should only be used if you know your software uses MPI.
More software can use multiple CPUs on the same node than can use multiple nodes. For such software, a single copy of the program is started. To specify that it can access multiple CPUs, you use --cpus-per-task. For most software, you will want to specify one node with one task and then request --cpus-per-task larger than 1.
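As an illustration (the node, task, and CPU counts here are arbitrary examples), the two styles of request look like this in a batch script:
# MPI software: several tasks per node, one CPU per task
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
# Multithreaded software on a single node: one task with several CPUs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8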
Submitting a Job with sbatch¶
You can submit a job by writing a job script. It's a simple text file that contains both the resource requirements and the commands you want to execute.
Let's create our first job script. (You can use the editor of your choice, e.g. emacs, joe, nano, vim, etc.)
$ nano test_job.sh
We need to have a shebang line at the beginning of the script to specify that the file is a shell script.
#!/bin/sh
Slurm lets you specify options directly in a batch script through lines called Slurm “directives.” These directives provide job setup information used by Slurm, including resource requests, email options, and more. This information is then followed by the commands to be executed to do the computational work of your job.
Slurm directives must precede the executable section in your script.
# Run on the general partition
#SBATCH --partition=general
# Request one node
#SBATCH --nodes=1
# Request one task
#SBATCH --ntasks=1
# Request 4GB of RAM
#SBATCH --mem=4G
# Run for a maximum of 5 minutes
#SBATCH --time=5:00
# Name of the job
#SBATCH --job-name=testjob
# Name the output file
#SBATCH --output=%x_%j.out
# Specify when Slurm should send you e-mail. You may choose from
# BEGIN, END, FAIL to receive mail, or NONE to skip mail entirely.
#SBATCH --mail-type=NONE
Below the job script’s directives is the section of code that Slurm will execute. This section is equivalent to running a Bash script in the command line – it’ll go through and sequentially run each command that you include. When there are no more commands to run, the job will stop.
For example, these commands go to jshmoe's home directory and execute a python program.
# go to jshmoe's home directory
cd /gpfs1/home/j/s/jshmoe
# in that directory, run test.py
python test.py
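Putting the pieces together, the complete test_job.sh looks like this (the home directory path belongs to the example user jshmoe; substitute your own):
#!/bin/sh
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=5:00
#SBATCH --job-name=testjob
#SBATCH --output=%x_%j.out
#SBATCH --mail-type=NONE
# go to jshmoe's home directory
cd /gpfs1/home/j/s/jshmoe
# in that directory, run test.py
python test.py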
When you are done editing your file, save and exit.
To submit the job we use the sbatch command.
$ sbatch test_job.sh
Submitted batch job 123456
Your job will be submitted and run once the requested resources are available. Jobs that request fewer resources will generally start sooner: although jobs submitted before yours are further ahead in the queue, the Slurm scheduler looks for jobs that can fit into the gaps between larger jobs, as long as they do not delay them. This means that being conservative in your resource requests will result in your jobs running sooner.
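When the job finishes, its output is written to the file named by the --output directive. With --output=%x_%j.out, %x expands to the job name and %j to the job ID, so the output of the example above could be viewed with something like the following (the job ID will differ for your job):
$ cat testjob_123456.out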
Running an Interactive Job with srun¶
In addition to batch jobs, you can run interactive jobs on the cluster using SLURM. An interactive job gives you direct access to a compute node, allowing you to run commands interactively as if you were logged into that node. This is useful for tasks like debugging and testing code.
To start an interactive session, use the srun command. Here's an example:
$ srun --partition=general --nodes=1 --ntasks=1 --mem=4G --time=30:00 --pty /bin/bash
In this command, the options are the same as they would be in a job script, except for the --pty option, which tells Slurm that you wish to start /bin/bash as a shell at which you can type commands and have them run on the node assigned to the job. An interactive job is like a login session, but the resources on the computer match those you requested.
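For example, inside the session you might run something like the following (test.py is the same example script used above):
# confirm you are on a compute node rather than the login node
$ hostname
# run the program and watch its output in real time
$ python test.py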
To end the interactive session, simply type:
$ exit
Interactive jobs are ideal for real-time experimentation and testing, complementing the batch job process.
Job Constraints¶
When a job has specific hardware requirements, you can use constraints to select the appropriate nodes. For example, to limit your job to a node with an Infiniband network card, you might use
#SBATCH --constraint=ib
or add --constraint=ib to your srun or salloc command.
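For example, the earlier interactive srun command could be limited to Infiniband nodes like this:
$ srun --partition=general --nodes=1 --ntasks=1 --mem=4G --time=30:00 --constraint=ib --pty /bin/bash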
The constraints in the table below were available when this page was last updated. For the most current available constraints, you can run
$ show_node_constraints
| Constraint | Description |
|---|---|
| intel | Nodes with Intel processors |
| v100 | Nodes with V100 GPUs |
| a100 | Nodes with A100 GPUs |
| h100 | Nodes with H100 GPUs |
| h200 | Nodes with H200 GPUs |
| noib | Nodes without Infiniband |
| ib | Infiniband nodes, all types |
| ib1 | Infiniband nodes, group 1 |
| ib2 | Infiniband nodes, group 2 |
| 10g | 10 Gig Ethernet |
| hc | High clockspeed nodes |
| cascadelake | Nodes with cascadelake generation processors |