Using Slurm

The cluster uses Slurm as its job scheduler. Slurm is an open-source job scheduler used on many supercomputers around the world. Below are some basic commands to get started; man pages are available for each command and provide more information. The SLURM manual can also be found online.

  • sacct is used to report job or job step accounting information about active or completed jobs.
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
  • sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or to initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including minimum and maximum node count, processor count, specific nodes to use or avoid, and specific node characteristics (such as minimum memory, disk space, or required features). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
  • sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
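
As a quick orientation, a few of these commands in everyday use might look like the following (replace <NetID> with your user name; 562 is a placeholder job ID):

$ sinfo                 # partition and node states
$ squeue -u <NetID>     # only your own jobs
$ sacct -j 562          # accounting information for job 562
$ scancel 562           # cancel job 562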

Common Usage Examples

Run a single command on a single compute node

$ salloc --account=<AccountName> -N 1 srun echo 'Hello World!'
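
The same pattern extends to larger allocations; for example, a two-node allocation in which srun runs one copy of hostname per allocated node (a minimal sketch):

$ salloc --account=<AccountName> -N 2 srun hostname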

Running Interactive Jobs

Don't run jobs on the head node! Start an interactive session instead. If you are testing new software or testing command line arguments to make sure your jobs are correctly defined, run an interactive shell on a compute node. The srun command to run an interactive job on DASH is as follows.

$ srun -p exec --pty --mem=500 --account=<AccountName> /bin/bash

This will run an interactive bash session on a compute node chosen by the scheduler. If you need more than 500 megabytes of memory for your interactive job, change the value after the --mem argument.
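
If your interactive work needs more than extra memory alone, the usual srun resource flags apply. For example, a hypothetical session requesting 4 GB of memory, 4 CPUs, and a 2-hour time limit:

$ srun -p exec --pty --mem=4G --cpus-per-task=4 --time=02:00:00 --account=<AccountName> /bin/bash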

Submit a batch job

To execute more complex sets of commands with more complex compute requirements, it is generally easiest to use a batch file that contains the compute requirements for the job and a list of commands to be run on the requested node(s). For example, to submit "Hello World" as a batch job, first create a Slurm batch file called echo.slurm:

echo.slurm
#!/bin/bash
#SBATCH -J echo.slurm
#SBATCH --get-user-env
#SBATCH --time=00:00:30
#SBATCH --mem=500
#SBATCH --account=<AccountName>

srun ./echo.bash

The Slurm batch file then calls echo.bash (which must be executable; chmod +x echo.bash):

echo.bash
#!/bin/bash
echo "Starting Test ("$SLURM_JOB_NAME") on host "`hostname`;
sleep 10
echo "Hello world ("`hostname`")"
sleep 10
echo "Test done ("`hostname`")"

Finally, to submit the job, type the following at the Unix prompt:

$ sbatch --account=<AccountName> ./echo.slurm

To check that the job is running, use the squeue command. The echo.slurm job (#562) is on the fourth line of the output:

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    120    lowmem     bash    bem28   R 14-23:48:48      1 dash1-exec-1
    144    lowmem     bash    bem28   R 14-19:37:02      1 dash1-exec-1
    278    lowmem     bash    bem28   R 9-02:07:14       1 dash1-exec-1
    562    lowmem echo.slu    ter18   R       0:09       1 dash1-exec-1

To check the output, you can "cat" the slurm-[job number].out file:

$ cat slurm-562.out
Starting Test (echo.slurm) on host dash1-exec-1
Hello world (dash1-exec-1)
Test done (dash1-exec-1)

To run the script on four nodes in parallel, use the -N option of sbatch (srun then launches one copy of echo.bash on each node):

$ sbatch --account=<AccountName> -N 4 ./echo.slurm
Submitted batch job 563

$ cat slurm-563.out
Starting Test (echo.slurm) on host dash1-exec-1
Starting Test (echo.slurm) on host dash1-exec-2
Starting Test (echo.slurm) on host dash1-exec-4
Starting Test (echo.slurm) on host dash1-exec-3
Hello world (dash1-exec-1)
Hello world (dash1-exec-2)
Hello world (dash1-exec-4)
Hello world (dash1-exec-3)
Test done (dash1-exec-1)
Test done (dash1-exec-2) 
Test done (dash1-exec-3)
Test done (dash1-exec-4)

To run many independent jobs in parallel on single cores, submit a series of sbatch commands (here, the stdout and stderr of each job are written to local, well-named log files via the -o option):

batch_impute2.sh
#!/bin/bash
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.533.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.533 -d cap -s 30000001 -e 35000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.534.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.534 -d cap -s 35000001 -e 40000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.535.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.535 -d cap -s 40000001 -e 45000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.536.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.536 -d cap -s 45000001 -e 50000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.537.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.537 -d cap -s 50000001 -e 55000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.538.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.538 -d cap -s 55000001 -e 60000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.539.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.539 -d cap -s 60000001 -e 65000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.540.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.540 -d cap -s 65000001 -e 70000000 -c 18 -f CAP

To launch all of these jobs, type:

$ sh batch_impute2.sh
Submitted batch job 563
Submitted batch job 564
Submitted batch job 565
Submitted batch job 566
Submitted batch job 567
Submitted batch job 568
Submitted batch job 569
Submitted batch job 570

Then each job can be checked individually in the queue (squeue) or in its log file.
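
Rather than writing each sbatch line by hand, a shell loop can generate the same submissions. Below is a minimal sketch that reproduces the eight 5 Mb windows above; the file name and loop variables (i, start, end) are illustrative:

batch_impute2_loop.sh
#!/bin/bash
# Submit one impute2 job per 5 Mb window on chromosome 18 (windows 533-540),
# matching the hand-written sbatch lines above.
start=30000001
for i in $(seq 533 540); do
    end=$((start + 4999999))
    sbatch --account=<AccountName> \
        -o /home/bem28/meta_eqtl/scripts/out/CAP_18.${i}.log \
        /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh \
        -j i2.18.${i} -d cap -s ${start} -e ${end} -c 18 -f CAP
    start=$((start + 5000000))
done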

Submitting a large number of jobs

Submitting a large number of jobs in short succession (for example through a loop in a shell script) can considerably strain the system in multiple ways: it slows down the scheduler, floods the job queue, and causes excessive disk I/O contention, all to the detriment of other users of the cluster. We have therefore limited the number of jobs you can have in the system to 500. If you need to run more jobs, please consider job arrays, which are well suited to preventing resource monopolization by your job submission.

Check your runs

$ squeue --steps
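
squeue can also be filtered; for example, to show only your own jobs or only jobs in a given partition (replace <NetID> with your user name):

$ squeue -u <NetID>
$ squeue -p lowmem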

Check the memory usage of a completed job

$ sacct -o maxvmsize -j JobID

where JobID is the ID of your completed job. Each task in your job may consume a different amount of memory, so take the largest value reported.
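
It is often useful to request a few more columns at once, for example the job name and the peak resident and virtual memory per task (field names as listed in the sacct man page):

$ sacct -o JobID,JobName,MaxRSS,MaxVMSize -j JobID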

Check the exit code for a job

$ sacct -o ExitCode -j JobID

In the result, the number to the left of the colon is the exit code returned by your application. An exit code of 0 indicates successful completion; any other value indicates some type of error. The number to the right of the colon is the signal that terminated the job, which typically indicates that the scheduler killed the application because it consumed too many resources. A right-hand value of 9 means that Slurm killed the process with SIGKILL, a signal that terminates the process immediately.
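
Requesting the job state alongside the exit code makes the result easier to interpret, for example:

$ sacct -o JobID,State,ExitCode -j JobID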

SLURM job arrays

If you have a large number of similar jobs that use the same amount of memory and CPUs, it's much more efficient to submit them as a job array. Here is a simple job array template that you can modify for your own use. Put each command you want executed on its own line in the simulation_runs.txt file. You can count the number of lines in your simulation_runs.txt file by typing:

wc -l simulation_runs.txt

Then, assuming the previous command reported 50 lines in the file, submit the job with:

sbatch --account=<AccountName> -a 1-50%20 batch_jobs.sh

The %20 syntax limits the number of simultaneously running array tasks to 20 so your job doesn't prevent others from using the cluster. This is one of the key mechanisms for running a very large number of jobs without unduly monopolizing the cluster's resources.

The Slurm job array template:

#!/bin/bash
#SBATCH -J jobname # A single job name for the array
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # All cores on one machine
#SBATCH --mem=16GB
#SBATCH -o jobname%A_%a.out # Standard output
#SBATCH -e jobname%A_%a.err # Standard error
#SBATCH --account=<AccountName>

command=$(head -${SLURM_ARRAY_TASK_ID} /data/somelab/simulation_runs.txt | tail -1)
srun $command
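
Each array task runs one line of simulation_runs.txt. To preview which command a given task index would execute (here task 3, with the same file path as in the template), you can reproduce the head/tail extraction by hand:

$ head -3 /data/somelab/simulation_runs.txt | tail -1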

Playing nice when submitting large numbers of jobs

If you are going to be submitting a large number of jobs, you should submit them with a higher nice value, which lowers the job's priority while it waits for resources to become available. The default nice value is 100. In this example we submit a job with a nice value of 1000, so other pending jobs will run before this one.

nice
$ sbatch --account=<AccountName> --nice=1000 -o /home/bem28/meta_eqtl/scripts/out/CAP_18.533.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.533 -d cap -s 30000001 -e 35000000 -c 18 -f CAP

If you have already submitted a job and would like to change its nice value, you can use the scontrol command to do this.

scontrol
$ scontrol update JobId=570 Nice=1000
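
You can confirm the change with scontrol show job, which lists the job's Nice value among its attributes:

$ scontrol show job 570 | grep -o 'Nice=[0-9]*'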