Using Slurm
The cluster uses Slurm as its job scheduler. Slurm is an open-source job scheduler used by many supercomputers around the world. Below are some basic commands to get started. Man pages are available for each command and provide more information. The Slurm manual can also be found online.
- sacct is used to report job or job step accounting information about active or completed jobs.
- salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
- sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
- sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
- sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
- scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
- scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
- sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
- smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
- squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
- srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
- strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
- sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
Common Usage Examples
Run a single command on a single compute node
$ salloc --account=<AccountName> -N 1 srun echo 'Hello World!'
Running Interactive Jobs
Don't run jobs on the head node! Start an interactive session instead. If you are testing new software or testing command line arguments to make sure your jobs are correctly defined, run an interactive shell on a compute node. The srun command to run an interactive job on DASH is as follows.
$ srun -p exec --pty --mem=500 --account=<AccountName> /bin/bash
This will run a bash session on a compute node chosen by the scheduler, with which you can interact. If you need more than 500 megabytes of memory for your interactive job, change the value after the --mem argument.
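The same pattern extends to CPUs. The sketch below only prints a candidate command so it can be checked first; the partition (exec) matches the example above, while the 4000 MB and 4-CPU values are illustrative, and --cpus-per-task is a standard srun option:

```shell
# Sketch only: print an interactive-session command asking for more
# memory and CPUs. 4000 MB and 4 CPUs are illustrative values; the
# <AccountName> placeholder is kept as-is. Remove the echo to run it.
cmd='srun -p exec --pty --mem=4000 --cpus-per-task=4 --account=<AccountName> /bin/bash'
echo "$cmd"
```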
Submit a batch job
To execute more complex sets of commands with more complex compute requirements, it is generally easiest to use a batch file that contains the compute requirements for the job and a list of commands to be run on the requested node(s). For example, to submit "Hello World" as a batch job, first create a slurm batch file, called echo.slurm:
#!/bin/bash
#SBATCH -J echo.slurm
#SBATCH --get-user-env
#SBATCH --time=00:00:30
#SBATCH --mem=500
#SBATCH --account=<AccountName>
srun echo.bash
The slurm batch file then calls echo.bash:
#!/bin/bash
echo "Starting Test ($SLURM_JOB_NAME) on host $(hostname)"
sleep 10
echo "Hello world ($(hostname))"
sleep 10
echo "Test done ($(hostname))"
Finally, to run the job, type at the unix prompt:
$ sbatch --account=<AccountName> ./echo.slurm
To check that the job is running, use the squeue command. The echo.slurm job (#562) is on the fourth line:
$ squeue
 JOBID PARTITION     NAME     USER ST        TIME  NODES NODELIST(REASON)
   120    lowmem     bash    bem28  R 14-23:48:48      1 dash1-exec-1
   144    lowmem     bash    bem28  R 14-19:37:02      1 dash1-exec-1
   278    lowmem     bash    bem28  R  9-02:07:14      1 dash1-exec-1
   562    lowmem echo.slu    ter18  R        0:09      1 dash1-exec-1
To check the output, you can "cat" the slurm-[job number].out file:
$ cat slurm-562.out
Starting Test (echo.slurm) on host dash1-exec-1
Hello world (dash1-exec-1)
Test done (dash1-exec-1)
To run the script on four nodes in parallel, use the -N option of sbatch:
$ sbatch --account=<AccountName> -N 4 ./echo.slurm
Submitted batch job 563
$ cat slurm-563.out
Starting Test (echo.slurm) on host dash1-exec-1
Starting Test (echo.slurm) on host dash1-exec-2
Starting Test (echo.slurm) on host dash1-exec-4
Starting Test (echo.slurm) on host dash1-exec-3
Hello world (dash1-exec-1)
Hello world (dash1-exec-2)
Hello world (dash1-exec-4)
Hello world (dash1-exec-3)
Test done (dash1-exec-1)
Test done (dash1-exec-2)
Test done (dash1-exec-3)
Test done (dash1-exec-4)
To run several independent jobs in parallel on single cores, submit multiple sbatch commands (here, stdout and stderr are piped into well-named local log files):
#!/bin/bash
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.533.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.533 -d cap -s 30000001 -e 35000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.534.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.534 -d cap -s 35000001 -e 40000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.535.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.535 -d cap -s 40000001 -e 45000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.536.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.536 -d cap -s 45000001 -e 50000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.537.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.537 -d cap -s 50000001 -e 55000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.538.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.538 -d cap -s 55000001 -e 60000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.539.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.539 -d cap -s 60000001 -e 65000000 -c 18 -f CAP
sbatch --account=<AccountName> -o /home/bem28/meta_eqtl/scripts/out/CAP_18.540.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.540 -d cap -s 65000001 -e 70000000 -c 18 -f CAP
To launch all of these jobs, type:
$ sh batch_impute2.sh
Submitted batch job 563
Submitted batch job 564
Submitted batch job 565
Submitted batch job 566
Submitted batch job 567
Submitted batch job 568
Submitted batch job 569
Submitted batch job 570
Each job can then be checked individually in the queue (squeue) or in its log file.
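The eight near-identical sbatch lines above can also be generated with a loop. This sketch is a dry run: it only prints the commands so they can be inspected first; remove the echo to actually submit. Paths, the <AccountName> placeholder, and coordinate ranges are taken from the example above:

```shell
#!/bin/bash
# Sketch: generate the repetitive sbatch commands with a loop instead of
# copying lines by hand. This only prints them (dry run); remove the
# leading "echo" to actually submit.
gen_cmds() {
    local start=30000001 step=5000000 i end
    for i in $(seq 533 540); do
        end=$((start + step - 1))
        echo sbatch --account="<AccountName>" \
            -o /home/bem28/meta_eqtl/scripts/out/CAP_18.${i}.log \
            /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh \
            -j i2.18.${i} -d cap -s ${start} -e ${end} -c 18 -f CAP
        start=$((start + step))
    done
}
gen_cmds
```

Piping the printed commands through sh (or removing the echo) performs the actual submission.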
Submitting a high number of jobs
Submitting a high number of jobs in short succession (such as through a loop in a shell script) can considerably strain the system: it slows down the scheduler, floods the job queue, and causes excessive disk I/O contention, all to the detriment of other users of the cluster. We have therefore limited the number of jobs you can have in the system to 500. If you need more, please consider job arrays, which are well suited to preventing this kind of resource monopolization.
Check your runs
$ squeue --steps
Check the memory usage of a completed job
$ sacct -o maxvmsize -j JobID
where JobID is the JobID of your completed job. Each task in your job will consume a different amount of memory, so take the largest value.
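A small pipeline can pick out that largest value automatically. In this sketch the sacct output is stubbed in with made-up numbers, and the K-suffixed kilobyte format is an assumption about your Slurm version's default units; in practice the stub would be replaced by the real `sacct -n -o maxvmsize -j JobID` call:

```shell
# Sketch: find the largest MaxVMSize across a job's tasks.
# sacct_output stands in for `sacct -n -o maxvmsize -j JobID` output;
# the K (kilobyte) suffix and the values are assumptions for illustration.
sacct_output='104856K
512344K
98304K'
echo "$sacct_output" | tr -d 'K ' | sort -n | tail -1
```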
Check the exit code for a job
$ sacct -o exit -j JobID
In the result, the number to the left of the colon is the exit code from your application. An exit code of 0 indicates successful completion; any other value indicates some type of error. When the left number is nonzero, the number to the right of the colon may indicate that the scheduler killed the application because it consumed too many resources. A right value of 9 indicates that slurm had to kill the process using SIGKILL, a signal that terminates the process immediately.
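The two halves of that value can be split with plain shell parameter expansion. The "137:9" value below is a made-up example standing in for the output of the sacct command above:

```shell
# Sketch: split a sacct ExitCode value ("exitcode:signal") into its parts.
# The exitcode variable stands in for `sacct -n -o exit -j JobID` output;
# "137:9" is a hypothetical example value.
exitcode='137:9'
app=${exitcode%%:*}   # left of the colon: application exit code
sig=${exitcode##*:}   # right of the colon: terminating signal
if [ "$app" -eq 0 ] && [ "$sig" -eq 0 ]; then
    echo "job completed successfully"
else
    echo "job failed: exit code $app, signal $sig"
fi
```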
SLURM job arrays
If you have a large number of similar jobs that use the same amount of memory and cpus, it's much more efficient to submit them as a job array. Here is a simple job array template that you can modify for your own use. Put the command lines you want executed, one per line, in the simulation_runs.txt file. You can get the number of lines in your simulation_runs.txt file by typing:
wc -l simulation_runs.txt
Then, assuming the previous command reported 50 lines in the file, submit the job with
sbatch --account=<AccountName> -a 1-50%20 batch_jobs.sh
The %20 syntax limits the number of simultaneously running tasks to 20 so your job doesn't prevent others from using the cluster. This is one of the key mechanisms for running a very large number of jobs without unduly monopolizing the cluster's resources.
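The array range can also be derived from the task file instead of counting lines by hand. This sketch creates a three-line stand-in file and only prints the sbatch command (remove the echo to submit for real; batch_jobs.sh is the template script from this section):

```shell
# Sketch: derive the -a range from the task file automatically.
# A stand-in simulation_runs.txt is created here for illustration;
# the echo makes this a dry run.
printf '%s\n' 'run one' 'run two' 'run three' > simulation_runs.txt
n=$(($(wc -l < simulation_runs.txt)))   # arithmetic strips any padding from wc
echo sbatch --account="<AccountName>" -a "1-${n}%20" batch_jobs.sh
```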
The slurm job array template:
#!/bin/bash
#SBATCH -J jobname            # A single job name for the array
#SBATCH -n 1                  # Number of cores
#SBATCH -N 1                  # All cores on one machine
#SBATCH --mem=16GB
#SBATCH -o jobname%A_%a.out   # Standard output
#SBATCH -e jobname%A_%a.err   # Standard error (put in same file for this set of jobs)
#SBATCH --account=<AccountName>

command=$(head -${SLURM_ARRAY_TASK_ID} /data/somelab/simulation_runs.txt | tail -1)
srun $command
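The head | tail pair in the template picks out line number $SLURM_ARRAY_TASK_ID of the task file. The same mechanism can be tried outside Slurm by setting the variable by hand; the stand-in file below replaces /data/somelab/simulation_runs.txt:

```shell
# Sketch: how the template selects its command line. Inside a real array
# job, Slurm sets SLURM_ARRAY_TASK_ID automatically; here it is set by
# hand against a stand-in task file.
printf '%s\n' 'cmd one' 'cmd two' 'cmd three' > simulation_runs.txt
SLURM_ARRAY_TASK_ID=2
command=$(head -${SLURM_ARRAY_TASK_ID} simulation_runs.txt | tail -1)
echo "$command"
```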
Playing nice when submitting large numbers of jobs
If you are going to be submitting a large number of jobs, you should submit them with a higher nice value, which lowers the job's priority while it waits for resources to become available. The default nice value is 100. In this example we submit a job with a nice value of 1000, so other jobs pending execution will run before this one.
$ sbatch --account=<AccountName> --nice=1000 -o /home/bem28/meta_eqtl/scripts/out/CAP_18.533.log /home/bem28/meta_eqtl/scripts/impute2_wrapper.sh -j i2.18.533 -d cap -s 30000001 -e 35000000 -c 18 -f CAP
If you have already submitted a job and would like to change its nice value, you can use the scontrol command to do this.
$ scontrol update JobId=570 Nice=1000