
Queuing-System (SLURM) and Jobfiles

Target audience: Beginners in Environment Modules and the Queuing-System SLURM

Important info:

Large parts of this document were taken from the documentation of the Hamburg HPC Competence Center (HHCC). Please visit their website for more details on the Use of the Command Line Interface or about Using Shell Scripts.

Description:

General Information

Environment Modules are a tool for managing environment variables of the shell. Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. Details can be found on the official website.

The workload manager used on the Phoenix-Cluster is SLURM (Simple Linux Utility for Resource Management). SLURM is a widely used open-source workload manager for large and small Linux clusters and is controlled via a CLI (Command Line Interface). Details can be found in the official documentation.

Environment Modules

Introduction:

The module load command extends variables containing search paths (e.g. PATH or MANPATH). The module unload command is the corresponding inverse operation; it removes entries from search paths. By extending search paths, software is made callable; effectively, software can be provided through Modules. An advantage over defining environment variables directly in the shell is that Modules allow changes to environment variables to be undone. The idea of introducing Modules is to be able to define software environments in a modular way. In the context of HPC, Modules make it easy to switch compilers or libraries, or to choose between different versions of an application software package.
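As a minimal illustration (the module name comp/gcc/6.3.0 is only an example taken from the job script further below; available names differ between clusters), loading and unloading a Module changes PATH as follows:

echo $PATH                    # search path before loading
module load comp/gcc/6.3.0    # prepends the compiler's bin directory to PATH
echo $PATH                    # the new entry now appears at the front
module unload comp/gcc/6.3.0  # the change is undone again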

Naming:

Names of Modules have the format program/version, just program, or even a slightly more nested path. Modules can be loaded (and always be unloaded) without specifying a version. If the version is not specified, the default version will be loaded. The default version is either explicitly defined (and will be marked in the output of module avail) or module will load the version that appears to be the latest one. Because defaults can change, versions should always be given if reproducibility is required.
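For example (again using the module name from the job script further below; which version is the default depends on the cluster configuration):

module load comp/gcc          # loads the default version of the compiler Module
module load comp/gcc/6.3.0    # loads an explicit version, recommended for reproducibility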

Dependencies and conflicts:

Modules can have dependencies, i.e. a Module can enforce that other Modules it depends on must be loaded before the Module itself can be loaded. Modules can conflict, i.e. such Modules must not be loaded at the same time (e.g. two versions of a compiler). A conflicting Module must be unloaded before the Module it conflicts with can be loaded.
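A sketch of resolving a compiler conflict (module names and version numbers are placeholders):

module unload comp/gcc/6.3.0            # remove the conflicting version first
module load comp/gcc/7.3.0              # then load the other version
# or, equivalently, in a single step:
module switch comp/gcc comp/gcc/7.3.0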

Caveats:

The name Modules suggests that Modules can be picked and combined in a modular fashion. For Modules providing application packages this is true (up to the possible dependencies and conflicts described above), i.e. it is possible to choose any combination of application software.

However, today, environments for building software are not modular anymore. In particular, it is no longer guaranteed that a library built with one compiler can be used with code generated by a different compiler. Hence, the corresponding Modules cannot be modular either. A popular way to handle this situation is to append compiler information to the version information of library Modules. Firstly, this leads to long names and, secondly, to very many Modules that are hard to keep track of. A more modern way is to build up toolchains with Modules. For example, in such a toolchain only compiler Modules are available at the beginning. Once a compiler Module is loaded, MPI libraries (the next level of tools) become available, and after that all other Modules (that were built with that chain).

Important commands:

Important Module commands are:

List Modules currently loaded: module list
List available Modules: module avail
Load a Module: module load program[/version]
Unload a Module: module unload program
Switch a Module (e.g. compiler version): module switch program program/version
Add or remove a directory to/from the Module search path (e.g. an own Module directory): module [un]use [--append] path

Self-documentation:

Modules are self-documented:

Show the actions of a Module: module display program/version
Short description of [one or] all Modules: module whatis [program/version]
Longer help text on a Module: module help program/version
Help on module itself: module help

Basics of SLURM

Introduction:

There are three key functions of SLURM described on the SLURM website:

“… First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. …”

SLURM’s default scheduling is based on a FIFO queue, which is typically enhanced with the Multifactor Priority Plugin to achieve a very versatile facility for ordering the queue of jobs waiting to be scheduled. In contrast to other workload managers, SLURM does not use several job queues. Instead, cluster nodes in a SLURM configuration can be assigned to multiple partitions by the cluster administrators, which provides the same functionality.

A compute center will seek to configure SLURM in a way that resource utilization and throughput are maximized, waiting times and turnaround times are minimized, and all users are treated fairly.

The basic functionality of SLURM can be divided into three areas:

Job submission and cancellation:

There are three commands for handling job submissions: sbatch, salloc, and srun (see the Table below).

SLURM assigns a unique jobid (integer number) to each job when it is submitted. This jobid is returned at submission time or can be obtained from the squeue command.

The scancel command is used to abort a job or job step that is running or waiting for execution.

The scontrol command is mainly used by cluster administrators to view or modify the configuration of the SLURM system but it also offers the users the possibility to control their jobs (e.g. to hold and release a pending job).

The Table below lists basic user activities for job submission and cancellation and the corresponding SLURM commands.

User activities for job submission and cancellation (user-supplied values such as job-script, N, jobid, and username are placeholders)

Submit a job script for (later) execution: sbatch job-script
Allocate a set of nodes for interactive use: salloc --nodes=N
Launch a parallel task (e.g. program, command, or script) within resources allocated by sbatch (i.e. within a job script) or salloc: srun task
Allocate a set of nodes and launch a parallel task directly: srun --nodes=N task
Abort a job that is running or waiting for execution: scancel jobid
Abort all jobs of a user: scancel --user=username, or generally scancel --user=$USER
Put a job on hold (i.e. pause waiting) or release a job from hold (these related commands are rarely used in standard operation): scontrol hold jobid and scontrol release jobid

The major command line options that are used for sbatch and salloc are listed in the Table below. These options can also be specified for srun, if srun is not used in the context of nodes previously allocated via sbatch or salloc.

Major sbatch and salloc options

Number of nodes requested: --nodes=N
Number of tasks to invoke on each node: --tasks-per-node=n (can be used to specify the number of cores to use per node, e.g. to avoid hyper-threading; if the option is omitted, all cores and hyperthreads are used. Hint: using hyperthreads is not always advantageous.)
Partition: --partition=partitionname
Job time limit: --time=time-limit (time-limit may be given as minutes or in hh:mm:ss or d-hh:mm:ss format, where d means number of days)
Output file: --output=out (location of stdout redirection)
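For example, the options from the Table can also be passed directly on the command line; the following srun call requests the same resources as the job script shown below:

srun --partition=std --nodes=2 --tasks-per-node=16 --time=00:10:00 ./helloParallelWorld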

For the sbatch command these options may also be specified directly in the job script using a pseudo comment directive starting with #SBATCH as a prefix. The directives must precede any executable command in the batch script:

      #!/bin/bash
      #SBATCH --partition=std
      #SBATCH --nodes=2
      #SBATCH --tasks-per-node=16
      #SBATCH --time=00:10:00
      ...
      srun ./helloParallelWorld
      

A complete list of parameters can be retrieved from the man pages for sbatch, salloc, or srun, e.g. via

      man sbatch

Monitoring job and system information:

There are four commands for monitoring job and system information: sinfo, squeue, sstat, and scontrol.

The Table below lists basic user activities for job and system monitoring and the corresponding SLURM commands.

View information about currently available nodes and partitions (the state of a partition may be UP, DOWN, or INACTIVE; if the state is INACTIVE, no new submissions are allowed to that partition): sinfo [--partition=partitionname]
View a summary of currently available nodes and partitions (the NODES(A/I/O/T) column contains the number of nodes that are allocated, idle, in some other state, and the total of these three numbers): sinfo -s
Check the state of all jobs: squeue
Check the state of all own jobs: squeue --user=$USER
Check the state of a single job: squeue -j jobid
Check the expected starting time of a pending job: squeue --start -j jobid
Display status information of a running job, e.g. average CPU time or average Virtual Memory (VM) usage (see sstat --helpformat and man sstat for information on more options): sstat --format=AveCPU,AveVMSize -j jobid
View SLURM configuration information for a partition (e.g. associated nodes): scontrol show partition partitionname
View SLURM configuration information for a cluster node: scontrol show node nodename
View detailed job information: scontrol show job jobid
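A typical monitoring sequence for one's own jobs might look as follows (the jobid 123456 is a placeholder):

squeue --user=$USER                        # list own jobs and their jobids
squeue --start -j 123456                   # expected start time while the job is pending
sstat --format=AveCPU,AveVMSize -j 123456  # resource usage once the job is running
scontrol show job 123456                   # detailed job information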

Retrieving accounting information:

Accounting information is retrieved with the sacct command:

View job account information for a specific job: sacct -j jobid
View all job information from a specific start date (given as yyyy-mm-dd): sacct -S startdate -u $USER
View the execution time of a (completed) job (formatted as days-hh:mm:ss, accumulated over job steps, and without any header): sacct -n -X -P -o Elapsed -j jobid
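For example, a more detailed accounting query can be composed with the --format option (the field names are standard sacct fields; the jobid is a placeholder):

sacct -j 123456 --format=JobID,JobName,Partition,Elapsed,State,ExitCode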

Submitting a batch job:

Below, an example script for a SLURM batch job, in the sense of a hello world program, is given. The job is suited to be run on the Phoenix HPC cluster at the Gauß-IT-Zentrum. For other cluster systems some appropriate adjustments will probably be necessary.

#!/bin/bash
   # Do not forget to select a proper partition if the default
   # one is no fit for the job! You can do that either in the sbatch
   # command line or here with the other settings.
#SBATCH --partition=standard
   # Number of nodes used:
#SBATCH --nodes=2
   # Wall clock limit:
#SBATCH --time=12:00:00
   # Name of the job:
#SBATCH --job-name=nearest
   # Number of tasks (cores) per node: 
#SBATCH --ntasks-per-node=20

   # If needed, set your working environment here.
working_dir=~
cd $working_dir

   # Load environment modules for your application here.
module load comp/gcc/6.3.0
module load mpi/openmpi/2.1.0/gcc

   # Execute the application.
mpiexec -np 40 ./test/mpinearest

The job script file above can be stored e.g. in $HOME/hello_world.sh ($HOME is mapped to the user’s home directory).

The job is submitted to SLURM’s batch queue using the default value for the partition; scontrol show partition (also see above) can be used to show that information:

[exampleusername@node001 14:48:33]~$ sbatch $HOME/hello_world.sh
Submitted batch job 123456

The start time can be selected via --begin, for example:

--begin=16:00
--begin=now+1hour
--begin=now+60 (seconds by default) 
--begin=2010-01-20T12:34:00

More information can be found via man sbatch. All parameters shown there can be included in the job script via #SBATCH.
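As a sketch, a deferred start can be requested directly in the job script (program name and times are placeholders):

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --begin=now+1hour     # do not start the job earlier than one hour from now
srun ./my_program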

The output of sbatch will contain the jobid, like 123456 in this example. During execution the output of the job is written to a file named slurm-123456.out. If there had been errors (i.e. any output to the stderr stream), a corresponding file named slurm-123456.err would have been created.
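The default file names can be overridden with the --output and --error options of sbatch, where the pattern %j is replaced by the jobid (the file names themselves are placeholders):

#SBATCH --output=myjob-%j.out   # stdout of the job
#SBATCH --error=myjob-%j.err    # stderr of the job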

Cancelling a batch job:

scancel <jobid>

The required jobid can be viewed via the general command squeue or the user-specific command squeue -u $USER.

If you want to delete all jobs of a user:

scancel -u <username>

How to change a node status (root only):

scontrol update nodename=node[005-008] state=drain reason="RMA"

This command excludes the nodes from the list of available nodes. It ensures that no new jobs are scheduled on them, allowing them to be used for testing etc.

scontrol update nodename=node[005-008] state=idle

This reverses the previous command and returns the nodes to the list of available nodes. Executing this command might also be necessary if a node crash caused the removal of a node from the batch system.

Interactive jobs (intermediate difficulty):

Method one:

Assume you have submitted a job as follows:

sbatch beispiel.job
Submitted batch job 1256

Let the corresponding jobfile be the following:

beispiel.job

#!/bin/bash -l

#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --time=7-00:00:00
#SBATCH --job-name=towhee
#SBATCH --ntasks-per-node=1

cd ~/data_dir
sleep 168h

In this case, the command squeue -l will show you which node the job is currently running on. For example:

1256 standard towhee raskrato  RUNNING       0:04 7-00:00:00      1 node282

You can then log onto that node via ssh node282 and start a new shell via screen (see the screen documentation for further information). The program can then be started in this new shell.

Once you are done, you can detach from the screen session (the program keeps running) via:

Ctrl-a d
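A sketch of the complete workflow (node name taken from the squeue output above; program and session names are placeholders):

ssh node282                # log onto the node where the job is running
screen -S mysession        # start a new shell inside a named screen session
./my_program               # start the program in this shell
# detach with Ctrl-a d; the program keeps running
screen -r mysession        # reattach to the session later (from the same node)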

Another way to use the allocated nodes is via the salloc command (see method two below).

Method two:

Interactive sessions under control of the batch system can be created via salloc. salloc differs from sbatch in that resources are initially only reserved (i.e. allocated) without executing a job script. Also, the session runs on the node on which salloc was invoked (i.e. not on a compute node, in contrast to submission with sbatch). This is often useful during the interactive development of a parallel program.

A single node is reserved for interactive usage as follows:

[exampleusername@node001 14:48:33]~$ salloc

When the resources are granted by SLURM, salloc will start a new shell on the (login or head) node where salloc was executed. This interactive session is terminated by exiting the shell or by reaching the time limit.

An OpenMP program using N threads, for example, can be started on the allocated node as follows:

[exampleusername@node001 14:48:33]~$ export OMP_NUM_THREADS=N
[exampleusername@node001 14:48:33]~$ srun my-openmp-binary

To start an interactive parallel MPI program N nodes can be allocated as follows:

[exampleusername@node001 14:48:33]~$ salloc --nodes=N

An MPI program using, for example, n=32 processes can be started on the allocated nodes as follows:

[exampleusername@node001 14:48:33]~$ mpirun -np 32 my-mpi-binary

Another way to use the allocated nodes is to use ssh to establish connections to them (see method one above).