Target audience: Beginners in Environmental Modules and the Queuing-System SLURM
Important info:
Big parts of this script were taken from the documentation of the Hamburg HPC Competence Center (HHCC). Please visit their website for more details on the use of the command line interface or on using shell scripts.
Description:
Environment Modules are a tool for managing environment variables of the shell. Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. Details can be found on the official website.
The workload manager used on the Phoenix-Cluster is SLURM (Simple Linux Utility for Resource Management). SLURM is a widely used open source workload manager for large and small Linux clusters and is controlled via a CLI (Command Line Interface). Details can be found in the official documentation.
Introduction:
The ''module load'' command extends variables containing search paths (e.g. ''PATH'' or ''MANPATH''). The ''module unload'' command is the corresponding inverse operation: it removes entries from search paths. By extending search paths, software is made callable; effectively, software can be provided through Modules. An advantage over defining environment variables directly in the shell is that Modules allow changes of environment variables to be undone. The idea of introducing Modules is to be able to define software environments in a modular way. In the context of HPC, Modules make it easy to switch compilers or libraries, or to choose between different versions of an application software package.
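The effect on a search path can be sketched in plain shell. The path ''/opt/example/1.0/bin'' is made up for illustration, and a real Module adjusts several such variables (''PATH'', ''MANPATH'', library paths, …) at once:

```shell
# Sketch of what `module load`/`module unload` effectively do to a search path.
path="/usr/bin:/bin"                 # value before loading
saved="$path"                        # Modules remember the previous state
path="/opt/example/1.0/bin:$path"    # "load": prepend the tool's bin directory
echo "$path"                         # the tool is now found first
path="$saved"                        # "unload": restore the previous value
echo "$path"
```

This restore step is exactly what plain `export PATH=...` in the shell cannot undo automatically.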
Naming:
Names of Modules have the format ''program/version'', just ''program'', or even a slightly more nested path description. Modules can be loaded (and always be unloaded) without specifying a version. If the version is not specified, the default version will be loaded. The default version is either explicitly defined (and will be marked in the output of ''module avail''), or module will load the version that appears to be the latest one. Because defaults can change, versions should always be given if reproducibility is required.
Dependencies and conflicts:
Modules can have dependencies, i.e. a Module can enforce that the Modules it depends on are loaded before the Module itself can be loaded. Modules can be conflicting, i.e. such Modules must not be loaded at the same time (e.g. two versions of a compiler). A conflicting Module must be unloaded before the Module it conflicts with can be loaded.
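As a sketch, with made-up compiler versions (real Module names differ per cluster), a conflict and its resolution could look like this:

```shell
$ module load gcc/11.3.0         # hypothetical compiler Module
$ module load gcc/12.2.0         # refused: conflicts with the loaded gcc
$ module switch gcc gcc/12.2.0   # unloads 11.3.0 and loads 12.2.0 in one step
```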
Caveats:
The name Modules suggests that Modules can be picked and combined in a modular fashion. For Modules providing application packages this is true (up to the possible dependencies and conflicts described above), i.e. it is possible to choose any combination of application software.
However, today, environments for building software are not modular anymore. In particular, it is no longer guaranteed that a library that was built with one compiler can be used with code generated by a different compiler. Hence, the corresponding Modules cannot be modular either. A popular way to handle this situation is to append compiler information to the version information of library Modules. Firstly, this leads to long names and, secondly, to very many Modules that are hard to keep track of. A more modern way is to build up toolchains with Modules. For example, in such a toolchain only compiler Modules are available at the beginning. Once a compiler Module is loaded, MPI libraries (the next level of tools) become available, and after that all other Modules (that were built with that chain).
Important commands:
Important Module commands are:
Action | Command |
---|---|
list Modules currently loaded | module list |
list available Modules | module avail |
load a Module | module load program[/version] |
unload a Module | module unload program |
switch a Module (e.g. compiler version) | module switch program program/version |
add or remove a directory/path to the Module search path (e.g. an own Module directory) | module [un]use [--append] path |
Self-documentation:
Modules are self-documented:
Action | Command |
---|---|
show the actions of a Module | module display program/version |
short description of [one or] all Modules | module whatis [program/version] |
longer help text on a Module | module help program/version |
help on module itself | module help |
Introduction:
There are three key functions of SLURM described on the SLURM website:
“… First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. …”
SLURM’s default scheduling is based on a FIFO queue, which is typically enhanced with the Multifactor Priority Plugin to achieve a very versatile facility for ordering the queue of jobs waiting to be scheduled. In contrast to other workload managers, SLURM does not use several job queues; instead, cluster nodes in a SLURM configuration can be assigned to multiple partitions by the cluster administrators, which enables the same functionality.
A compute center will seek to configure SLURM in a way that resource utilization and throughput are maximized, waiting times and turnaround times are minimized, and all users are treated fairly.
The basic functionality of SLURM can be divided into three areas:
Job submission and cancellation:
There are three commands for handling job submissions:
sbatch
salloc
srun
SLURM assigns a unique jobid (an integer number) to each job when it is submitted. This jobid is returned at submission time or can be obtained from the ''squeue'' command.
The ''scancel'' command is used to abort a job or job step that is running or waiting for execution.
The ''scontrol'' command is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to control their jobs (e.g. to hold and release a pending job).
The Table below lists basic user activities for job submission and cancellation and the corresponding SLURM commands.
User activities for job submission and cancellation (user supplied information is given in italics)
User activity | SLURM command |
---|---|
Submit a job script for (later) execution | sbatch job-script |
Allocate a set of nodes for interactive use | salloc --nodes=N |
Launch a parallel task (e.g. program, command, or script) within allocated resources by sbatch (i.e. within a job script) or salloc | srun task |
Allocate a set of nodes and launch a parallel task directly | srun --nodes=N task |
Abort a job that is running or waiting for execution | scancel jobid |
Abort all jobs of a user | scancel --user=username or generally scancel --user=$USER |
Put a job on hold (i.e. pause waiting) and release a job from hold (these related commands are rarely used in standard operation) | scontrol hold jobid / scontrol release jobid |
The major command line options that are used for ''sbatch'' and ''salloc'' are listed in the Table below. These options can also be specified for ''srun'', if ''srun'' is not used in the context of nodes previously allocated via ''sbatch'' or ''salloc''.
Major ''sbatch'' and ''salloc'' options
Specification | Option | Comments |
---|---|---|
Number of nodes requested | --nodes=N | |
Number of tasks to invoke on each node | --tasks-per-node=n | Can be used to specify the number of cores to use per node, e.g. to avoid hyper-threading. (If the option is omitted, all cores and hyperthreads are used; hint: using hyperthreads is not always advantageous.) |
Partition | --partition=partitionname | |
Job time limit | --time=time-limit | time-limit may be given as minutes or in hh:mm:ss or d-hh:mm:ss format (d means number of days) |
Output file | --output=out | Location of stdout redirection |
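The time-limit formats are easy to misread, so here is a small bash sketch (the helper name ''to_minutes'' is made up) that converts any of the accepted forms into total minutes:

```shell
# Convert a --time limit (minutes, [hh:]mm:ss-style, or d-hh:mm:ss)
# into total minutes; a sketch without input validation.
to_minutes() {
  t=$1; d=0
  case $t in *-*) d=${t%%-*}; t=${t#*-};; esac   # split off the day count
  case $t in
    *:*) h=${t%%:*}; rest=${t#*:}; m=${rest%%:*};;  # hh:mm[:ss]
    *)   h=0; m=$t;;                                 # bare minutes
  esac
  echo $(( d*24*60 + 10#$h*60 + 10#$m ))  # 10# avoids octal from leading zeros
}
to_minutes 00:10:00    # -> 10
to_minutes 7-00:00:00  # -> 10080
```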
For the ''sbatch'' command these options may also be specified directly in the job script using a pseudo-comment directive starting with ''#SBATCH'' as a prefix. The directives must precede any executable command in the batch script:
<code bash>
#!/bin/bash
#SBATCH --partition=std
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --time=00:10:00
...
srun ./helloParallelWorld
</code>
A complete list of parameters can be retrieved from the man
pages for sbatch
, salloc
, or srun
, e.g. via
''man sbatch''
Monitoring job and system information:
There are four commands for monitoring job and system information:
sinfo
squeue
sstat
scontrol
In the output of ''squeue'', the TIME column shows for running jobs their execution time so far (or 0:00 for pending jobs). The NODELIST (REASON) column shows either on which nodes a job is running or why the job is pending. A job is pending for two main reasons: either the requested resources are not yet available (Resources), or there are other jobs with a higher priority pending in the queue (Priority). The ''squeue'' command is the main way to monitor a job and can e.g. also be used to get the information about the expected starting time of a job (see Table below).
The Table below lists basic user activities for job and system monitoring and the corresponding SLURM commands.
User activity | SLURM command |
---|---|
View information about currently available nodes and partitions. The state of a partition may be UP, DOWN, or INACTIVE. If the state is INACTIVE, no new submissions are allowed to the partition. | sinfo [--partition=partitionname] |
View summary about currently available nodes and partitions. The NODES (A/I/O/T) column contains the corresponding number of nodes being allocated, idle, in some other state, and the total of the three numbers. | sinfo -s |
Check the state of all jobs. | squeue |
Check the state of all own jobs. | squeue --user=$USER |
Check the state of a single job. | squeue -j jobid |
Check the expected starting time of a pending job. | squeue --start -j jobid |
Display status information of a running job (e.g. average CPU time, average Virtual Memory (VM) usage; see sstat --helpformat and man sstat for information on more options). | sstat --format=AveCPU,AveVMSize -j jobid |
View SLURM configuration information for a partition (e.g. associated nodes). | scontrol show partition partitionname |
View SLURM configuration information for a cluster node. | scontrol show node nodename |
View detailed job information. | scontrol show job jobid |
Retrieving accounting information:
There are two commands for retrieving accounting information:
sacct
sacctmgr
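For illustration, two typical ''sacct'' calls (the jobid and date are made up; ''-j'', ''--format'', ''--user'', and ''--starttime'' are standard sacct options):

```shell
$ sacct -j 1256 --format=JobID,JobName,Elapsed,State   # one finished job
$ sacct --user=$USER --starttime=2020-01-01            # all own jobs since a date
```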
In SLURM, the different queues of jobs are called partitions. Each partition has a certain behavior, for example how many nodes can be used at minimum and/or maximum, or with which priority free nodes are allocated.
Which partitions a user can access is decided by the user council (Nutzerrat). This ensures that certain users (or user groups) can be prioritized and that important jobs are not slowed down.
<code bash>
#!/bin/bash
# Job name:
#SBATCH --job-name=SLinpack
# Wall clock limit:
#SBATCH --time=1:00:00
# Number of tasks (cores):
#SBATCH --ntasks=4
#SBATCH --exclusive

module add mpi/intelmpi/5.1.2.150
module add intel-studio-2016

mkdir ~/systemtests/hpl/$HOSTNAME
cd ~/systemtests/hpl/$HOSTNAME
cp ${MKLROOT}/benchmarks/mp_linpack/bin_intel/intel64/runme_intel64_prv .
cp ${MKLROOT}/benchmarks/mp_linpack/bin_intel/intel64/xhpl_intel64 .
HPLLOG=~/systemtests/hpl/$HOSTNAME/HPL.log.2015.$(date +%y-%m-%d_%H%M)
mpirun -genv I_MPI_FABRICS shm -np $MPI_PROC_NUM -ppn $MPI_PER_NODE ./runme_intel64_prv "$@" | tee -a $HPLLOG
</code>
sbatch --job-name=$jobname -N <num_nodes> --ntasks-per-node=<ppn> Jobscript
A start time can be specified with the ''--begin'' switch, for example:
--begin=16:00
--begin=now+1hour
--begin=now+60 (seconds by default)
--begin=2010-01-20T12:34:00
Further information is also provided by ''man sbatch''. All parameters used there can also be specified in the job script itself with ''#SBATCH''.
scancel <jobid>
The required job ID can be determined with the ''squeue'' command.
Delete all jobs of a user:
scancel -u <username>
squeue
scontrol update nodename=node[005-008] state=drain reason="RMA"
would exclude these nodes from the pool of available nodes, so that no further jobs can be submitted to them and the nodes can be used for testing/repair.
scontrol update nodename=node[005-008] state=idle
would reset this, and may also be necessary if node crashes caused a node to be excluded from the batch system.
sbatch beispiel.job
Submitted batch job 1256
beispiel.job
<code bash>
#!/bin/bash -l
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --time=7-00:00:00
#SBATCH --job-name=towhee
#SBATCH --ntasks-per-node=1
cd ~/data_dir
sleep 168h
</code>
''squeue -l'' shows on which node the job is running:
1256 standard towhee raskrato RUNNING 0:04 7-00:00:00 1 node282
Log in to node282 with ssh.
Then open a shell with ''screen'' that persists when you log out.
Start the program in this shell.
Leave the shell with
Ctrl-a d
(it keeps running in the background). Open as many further shells with ''screen'' as needed. Use ''screen -r'' to list the shells (if there is only one, you are immediately reattached). ''screen -r shellnumber'' reattaches to a shell running in the background. To end a shell, press Ctrl-c and type ''exit''.
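The workflow above as a session sketch (the node name and program name are examples):

```shell
$ ssh node282            # log in to the node the job runs on
$ screen                 # open a detachable shell
$ ./my_program           # start the program (name hypothetical)
# detach with Ctrl-a d; the program keeps running
$ screen -ls             # list detached shells
$ screen -r shellnumber  # reattach to a specific shell
```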