====== Queuing-System (SLURM) and Jobfiles ======

**Target audience: Beginners in Environment Modules and the Queuing-System SLURM**

__//Important info://__

Large parts of this script were taken from the documentation of the [[https://www.hhcc.uni-hamburg.de/learning-hpc.html|Hamburg HPC Competence Center (HHCC)]].
Please visit their website for more details on the [[https://www.hhcc.uni-hamburg.de/learning-hpc/getting-started-with-hpc-clusters-b/getting-started-with-hpc-clusters-b-y-selecting-the-software-environment-b.html|Use of the Command Line Interface]] or about [[https://www.hhcc.uni-hamburg.de/learning-hpc/getting-started-with-hpc-clusters-b/getting-started-with-hpc-clusters-b-y-using-shell-scripts-b.html|Using Shell Scripts]].

__//Description://__

  * You will learn how to use Environment Modules, a widely used system for handling different software environments (basic level)
  * You will learn how to use the workload manager SLURM to allocate HPC resources (e.g. CPUs) and to submit a batch job (basic level)
  * You will learn what a simple jobscript looks like and how to submit it (basic to intermediate level)
===== General Information =====

//Environment Modules// are a tool for managing the environment variables of the shell. Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. Details can be found on the [[http://modules.sourceforge.net//|official website]].

The workload manager used on the Phoenix-Cluster is SLURM (Simple Linux Utility for Resource Management). SLURM is a widely used open-source workload manager for large and small Linux clusters and is controlled via a CLI (Command Line Interface). Details can be found in the [[https://slurm.schedmd.com/|official documentation]].

===== Environment Modules =====

__//Introduction://__

The ''module load'' command extends variables containing search paths (e.g. ''PATH'' or ''MANPATH''). The ''module unload'' command is the corresponding inverse operation: it removes entries from search paths. By extending search paths, software is made callable; effectively, software can be provided through Modules. An advantage over defining environment variables directly in the shell is that Modules make it possible to undo these changes again. The idea behind Modules is to be able to define software environments in a modular way. In the context of HPC, Modules make it easy to switch compilers or libraries, or to choose between different versions of an application software package.
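
The effect on the search paths can be inspected directly in the shell. A minimal sketch (the module name ''comp/gcc/6.3.0'' is taken from the jobscript example further below; the exact names and paths on the cluster may differ):

<code>
# show the current search path
echo $PATH

# load a module and check how the search path was extended
module load comp/gcc/6.3.0
echo $PATH

# undo the change again
module unload comp/gcc/6.3.0
echo $PATH
</code>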

__//Naming://__

Names of Modules have the format ''program/version'', just ''program'' or even a slightly more nested path description. Modules can be loaded (and always be unloaded) without specifying a version. If the ''version'' is not specified, the default ''version'' will be loaded. The default ''version'' is either explicitly defined (and will be marked in the output of ''module avail'') or module will load the ''version'' that appears to be the latest one. Because defaults can change, **versions should always be given if reproducibility is required**.
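
For example (again using module names that appear in the jobscript further below; treat them as placeholders for what ''module avail'' shows on the cluster):

<code>
# loads whatever version is currently the default
module load mpi/openmpi

# loads exactly this version -- preferable for reproducible results
module load mpi/openmpi/2.1.0/gcc
</code>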

__//Dependencies and conflicts://__

Modules can have dependencies, i.e. a Module can enforce that other Modules it depends on must be loaded before the Module itself can be loaded.
Modules can also conflict with each other, i.e. these Modules must not be loaded at the same time (e.g. two versions of a compiler). A conflicting Module must be unloaded before the Module it conflicts with can be loaded.
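
A sketch of how such a conflict is typically resolved (the compiler module names are only placeholders):

<code>
# fails if a conflicting gcc version is already loaded
module load comp/gcc/6.3.0

# either unload the conflicting Module first ...
module unload comp/gcc
module load comp/gcc/6.3.0

# ... or do both steps at once
module switch comp/gcc comp/gcc/6.3.0
</code>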

__//Caveats://__

The name Modules suggests that Modules can be picked and combined in a modular fashion. For Modules providing application packages this is true (up to the possible dependencies and conflicts described above), i.e. it is possible to choose any combination of application software.

However, today, environments for building software are not modular anymore. In particular, it is no longer guaranteed that a library that was built with one compiler can be used with code generated by a different compiler. Hence, the corresponding Modules cannot be modular either. A popular way to handle this situation is to append compiler information to the version information of library Modules. Firstly, this leads to long names and secondly, to very many Modules that are hard to keep track of. A more modern way is to build up toolchains with Modules. For example, in such a toolchain only compiler Modules are available at the beginning. Once a compiler Module is loaded, MPI libraries (the next level of tools) become available, and after that all other Modules (that were built with that chain).

__//Important commands://__

Important Module commands are:
|list Modules currently loaded     | ''module list''     |
|list available Modules     | ''module avail''     |
|load a Module     | ''module load program[/version]''     |
|unload a Module     | ''module unload program''     |
|switch a Module (e.g. compiler version)     | ''module switch program program/version''     |
|add or remove a directory/path to the Module search path (e.g. a directory with your own Modules)     | ''module [un]use [--append] path''     |
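
A short example session combining these commands (the directory ''$HOME/my_modules'' is only a hypothetical example of a personal Module directory):

<code>
module avail                         # what can be loaded?
module load comp/gcc/6.3.0           # load a specific compiler version
module list                          # verify which Modules are active
module use --append $HOME/my_modules # also search a personal Module directory
module unload comp/gcc/6.3.0         # clean up again
</code>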

__//Self-documentation://__

Modules are self-documented:
|show the actions of a Module     | ''module display program/version''     |
|short description of [one or] all Modules     | ''module whatis [program/version]''     |
|longer help text on a Module     | ''module help program/version''     |
|help on module itself     | ''module help''     |
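
For example, to find out what loading the compiler Module used further below would actually do (the module name again only serves as an example):

<code>
module whatis comp/gcc/6.3.0    # one-line description of the Module
module display comp/gcc/6.3.0   # environment variables the Module would change
</code>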

===== Basics of SLURM =====

__//Introduction://__

There are three key functions of SLURM described on the SLURM website:

//“… First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. …”//

SLURM’s default scheduling is based on a FIFO-queue, which is typically enhanced with the Multifactor Priority Plugin to achieve a very versatile facility for ordering the queue of jobs waiting to be scheduled. In contrast to other workload managers, SLURM does not use several job queues. Instead, cluster nodes in a SLURM configuration can be assigned to multiple partitions by the cluster administrators. This enables the same functionality.

A compute center will seek to configure SLURM in a way that resource utilization and throughput are maximized, waiting times and turnaround times are minimized, and all users are treated fairly.

The basic functionality of SLURM can be divided into three areas:
  * Job submission and cancellation
  * Monitoring job and system information
  * Retrieving accounting information

__//Job submission and cancellation://__

There are three commands for handling job submissions:
  * ''sbatch''
    * submits a batch job script to SLURM’s job queue for (later) execution. The batch script may be given to sbatch by a file name on the command line or can be read from stdin. Resources needed by the job may be specified via command line options and/or directly in the job script. A job script may contain several job steps to perform several parallel tasks within the same script. Job steps themselves may be run sequentially or in parallel. SLURM regards the script as the first job step.
  * ''salloc''
    * allocates a set of nodes, typically for interactive use. Resources needed may be specified via command line options.
  * ''srun''
    * usually runs a command on nodes previously allocated via sbatch or salloc. Each invocation of srun within a job script corresponds to a job step and launches parallel tasks across the allocated resources. A task is represented e.g. by a program, command, or script. If srun is not invoked within an existing allocation, it will first create a resource allocation (based on its command line options) in which to run the parallel job.

SLURM assigns a unique //jobid// (integer number) to each job when it is submitted. This //jobid// is returned at submission time or can be obtained from the ''squeue'' command.

The ''scancel'' command is used to abort a job or job step that is running or waiting for execution.

The ''scontrol'' command is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to control their jobs (e.g. to hold and release a pending job).

The Table below lists basic user activities for job submission and cancellation and the corresponding SLURM commands.

User activities for job submission and cancellation (user-supplied information is given in //italics//)
^ User activity ^ SLURM command ^
| Submit a job script for (later) execution | ''sbatch'' //job-script// |
| Allocate a set of nodes for interactive use | ''salloc'' --nodes=//N// |
| Launch a parallel task (e.g. program, command, or script) within allocated resources\\ by ''sbatch'' (i.e. within a job script) or ''salloc'' | ''srun'' //task// |
| Allocate a set of nodes and launch a parallel task directly | ''srun'' --nodes=//N// //task// |
| Abort a job that is running or waiting for execution | ''scancel'' //jobid// |
| Abort all jobs of a user | ''scancel'' --user=//username//\\ or generally\\ ''scancel'' --user=$USER |
| Put a job on hold (i.e. pause waiting) and release a job from hold\\ (These related commands are rarely used in standard operation.) | ''scontrol'' hold //jobid//\\ ''scontrol'' release //jobid// |
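
Putting the commands from the Table together, a typical minimal workflow could look like this (the job script name and the jobid are only examples):

<code>
sbatch my_job.sh          # submit the job script; SLURM prints the jobid
squeue --user=$USER       # check whether the job is pending or running
scancel 123456            # abort the job with jobid 123456 if necessary
</code>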

The major command line options that are used for ''sbatch'' and ''salloc'' are listed in the Table below. These options can also be specified for ''srun'', if ''srun'' is not used in the context of nodes previously allocated via ''sbatch'' or ''salloc''.

Major ''sbatch'' and ''salloc'' options
^ Specification ^ Option ^ Comments ^
| Number of nodes requested | --nodes=//N// |  |
| Number of tasks to invoke on each node | --tasks-per-node=//n// | Can be used to specify the number of cores to use per node, e.g. to avoid [[https://en.wikipedia.org/wiki/Hyper-threading|hyper-threading]]. (If the option is omitted, all cores and hyperthreads are used; hint: using hyperthreads is not always advantageous.) |
| Partition | --partition=//partitionname// | SLURM supports multiple partitions instead of several queues |
| Job time limit | --time=//time-limit// | time-limit may be given as minutes or in hh:mm:ss or d-hh:mm:ss format (d means number of days) |
| Output file | --output=//out// | Location of stdout redirection |

For the ''sbatch'' command these options may also be specified directly in the job script using a pseudo comment directive starting with ''#SBATCH'' as a prefix. The directives must precede any executable command in the batch script:

        #!/bin/bash
        #SBATCH --partition=std
        #SBATCH --nodes=2
        #SBATCH --tasks-per-node=16
        #SBATCH --time=00:10:00
        ...
        srun ./helloParallelWorld

A complete list of parameters can be retrieved from the ''man'' pages for ''sbatch'', ''salloc'', or ''srun'', e.g. via

        man sbatch

__//Monitoring job and system information://__

There are four commands for monitoring job and system information:
  * ''sinfo''
    * shows current information about nodes and partitions for a system managed by SLURM. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it essentially shows for each partition whether it is available and how many nodes and which nodes in the partition are allocated or idle (or are possibly in another state like down or drain, i.e. not available for some time). This is useful for the user e.g. to decide in which partition to run a job. The number of allocated and idle nodes indicates the actual utilization of the cluster.
  * ''squeue''
    * shows current information about jobs in the SLURM scheduling queue. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it lists all pending jobs, sorted descending by their priority, followed by all running jobs, sorted descending by their priority. The major job states are:
      * R for Running
      * PD for Pending
      * CD for Completed
      * F for Failed
      * CA for Cancelled
    * The ''TIME'' column shows for running jobs their execution time so far (or 0:00 for pending jobs).
    * The ''NODELIST (REASON)'' column shows either on which nodes a job is running or why the job is pending. A job is pending for one of two main reasons:
      * it is still waiting for resources to become available, shown as (''Resources''),
      * its priority is not yet sufficient for it to be executed, shown as (''Priority''), i.e. there are other jobs with a higher priority pending in the queue.
    * The position of a pending job in the queue indicates how many jobs are executed before and after it. The ''squeue'' command is the main way to monitor a job and can e.g. also be used to get information about the expected starting time of a job (see Table below).
  * ''sstat''
    * is mainly used to display various status information of a running job taken as a snapshot. The information relates to CPU, task, node, Resident Set Size (RSS), virtual memory (VM), etc.
  * ''scontrol''
    * is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to get some information about the cluster configuration (e.g. about partitions, nodes, and jobs).

The Table below lists basic user activities for job and system monitoring and the corresponding SLURM commands.
^ User activity ^ SLURM command ^
| View information about currently available nodes and partitions. The state of a partition may be ''UP'', ''DOWN'', or ''INACTIVE''. If the state is ''INACTIVE'', no new submissions are allowed to the partition. | ''sinfo'' [--partition=//partitionname//] |
| View a summary about currently available nodes and partitions. The ''NODES''(''A/I/O/T'') column contains the corresponding numbers of nodes being allocated, idle, in some other state, and the total of the three numbers. | ''sinfo'' -s |
| Check the state of all jobs. | ''squeue'' |
| Check the state of all own jobs. | ''squeue'' --user=$USER |
| Check the state of a single job. | ''squeue'' -j //jobid// |
| Check the expected starting time of a pending job. | ''squeue'' --start -j //jobid// |
| Display status information of a running job (e.g. average CPU time, average Virtual Memory (VM) usage – see ''sstat'' --helpformat and ''man sstat'' for information on more options). | ''sstat'' --format=AveCPU,AveVMSize -j //jobid// |
| View SLURM configuration information for a partition (e.g. associated nodes). | ''scontrol'' show partition //partitionname// |
| View SLURM configuration information for a cluster node. | ''scontrol'' show node //nodename// |
| View detailed job information. | ''scontrol'' show job //jobid// |
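
For example, before submitting a job and while it is waiting in the queue, one might use (the jobid is only an example):

<code>
sinfo -s                    # how busy are the partitions right now?
squeue --user=$USER         # what are my own jobs doing?
squeue --start -j 123456    # when is job 123456 expected to start?
</code>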

__//Retrieving accounting information://__

There are two commands for retrieving accounting information:
  * ''sacct''
    * shows accounting information for jobs and job steps in the SLURM job accounting log or SLURM database. For active jobs the accounting information is accessed via the job accounting log file. For completed jobs it is accessed via the log data saved in the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways. Columns for jobid, jobname, partition, account, allocated CPUs, state, and exit code are shown by default for each of the user’s jobs eligible after midnight of the current day.
  * ''sacctmgr''
    * is mainly used by cluster administrators to view or modify the SLURM account information, but it also offers users the possibility to get some information about their account. The account information is maintained within the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways.

The Table below lists basic user activities for retrieving accounting information and the corresponding SLURM commands.

^ User Activity ^ SLURM Command ^
| View job account information for a specific job. | ''sacct'' -j //jobid// |
| View all job information from a specific start date (given as yyyy-mm-dd). | ''sacct'' -S //startdate// -u $USER |
| View execution time for a (completed) job (formatted as days-hh:mm:ss, cumulated over job steps, and without any header). | ''sacct'' -n -X -P -o Elapsed -j //jobid// |
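
As a concrete example, the accounting records of all own jobs since the beginning of 2020 could be listed like this (the date and the jobid are only examples):

<code>
sacct -S 2020-01-01 -u $USER          # all own jobs since January 1st, 2020
sacct -n -X -P -o Elapsed -j 123456   # pure run time of job 123456
</code>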

===== Jobscripts =====

__//Submitting a batch job://__

Below, an example script for a SLURM batch job – in the sense of a hello world program – is given. The job is suited to be run on the Phoenix HPC cluster at the Gauß-IT-Zentrum. For other cluster systems some appropriate adjustments will probably be necessary.
  
<code>
#!/bin/bash
   # Do not forget to select a proper partition if the default
   # one is no fit for the job! You can do that either in the sbatch
   # command line or here with the other settings.
#SBATCH --partition=standard
   # Number of nodes used:
#SBATCH --nodes=2
   # Wall clock limit:
#SBATCH --time=12:00:00
   # Name of the job:
#SBATCH --job-name=nearest
   # Number of tasks (cores) per node:
#SBATCH --ntasks-per-node=20

   # If needed, set your working environment here.
working_dir=~
cd $working_dir

   # Load environment modules for your application here.
module load comp/gcc/6.3.0
module load mpi/openmpi/2.1.0/gcc

   # Execute the application.
mpiexec -np 40 ./test/mpinearest
</code>
The job script file above can be stored e.g. in ''$HOME/hello_world.sh'' (''$HOME'' is mapped to the user’s home directory).

The job is submitted to SLURM’s batch queue; if no partition is given, the default partition is used (''scontrol show partitions'' – also see above – can be used to show that information):

<code>
[exampleusername@node001 14:48:33]~$ sbatch $HOME/hello_world.sh
Submitted batch job 123456
</code>
  
The start time can be selected via ''--begin'', for example:

<code>--begin=16:00
--begin=2010-01-20T12:34:00
</code>
  
More information can be found via ''man sbatch''. All parameters shown there can also be included in the jobscript via ''#SBATCH''.

The output of ''sbatch'' will contain the jobid, like 123456 in this example. During execution the output of the job is written to a file named ''slurm-123456.out''.
If there had been errors (i.e. any output to the //stderr// stream), a corresponding file named ''slurm-123456.err'' would have been created.
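
To follow the output of a running job, the output file can simply be inspected with standard shell tools, for example:

<code>
tail -f slurm-123456.out   # follow the output while the job is running
</code>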

__//Cancelling a batch job://__
  
<code>scancel <jobid></code>

The required ID can be viewed via the general command ''squeue'' or the user-specific command ''squeue -u $USER''.

If you want to delete all jobs of a user:

<code>scancel -u <username></code>
  
__//How to change a node status (root only)://__

<code>scontrol update nodename=node[005-008] state=drain reason="RMA"</code>
  
This command excludes the nodes from the list of available nodes. This ensures that no more jobs can be submitted to these nodes, allowing them to be used for testing, repairs, etc.
  
<code>scontrol update nodename=node[005-008] state=idle</code>
  
This reverses the previous command and returns the nodes to the list of available nodes. Executing this command might also be necessary if a node crash caused the removal of a node from the batch system.


===== Interactive jobs (intermediate difficulty) =====
  
__//Method one://__

Assume you have submitted a job as follows:

<code>sbatch beispiel.job
Submitted batch job 1256</code>

Let the corresponding jobfile be the following:
  
<code>
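#!/bin/bash
# Minimal sketch of such a jobfile, consistent with the squeue output
# shown below (partition, job name, node count, time limit); replace the
# last line with the actual application call.
#SBATCH --partition=standard
#SBATCH --job-name=towhee
#SBATCH --nodes=1
#SBATCH --time=7-00:00:00

# ... start your application here ...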
</code>
  
In this case, the command ''squeue -l'' will show you which node the job is currently running on. For example:

<code>1256 standard towhee raskrato  RUNNING       0:04 7-00:00:00      1 node282</code>

You can then log onto that node via ''ssh node282'' and start a new shell via [[https://linuxize.com/post/how-to-use-linux-screen/|screen]] (please follow the link for further information). The program can then be started in this new shell; a compact workflow sketch is given after the list below.
  
You can then detach from the screen shell (it keeps running in the background, e.g. when you log out) via:

<code>
Ctrl+a d
</code>

  * You can start as many shells as you like. The command ''screen -r'' will show a list of all shells (if there is only one, you will instead return to said shell).
  * You can access a shell running in the background via ''screen -r <shellnumber>''.
  * You can quit a shell by pressing the key combination //CTRL+C// (to stop the running program) and typing in ''exit''.
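
A compact sketch of this workflow (the node name is taken from the output above; ''./my-program'' is only a placeholder):

<code>
ssh node282          # log onto the node the job is running on
screen               # open a shell that survives logging out
./my-program         # start the program inside the screen shell
                     # detach with Ctrl+a d, then log out as usual
screen -r            # later: reattach to the running shell
</code>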
  
-Mit+Another way to use the allocated nodes is via the ''​salloc''​ command (see method two below). 

__//Method two://__

Interactive sessions under control of the batch system can be created via ''salloc''. ''salloc'' differs from ''sbatch'' in that resources are initially only reserved (i.e. allocated) without executing a job script. Also, the session runs on the node on which ''salloc'' was invoked (i.e. not on a compute node, in contrast to submission with ''sbatch''). This is often useful during the interactive development of a parallel program.

A single node is reserved for interactive usage as follows:

<code>[exampleusername@node001 14:48:33]~$ salloc</code>

When the resources are granted by SLURM, ''salloc'' will start a new shell on the (login or head) node where ''salloc'' was executed. This interactive session is terminated by exiting the shell or by reaching the time limit.

An OpenMP program using //N// threads, for example, can be started on the allocated node as follows:
  
<code>
[exampleusername@node001 14:48:33]~$ export OMP_NUM_THREADS=N
[exampleusername@node001 14:48:33]~$ srun my-openmp-binary
</code>

To start an interactive parallel MPI program, //N// nodes can be allocated as follows:

<code>[exampleusername@node001 14:48:33]~$ salloc --nodes=N</code>

The MPI program using //n=32// processes, for example, can be started on the allocated nodes as follows:

<code>[exampleusername@node001 14:48:33]~$ mpirun -np 32 my-mpi-binary</code>

Another way to use the allocated nodes is to use ssh to establish connections to them (see method one above).