Slurm Cluster

From ImageWiki


Revision as of 13:25, 18 January 2018

All information on this page is subject to change! In particular, hostnames are going to be replaced by nice alias names.

The cluster consists of two partitions:

  • image1 with 11 compute nodes with 2x20 Intel-Cores each
  • image2 with 1 compute node with 8x8 AMD cores


Basic Access and First Time Setup

To access the cluster, you first need access to the login gateway. Ask one of the admins to grant you access. With <kuid> being your ku-username, you then reach the cluster in two steps:

  ssh <kuid>
  ssh a00552

The simplest way is to add an entry in .ssh/config

   Host cluster
   HostName a00552
   User <kuid>
   ProxyCommand ssh -q -W %h:%p <kuid>

With this in place, you can directly login via

  ssh cluster

This will also come in handy if you want to copy your files via scp

  scp -r my_file1 my_file2 my_folder/ cluster:~/Dir

This will copy my_file1, my_file2 and my_folder/ into the path /home/<kuid>/Dir/. All files in your home directory are available on all compute nodes. You can copy results back just as easily with

  scp -r cluster:~/Dir ./

After your first login, you have to set up a private key which allows password-free login to any of the other nodes. This is required for Slurm to function properly! Simply execute the following. When asked for a password, leave it blank:

   ssh-copy-id a00553
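
The instruction to leave the password blank suggests that a key pair is generated before it is copied. The following is a minimal sketch of both steps, not the official procedure: the function name setup_cluster_key, the RSA key type and the default key path are assumptions made for illustration. Because the home directory is shared between the nodes, installing the key on one node should be enough.

```shell
#!/bin/bash
# Sketch only: key type and key path are assumptions, not cluster requirements.
setup_cluster_key() {
    local key="$HOME/.ssh/id_rsa"
    # Create a key pair with an empty passphrase unless one already exists.
    if [ ! -f "$key" ]; then
        ssh-keygen -t rsa -b 4096 -N "" -f "$key"
    fi
    # Install the public key on the node named in the text above.
    ssh-copy-id -i "${key}.pub" a00553
}
```

Run setup_cluster_key once on the cluster after your first login. A key without a passphrase is convenient for Slurm, but anyone who can read the key file can use it, so keep it inside the cluster.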

Using Slurm

Slurm is a batch processing manager which allows you to submit tasks and request a specific amount of resources to be reserved for the job. Resources are, for example, memory, number of processing cores, GPUs or even a number of machines. Moreover, Slurm allows you to start arrays of jobs easily, for example to benchmark an algorithm with different parameter settings. When a job is submitted, it is added to the waiting queue and stays there until the required resources are available. Slurm is therefore well suited for executing long-running tasks.

To see how many jobs are queued type


To submit a job use


Here the file is a normal bash or sh script that also contains information about the resources to allocate. Jobs run on the node in the same directory you were in when you submitted the job. This means that storing files relative to your current path will work as expected.
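
On a standard Slurm installation, the queue is listed with squeue and a script is submitted with sbatch. The sketch below shows both; the file name my_job.sh is hypothetical, and the surrounding check only makes the snippet safe to paste on a machine where Slurm is not installed.

```shell
#!/bin/bash
# Hypothetical script name, used only for illustration.
job_script="my_job.sh"

if command -v sbatch >/dev/null 2>&1; then
    squeue                  # list all pending and running jobs
    squeue -u "$USER"       # restrict the list to your own jobs
    sbatch "$job_script"    # submit the batch script to the queue
else
    echo "Slurm client tools not found on this machine" >&2
fi
```

sbatch prints the ID of the new job, which you can then look up in the squeue output.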

Examples for BatchScripts

Minimal Example

A quite minimal script looks like:

   #!/bin/bash
   #SBATCH --job-name=MyJob
   #number of independent tasks we are going to start in this script
   #SBATCH --ntasks=1
   #number of cpus we want to allocate for each program
   #SBATCH --cpus-per-task=4
   #We expect that our program should not run longer than 2 days
   #Note that a program will be killed once it exceeds this time!
   #SBATCH --time=2-00:00:00
   #Skipping many options! see man sbatch
   # From here on, we can start our program
   ./my_program option1 option2 option3

Start an Array of Jobs using Matlab

This is another small script that starts an array of several independent jobs using Matlab. The script assumes that the current folder contains a function called "myMatlabScript" which takes the task number as its single argument. Internally, the function then assigns the chosen hyperparameters based on the task number, e.g. by using it as an index into an array, and runs the experiment. Please take care that every task number saves its results in different files, otherwise the processes will overwrite each other. In the script, the number of cores is restricted to 4 for each task in the array, so the whole array takes 7 x 4 = 28 cores.

   #!/bin/bash
   #SBATCH --job-name=ArrayMatlab
   # we start 7 tasks numbered 1-7
   #SBATCH --array=1-7
   #number of cpus we want to allocate for each task
   #SBATCH --cpus-per-task=4
   # max run time is 24 hours
   #SBATCH --time=24:00:00
   # start matlab
   matlab -minimize -nosplash -nodesktop -r "myMatlabScript(${SLURM_ARRAY_TASK_ID})"
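
The idea of using the task number as an index can be sketched in bash as well, which may help when the program is not written in Matlab. Everything named here is hypothetical: the parameter values, the helper param_for_task and the result file name are made up for illustration. Outside of a Slurm job the task ID falls back to 1, so the snippet can be tried locally.

```shell
#!/bin/bash
# Hypothetical hyperparameter values, one per array task (IDs 1-7).
params=(0.001 0.01 0.1 1 10 100 1000)
cpus_per_task=4

# Print the parameter that belongs to a 1-based task ID.
param_for_task() {
    local task_id="$1"
    echo "${params[$((task_id - 1))]}"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 1 otherwise.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
param="$(param_for_task "$task_id")"

# One result file per task, so that tasks never overwrite each other.
outfile="result_task_${task_id}.txt"
echo "task ${task_id}: parameter ${param}" > "$outfile"

# Cores needed if all tasks of the array run at the same time.
total_cores=$(( ${#params[@]} * cpus_per_task ))
echo "array needs up to ${total_cores} cores"
```

Putting the task ID into the file name is the simplest way to keep the outputs of the seven tasks apart.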

Administrative Commands

Note: This section is for administrative purposes. Once it becomes too big, we will move it into another entry.

Rebooting Crashed Nodes

After a crashed node has been rebooted, Slurm will not trust it anymore. Querying its state will look like:

  $ sudo scontrol show node a00562
  NodeName=a00562 Arch=x86_64 CoresPerSocket=10
     Reason=Node unexpectedly rebooted

If we are sure that there is no hardware fault, we can simply tell Slurm to resume operations with this node:

   sudo scontrol update NodeName=a00562 State=RESUME


When a maintenance window is scheduled, we want to drain the nodes, i.e. only jobs that will terminate before the maintenance starts are allowed to run. This can be done using:

   sudo scontrol create reservation starttime=2017-10-12T16:00:00 duration=600 user=root flags=maint,ignore_jobs nodes=ALL

Here we create a maintenance window of 600 minutes starting on 12 October 2017 at 4pm. When the maintenance window arrives, we can shut down Slurm on all nodes using

   sudo scontrol shutdown

When the machines get rebooted, the slurm daemons will also come up automatically.
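
Since the reservation duration is given in minutes, it can be useful to double-check when the window ends before announcing it. The following sketch does the arithmetic for the reservation shown above; it assumes GNU date and computes in UTC only to keep daylight-saving changes out of the calculation.

```shell
#!/bin/bash
# Values taken from the reservation command above; duration is in minutes.
start="2017-10-12 16:00:00"
duration_min=600

# Convert the start to epoch seconds, add the duration, and format the result.
start_epoch="$(date -u -d "$start" +%s)"
end="$(date -u -d "@$((start_epoch + duration_min * 60))" +%Y-%m-%dT%H:%M:%S)"
echo "maintenance window ends at ${end}"   # prints 2017-10-13T02:00:00
```

On the cluster itself, scontrol show reservation lists the reservations Slurm currently knows about, including their start and end times.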
