Slurm



Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/


Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.


As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.

Thank you,

The Research Computing Team

SLURM is a queue management system; the name stands for Simple Linux Utility for Resource Management. SLURM was developed at Lawrence Livermore National Laboratory and currently runs some of the largest compute clusters in the world. SLURM is now the primary job manager on Cheaha. It replaces Sun Grid Engine (SGE), the job manager used earlier.

SLURM is similar in many ways to GridEngine or most other queue systems. You write a batch script then submit it to the queue manager (scheduler). The queue manager then schedules your job to run on the queue (or partition in SLURM parlance) that you designate. Below we will provide an outline of how to submit jobs to SLURM, how SLURM decides when to schedule your job, and how to monitor progress.


General SLURM Documentation

The primary source for documentation on SLURM usage and commands can be found at the SLURM site. If you Google for SLURM questions, you'll often see the Lawrence Livermore pages as the top hits, but these tend to be outdated.

A great way to get details on the SLURM commands is the man pages available from the Cheaha cluster. For example, if you type the following command:

man sbatch

you'll get the manual page for the sbatch command.

Logging on and Running Jobs from the command line

Once you've gone through the account setup procedure and obtained a suitable terminal application, you can log in to the Cheaha system via ssh:

 ssh blazerid@cheaha.rc.uab.edu

Existing users, follow these instructions to add SSH keys for the new system

Cheaha (new hardware) runs the CentOS 7 version of the Linux operating system, and commands are run under the "bash" shell (the default shell). There are a number of Linux and bash references, cheat sheets, and tutorials available on the web.

Typical Workflow

  • Stage data to $USER_SCRATCH (your scratch directory)
  • Determine how to run your code in "batch" mode. Batch mode typically means the ability to run it from the command line without requiring any interaction from the user.
  • Identify the appropriate resources needed to run the job. The following are mandatory resource requests for all jobs on Cheaha:
    • Number of processor cores required by the job
    • Maximum memory (RAM) required per core
    • Maximum runtime
  • Write a job script specifying queuing system parameters, resource requests, and the commands to run your program
  • Submit script to queuing system (sbatch script.job)
  • Monitor job (squeue)
  • Review the results and resubmit as necessary
  • Clean up the scratch directory by moving or deleting the data off of the cluster
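
The middle steps of this workflow can be sketched as a short shell session. This is only an illustration: the resource values are arbitrary placeholders, the file is written to the current directory rather than to $USER_SCRATCH, and the submit and monitor commands run only where Slurm is installed.

```shell
#!/bin/bash
# File name taken from the "sbatch script.job" step above.
job_script="script.job"

# Write a job script containing the three mandatory resource requests.
# The values are placeholders; choose ones that match your own job.
cat > "$job_script" <<'EOF'
#!/bin/bash
# Number of processor cores
#SBATCH --ntasks=1
# Maximum memory per core, in MB
#SBATCH --mem-per-cpu=1000
# Maximum runtime (HH:MM:SS)
#SBATCH --time=01:00:00

srun hostname
EOF

# Check the script for shell syntax errors before submitting it.
bash -n "$job_script" && echo "syntax OK"

# Submit and monitor only where the Slurm commands are available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
    squeue -u "$USER"
fi
```

Checking the script with bash -n first catches shell syntax errors before the job waits in the queue.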

Batch Job

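A batch job is a script that the scheduler runs for you, without any interaction from the user, once the requested resources become available. Use a batch job for work that can run unattended from the command line; use an interactive session when you need to control the application while it runs.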

For additional information on the sbatch command execute man sbatch at the command line to view the manual.

Example Batch Job Script

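A job consists of resource requests and tasks. The Slurm job scheduler interprets lines beginning with #SBATCH as Slurm arguments. In this example, the job requests 1 task, a time limit of 10 minutes (--time=10:00), and 100 MB of memory per CPU, and asks for an email notification if the job fails: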

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#SBATCH --mail-type=FAIL
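# Replace BLAZERID with your own ID; sbatch does not expand shell variables such as $USER in #SBATCH lines
#SBATCH --mail-user=BLAZERID@uab.edu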

srun hostname
srun sleep 60
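
The --time value in the script above, 10:00, means ten minutes, because Slurm reads a two-field time as minutes:seconds. As a rough illustration, the sketch below converts the time formats that appear on this page to seconds. The function name slurm_time_to_seconds is made up for this example and is not part of Slurm; it handles only plain minutes, MM:SS, HH:MM:SS, and days-HH[:MM[:SS]].

```shell
# Convert a Slurm time limit string to seconds (illustrative helper, not a Slurm command).
slurm_time_to_seconds() {
    echo "$1" | awk '{
        d = 0; t = $0; has_days = 0
        # A leading "days-" part changes how the remaining fields are read.
        if (index(t, "-") > 0) { split(t, dp, "-"); d = dp[1]; t = dp[2]; has_days = 1 }
        n = split(t, p, ":")
        if (has_days) {
            # days-hours[:minutes[:seconds]]
            h = p[1]; m = (n >= 2) ? p[2] : 0; s = (n >= 3) ? p[3] : 0
        } else if (n == 3) {
            # hours:minutes:seconds
            h = p[1]; m = p[2]; s = p[3]
        } else if (n == 2) {
            # minutes:seconds
            h = 0; m = p[1]; s = p[2]
        } else {
            # minutes only
            h = 0; m = p[1]; s = 0
        }
        print d * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 10:00        # prints 600
slurm_time_to_seconds 08:00:00     # prints 28800
slurm_time_to_seconds 2-00:00:00   # prints 172800
```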

Interactive Session

The login node (the host you connected to when you set up the SSH connection to Cheaha) is intended for submitting jobs and for the lighter prep work that job scripts require. Do not run heavy computations on the login node. If you have a heavier workload to prepare for a batch job (e.g., compiling code or other manipulations of data), or your compute application requires interactive control, you should request a dedicated interactive node for this work.

Interactive resources are requested by submitting an "interactive" job to the scheduler. Interactive jobs will provide you a command line on a compute resource that you can use just like you would the command line on the login node. The difference is that the scheduler has dedicated the requested resources to your job and you can run your interactive commands without having to worry about impacting other users on the login node.

Interactive jobs that run on the command line are requested with the srun command:

srun --ntasks=4 --mem-per-cpu=4096 --time=08:00:00 --partition=medium --job-name=JOB_NAME --pty /bin/bash

This command requests 4 cores (--ntasks), each with 4 GB of RAM (--mem-per-cpu, given in MB), for 8 hours (--time) on the medium partition.
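
Because memory is requested per core, the total for the job is the per-core amount multiplied by the number of cores. A small sketch of that arithmetic for the srun example, assuming one CPU per task (Slurm's default when --cpus-per-task is not given); the variable names are only for illustration:

```shell
# Values taken from the srun example above.
ntasks=4
mem_per_cpu_mb=4096

# Total memory for the job, assuming one CPU per task.
total_mb=$(( ntasks * mem_per_cpu_mb ))
total_gb=$(( total_mb / 1024 ))

echo "Total memory requested: ${total_mb} MB (${total_gb} GB)"
# prints: Total memory requested: 16384 MB (16 GB)
```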

More advanced interactive scenarios that support graphical applications are available using VNC or X11 tunneling (for example, X-Win32 2014 for Windows).

Interactive jobs that require a graphical application are requested with the sinteractive command, run from a terminal in your VNC window:

sinteractive --ntasks=4 --mem-per-cpu=4096 --time=08:00:00 --partition=medium --job-name=JOB_NAME 

Job List - SQUEUE

To check your job status, use the following command:

squeue -u BLAZERID

The following fields are displayed when you run squeue:

JOBID - ID assigned to your job by the SLURM scheduler
PARTITION - Partition your job is assigned to, which depends on the time requested: express (max 2 hrs), short (max 12 hrs), medium (max 50 hrs), long (max 150 hrs), sinteractive (0-2 hrs)
NAME - Job name given by the user
USER - User who started the job
ST - State your job is in. The typical states are PENDING (PD), RUNNING (R), SUSPENDED (S), COMPLETING (CG), and COMPLETED (CD)
TIME - Time for which your job has been running
NODES - Number of nodes your job is running on
NODELIST - Node(s) on which the job is running

For more details on squeue, run man squeue.
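
The partition time limits listed under PARTITION can be turned into a small lookup. The sketch below is an illustration only: pick_partition is a made-up helper, not a Cheaha or Slurm command, and choosing the shortest partition that fits is an assumption of this example rather than a site rule.

```shell
# Print the shortest partition whose time limit covers the requested whole hours.
# Limits are the ones listed above: express 2, short 12, medium 50, long 150.
pick_partition() {
    if   [ "$1" -le 2 ];   then echo express
    elif [ "$1" -le 12 ];  then echo short
    elif [ "$1" -le 50 ];  then echo medium
    elif [ "$1" -le 150 ]; then echo long
    else
        echo "no partition allows $1 hours" >&2
        return 1
    fi
}

pick_partition 8    # prints: short
pick_partition 48   # prints: medium
```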

Job Status

SSTAT

The sstat command shows status and metric information for a running job.

NOTE: The job parts must be executed using srun; otherwise sstat will not display useful output.

[rcs@login001 ~]$ sstat 36145

TODO: paste example output

For more details on sstat, run man sstat.

SCONTROL

The scontrol show command prints the detailed state and settings of a job:

$ scontrol show jobid -dd 123

JobId=123 JobName=SLI
   UserId=rcuser(1000) GroupId=rcuser(1000)
   Priority=4294898073 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=06:27:02 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2016-09-12T14:40:20 EligibleTime=2016-09-12T14:40:20
   StartTime=2016-09-12T14:40:20 EndTime=2016-09-12T22:40:21
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=medium AllocNode:Sid=login001:123
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c0003
   BatchHost=c0003
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=10000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=c0003 CPU_IDs=0-23 Mem=10000
   MinCPUsNode=1 MinMemoryNode=10000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share/apps/rc/git/rc-sched-scripts/bin/_interactive
   WorkDir=/scratch/user/rcuser/work/other/rhea/Gray/MERGED
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null
   Power= SICP=0
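
Since this output is a series of Key=Value pairs, a single field can be pulled out with standard text tools. The following is a minimal sketch; job_field is a hypothetical helper name, and it assumes the value you want contains no spaces. The sample text is copied from the output above so the example runs without Slurm.

```shell
# Print the value of one Key=Value field read from standard input.
job_field() {
    # Put each Key=Value pair on its own line, then select the requested key.
    tr -s ' ' '\n' | grep "^$1=" | head -n 1 | cut -d= -f2-
}

# Two lines copied from the scontrol output above.
sample='   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=06:27:02 TimeLimit=08:00:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState    # prints: RUNNING
printf '%s\n' "$sample" | job_field TimeLimit   # prints: 08:00:00
```

On the cluster, the same helper could read live output, for example: scontrol show jobid -dd 123 | job_field JobState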

Job History

The sacct command, or our wrapper script rc-sacct, displays historical job information.

This example uses the rc-sacct wrapper script; for comparison, the equivalent sacct command is shown first:

$ sacct --starttime 2016-08-30 \
      --allusers \
      --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
$ rc-sacct --allusers --starttime 2016-08-30

     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
 kxxxxxxx 34308        Connectom+ interacti+    PENDING   08:00:00             Unknown             Unknown   00:00:00                              1          4   None assigned
 kxxxxxxx 34310        Connectom+ interacti+    PENDING   08:00:00             Unknown             Unknown   00:00:00                              1          4   None assigned
 dxxxxxxx 35927         PK_htseq1     medium  COMPLETED 2-00:00:00 2016-08-30T09:21:33 2016-08-30T10:06:25   00:44:52                              1          4       c0005
          35927.batch       batch             COMPLETED            2016-08-30T09:21:33 2016-08-30T10:06:25   00:44:52    307704K    718152K        1          4       c0005
 bxxxxxxx 35928                SI     medium    TIMEOUT   12:00:00 2016-08-30T09:36:04 2016-08-30T21:36:42   12:00:38                              1          1       c0006
          35928.batch       batch                FAILED            2016-08-30T09:36:04 2016-08-30T21:36:43   12:00:39     31400K    286532K        1          1       c0006
          35928.0        hostname             COMPLETED            2016-08-30T09:36:16 2016-08-30T09:36:17   00:00:01      1112K    207252K        1          1       c0006

Additional information about the sacct command can be found by running man sacct.
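
The MaxRSS and MaxVMSize columns carry a K suffix, which can be hard to compare with a memory request given in MB. A small conversion sketch follows, assuming the K suffix means units of 1024 bytes; kib_to_mib is a made-up helper name, not part of Slurm or rc-sacct.

```shell
# Convert a value such as 307704K to MB, rounded to one decimal place.
kib_to_mib() {
    # Strip the trailing K, then divide by 1024.
    echo "${1%K}" | awk '{ printf "%.1f\n", $1 / 1024 }'
}

kib_to_mib 307704K   # MaxRSS of step 35927.batch above; prints 300.5
```

Comparing this figure with the memory the job requested shows how much headroom the request left.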

The rc-sacct wrapper script supports the following arguments:

$ rc-sacct --help

  Copyright (c) 2016 Mike Hanby, University of Alabama at Birmingham IT Research Computing.

  rc-sacct - version 1.0.0

  Run sacct to display history in a nicely formatted output.

    -r, --starttime                  HH:MM[:SS] [AM|PM]
                                     MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
                                     MM/DD[/YY]-HH:MM[:SS]
                                     YYYY-MM-DD[THH:MM[:SS]]
    -a, --allusers                   Dispay hsitory for all users)
    -u, --user user_list             Display hsitory for all users in the comma seperated user list
    -f, --format a,b,c               Comma separated list of columns: i.e. --format jobid,elapsed,ncpus,ntasks,state
        --debug                      Display additional output like internal structures
    -?, -h, --help                   Display this help message


SGE - SLURM

This section shows equivalent SGE and SLURM commands:

   SGE                   SLURM  
---------             ------------
  qsub                  sbatch   
  qlogin                sinteractive
  qdel                  scancel
  qstat                 squeue (-u BLAZERID)
  

To get more information about an individual command, run: man SLURM_COMMAND
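
For anyone converting old SGE scripts, the table above can also be written as a small lookup function. This is only a sketch; sge_to_slurm is a hypothetical helper and covers just the four commands listed in the table.

```shell
# Print the SLURM command that corresponds to an SGE command from the table above.
sge_to_slurm() {
    case "$1" in
        qsub)   echo sbatch ;;
        qlogin) echo sinteractive ;;
        qdel)   echo scancel ;;
        qstat)  echo squeue ;;
        *)      echo "unknown SGE command: $1" >&2; return 1 ;;
    esac
}

sge_to_slurm qdel   # prints: scancel
```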