Slurm
Slurm is a queue management system; the name stands for Simple Linux Utility for Resource Management. Slurm was developed at Lawrence Livermore National Laboratory and currently runs some of the largest compute clusters in the world. Slurm is now the primary job manager on Cheaha; it replaces Sun Grid Engine (SGE), the job manager used previously.
Slurm is similar in many ways to Grid Engine and most other queue systems: you write a batch script and submit it to the queue manager (scheduler), which then schedules your job to run on the queue (or partition, in Slurm parlance) that you designate. Below we outline how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.
General Slurm Documentation
The primary source for documentation on Slurm usage and commands can be found at the Slurm site. If you Google for Slurm questions, you'll often see the Lawrence Livermore pages as the top hits, but these tend to be outdated.
The SLURM QuickStart Guide provides a very useful overview of how SLURM treats a cluster as pool of resources which you can allocate to get your work done. The Example section on that page is a very useful orientation to SLURM environments.
The SLURM Tutorial at CECI, a European Consortium of HPC sites, provides a very good introduction on submitting single threaded, multi-threaded, and MPI jobs.
A great way to get details on Slurm commands is the man pages available on the Cheaha cluster. For example, if you type the following command:
man sbatch
you'll get the manual page for the sbatch command.
Slurm Partitions
Cheaha has the following Slurm partitions defined (these can be thought of as the equivalent of SGE queues); the lower the number, the higher the priority.
Note: Jobs must request the appropriate partition (e.g. --partition=short) to satisfy the job's resource requests (maximum runtime, number of compute nodes, etc.).
Partition | Max Runtime | Max Compute Nodes | Priority |
---|---|---|---|
express | 2 hours | No Limit | 2 |
short | 12 hours | 44 | 4 |
medium | 2 days 2 hours | 44 | 6 |
long | 6 days 6 hours | 22 | 8 |
interactive | 2 hours | 1 | 10 |
Logging on and Running Jobs from the command line
Once you've gone through the account setup procedure and obtained a suitable terminal application, you can log in to the Cheaha system via ssh:
ssh BLAZERID@cheaha.rc.uab.edu
Alternatively, existing users could follow these instructions to add SSH keys and access the new system.
Cheaha (new hardware) runs the CentOS 7 version of the Linux operating system, and commands are run under the "bash" shell (the default shell). There are a number of Linux and bash references, cheat sheets, and tutorials available on the web.
Typical Workflow
- Stage data to $USER_SCRATCH (your scratch directory)
- Determine how to run your code in "batch" mode. Batch mode typically means the ability to run it from the command line without requiring any interaction from the user.
- Identify the appropriate resources needed to run the job. The following are mandatory resource requests for all jobs on Cheaha:
- Number of processor cores required by the job
- Maximum memory (RAM) required per core
- Maximum runtime
- Write a job script specifying queuing system parameters, resource requests, and the commands to run your program
- Submit script to queuing system (sbatch script.job)
- Monitor job (squeue)
- Review the results and resubmit as necessary
- Clean up the scratch directory by moving or deleting the data off of the cluster
Slurm Job Types
Batch Job
A batch job is defined by a job script that the scheduler runs unattended once the requested resources become available. Unlike an interactive job, it does not require you to stay logged in while it runs, which makes it the preferred way to run long or repeatable workloads.
For additional information on the sbatch command execute man sbatch at the command line to view the manual.
Example Batch Job Script
A job consists of resource requests and tasks. The Slurm job scheduler interprets lines beginning with #SBATCH as Slurm arguments. In this example, the job requests a single task on the express partition, 100 MB of memory per CPU, and a 10-minute time limit.
Note: Jobs must request the appropriate partition (e.g. --partition=short) to satisfy the job's resource requests (maximum runtime, number of compute nodes, etc.).
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --partition=express
#
# Time format = HH:MM:SS, DD-HH:MM:SS
#
#SBATCH --time=10:00
#
# Minimum memory required per allocated CPU in MegaBytes.
#
#SBATCH --mem-per-cpu=100
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS

srun hostname
srun sleep 60
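To submit the script above, pass it to sbatch; the scheduler replies with the job ID it assigned. The file name test.job below is only an example, not a file provided by the cluster.

# Submit the job script (the name test.job is just an example)
sbatch test.job
# sbatch prints something like: Submitted batch job <jobid>
# Check on the job while it runs, then view the output file when it finishes
squeue -u $USER
cat res.txt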
Click here for more example SLURM job scripts.
Interactive Job
The login node (the host that you connect to when you set up the SSH connection to Cheaha) should be used only for submitting jobs and for light preparatory work related to your job scripts. Do not run heavy computations on the login node. If you have a heavier workload to prepare for a batch job (e.g. compiling code or other manipulation of data), or your compute application requires interactive control, you should request a dedicated interactive node for this work.
Interactive resources are requested by submitting an "interactive" job to the scheduler. Interactive jobs will provide you a command line on a compute resource that you can use just like you would the command line on the login node. The difference is that the scheduler has dedicated the requested resources to your job and you can run your interactive commands without having to worry about impacting other users on the login node.
Interactive command-line jobs are requested with the srun command:
srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4096 --time=08:00:00 --partition=medium --job-name=JOB_NAME --pty /bin/bash
This command requests 4 cores (--cpus-per-task) for a single task (--ntasks), 4 GB of RAM per core (--mem-per-cpu), and a runtime of 8 hours (--time) on the medium partition.
More advanced interactive scenarios that support graphical applications are available using VNC or X11 tunneling (e.g. X-Win32 2014 for Windows).
Interactive jobs that run a graphical application are requested with the sinteractive command, issued from a terminal inside your VNC session:
sinteractive --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4096 --time=08:00:00 --partition=medium --job-name=JOB_NAME
Requesting GPUs
To request an interactive session on one of the GPU nodes (c0089-c0092 have K80s, c0097-c0114 have P100s), add the --gres parameter to the srun or sinteractive command; a batch-script example is sketched after the notes below.
srun --ntasks=1 --cpus-per-task=1 --mem-per-cpu=4096 --time=08:00:00 --partition=pascalnodes --job-name=JOB_NAME --gres=gpu:1 --pty /bin/bash
sinteractive --ntasks=1 --cpus-per-task=1 --mem-per-cpu=4096 --time=08:00:00 --partition=pascalnodes --job-name=JOB_NAME --gres=gpu:1
NOTE:
- If you want to use more than one GPU on the node, increase the value in --gres=gpu:[1-4]
- If you want to use the P100s, use the 'pascalnodes' partition, whereas for the K80s use one of the express, short, medium, or long partitions.
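The same --gres request works in a batch script. The sketch below is an illustration under assumptions, not site-provided documentation: the job name, output file, and the use of nvidia-smi to confirm the GPU allocation are assumptions, and any CUDA modules your program needs would have to be loaded separately.

#!/bin/bash
#SBATCH --job-name=gpu_test        # example name (assumption)
#SBATCH --output=gpu_test.out      # example output file (assumption)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4096
#SBATCH --time=08:00:00
#SBATCH --partition=pascalnodes    # P100 nodes; use express/short/medium/long for K80s
#SBATCH --gres=gpu:1               # raise to gpu:2 ... gpu:4 for more GPUs

# Show which GPU(s) Slurm assigned to this job (assumes nvidia-smi is on the GPU nodes)
nvidia-smi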
MPI Job
An MPI job runs multiple tasks (--ntasks), typically spread across several compute nodes, that communicate through the Message Passing Interface. A minimal example script is sketched below.
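This is a minimal sketch only, assuming an MPI implementation is available through the environment modules system; the module name shown (openmpi) and the program name ./my_mpi_program are placeholders, not names confirmed by this documentation.

#!/bin/bash
#SBATCH --job-name=mpi_test        # example name (assumption)
#SBATCH --output=mpi_test.out      # example output file (assumption)
#SBATCH --ntasks=8                 # total number of MPI ranks
#SBATCH --mem-per-cpu=2048
#SBATCH --time=02:00:00
#SBATCH --partition=medium

# Load an MPI implementation; the exact module name depends on the cluster (assumption)
module load openmpi

# srun launches one copy of the program per task
srun ./my_mpi_program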
OpenMP / SMP Job
OpenMP / SMP jobs are those that use multiple CPU cores on a single compute node.
It is very important to structure an SMP job properly so that the requested CPU cores are assigned to the same compute node. The following example requests 4 CPU cores by setting --ntasks to 1 and --cpus-per-task to 4:
srun --partition=short \
     --ntasks=1 \
     --cpus-per-task=4 \
     --mem-per-cpu=1024 \
     --time=5:00:00 \
     --job-name=rsync \
     --pty /bin/bash
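The same layout applies in a batch script; a common pattern is to hand the Slurm allocation to OpenMP through OMP_NUM_THREADS. The sketch below is an illustration: the job name, output file, and program name ./my_openmp_program are placeholders.

#!/bin/bash
#SBATCH --job-name=smp_test        # example name (assumption)
#SBATCH --output=smp_test.out      # example output file (assumption)
#SBATCH --ntasks=1                 # one task ...
#SBATCH --cpus-per-task=4          # ... with 4 cores on the same node
#SBATCH --mem-per-cpu=1024
#SBATCH --time=5:00:00
#SBATCH --partition=short

# Tell OpenMP to use exactly the cores Slurm allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_openmp_program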
Job Status
SQUEUE
To check your job status, you can use the following command
squeue -u $USER
The following fields are displayed when you run squeue:
- JOBID - ID assigned to your job by the Slurm scheduler
- PARTITION - Partition your job is assigned to, which depends on the time requested: express (max 2 hrs), short (max 12 hrs), medium (max 50 hrs), long (max 150 hrs), sinteractive (0-2 hrs)
- NAME - Job name given by the user
- USER - User who started the job
- ST - State your job is in; the typical states are PENDING (PD), RUNNING (R), SUSPENDED (S), COMPLETING (CG), and COMPLETED (CD)
- TIME - Time for which your job has been running
- NODES - Number of nodes your job is running on
- NODELIST - Node(s) on which the job is running
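If the default columns truncate long job names, you can choose the fields yourself with squeue's --format option; the format string below is just one possible selection.

# Wider, custom columns: job ID, partition, name, state, elapsed time, node count, reason/nodelist
squeue -u $USER --format="%.10i %.12P %.30j %.8T %.10M %.6D %R"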
For more details on squeue, go here.
SSTAT
The sstat command shows status and metric information for a running job.
NOTE: the job steps must be launched with srun, otherwise sstat will not display useful output.
[rcs@login001 ~]$ sstat 256483
       JobID  MaxVMSize MaxVMSizeNode MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
    256483.0   1962728K          c0043              1   1960633K     91920K      c0043          3     91867K      67K        c0043              3        50K  00:00.000      c0043          0  00:00.000        8      1.20G       Unknown       Unknown       Unknown              0           1M           c0043               5           1M        0.34M            c0043                5        0.34M
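The default output is very wide; the --format option lets you pick just the fields you care about. The job ID below is a placeholder.

# Show a handful of useful metrics for a running job (replace 256483 with your job ID)
sstat -j 256483 --format=JobID,AveCPU,MaxRSS,MaxVMSize,NTasks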
For more details on sstat, go here.
SCONTROL
The scontrol show jobid command displays detailed information about a single job, including the assigned nodes, CPU IDs, and memory:
$ scontrol show jobid -dd 123
JobId=123 JobName=SLI
   UserId=rcuser(1000) GroupId=rcuser(1000)
   Priority=4294898073 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=06:27:02 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2016-09-12T14:40:20 EligibleTime=2016-09-12T14:40:20
   StartTime=2016-09-12T14:40:20 EndTime=2016-09-12T22:40:21
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=medium AllocNode:Sid=login001:123
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c0003
   BatchHost=c0003
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=10000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=c0003 CPU_IDs=0-23 Mem=10000
   MinCPUsNode=1 MinMemoryNode=10000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share/apps/rc/git/rc-sched-scripts/bin/_interactive
   WorkDir=/scratch/user/rcuser/work/other/rhea/Gray/MERGED
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null
   Power= SICP=0
Job History
Job history for completed jobs can be viewed with the sacct command or with the rc-sacct wrapper script. The example below uses the rc-sacct wrapper script; for comparison, here is the equivalent sacct command:
$ sacct --starttime 2016-08-30 \
        --allusers \
        --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
$ rc-sacct --allusers --starttime 2016-08-30
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize  NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
 kxxxxxxx 34308        Connectom+ interacti+    PENDING   08:00:00             Unknown             Unknown   00:00:00                              1          4   None assigned
 kxxxxxxx 34310        Connectom+ interacti+    PENDING   08:00:00             Unknown             Unknown   00:00:00                              1          4   None assigned
 dxxxxxxx 35927        PK_htseq1      medium  COMPLETED 2-00:00:00 2016-08-30T09:21:33 2016-08-30T10:06:25   00:44:52                              1          4           c0005
          35927.batch       batch             COMPLETED            2016-08-30T09:21:33 2016-08-30T10:06:25   00:44:52    307704K    718152K        1          4           c0005
 bxxxxxxx 35928                SI      medium    TIMEOUT   12:00:00 2016-08-30T09:36:04 2016-08-30T21:36:42   12:00:38                              1          1           c0006
          35928.batch       batch                FAILED            2016-08-30T09:36:04 2016-08-30T21:36:43   12:00:39     31400K    286532K        1          1           c0006
          35928.0        hostname             COMPLETED            2016-08-30T09:36:16 2016-08-30T09:36:17   00:00:01      1112K    207252K        1          1           c0006
Additional information about the sacct command can be found by running man sacct or here.
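For a quick look at a single finished job, sacct can also be pointed at one job ID; the ID and field selection below are only an example.

# Summarize one job and its steps (replace 35927 with your job ID)
sacct -j 35927 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode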
The rc-sacct wrapper script supports the following arguments:
$ rc-sacct --help
Copyright (c) 2016 Mike Hanby, University of Alabama at Birmingham IT Research Computing.

rc-sacct - version 1.0.0

Run sacct to display history in a nicely formatted output.

  -r, --starttime   HH:MM[:SS] [AM|PM]
                    MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
                    MM/DD[/YY]-HH:MM[:SS]
                    YYYY-MM-DD[THH:MM[:SS]]
  -a, --allusers    Display history for all users
  -u, --user        user_list : Display history for all users in the comma separated user list
  -f, --format      a,b,c : Comma separated list of columns: i.e. --format jobid,elapsed,ncpus,ntasks,state
      --debug       Display additional output like internal structures
  -?, -h, --help    Display this help message
Slurm Variables
The following is a list of useful Slurm environment variables (click here for the full list); a short script that uses a few of them is sketched after the table:
Variable | Description |
---|---|
SLURM_NTASKS | Total number of processes in the current job (and SLURM_NPROCS for backwards compatibility) |
SLURM_NODELIST | List of nodes allocated to the job |
SLURM_NNODES | Total number of nodes in the job's resource allocation |
SLURM_JOB_NAME | Set to the value of the --job-name option or the command name when srun is used to create a new job allocation. Not set when srun is used only to create a job step (i.e. within an existing job allocation) |
SLURM_JOB_ID | Job id of the executing job (and SLURM_JOBID for backwards compatibility) |
SLURM_SUBMIT_DIR | The directory from which srun was invoked |
SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. Values are comma separated and in the same order as SLURM_NODELIST. If two or more consecutive nodes are to have the same task count, that count is followed by "(x#)" where "#" is the repetition count. For example, "SLURM_TASKS_PER_NODE=2(x3),1" indicates that the first three nodes will each execute two tasks and the fourth node will execute one task |
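These variables are handy inside job scripts, for example to record where a job ran or to keep runs separated. The short sketch below only echoes a few of them; the job name and the per-job directory pattern are assumptions for illustration.

#!/bin/bash
#SBATCH --job-name=env_demo        # example name (assumption)
#SBATCH --ntasks=2
#SBATCH --time=00:05:00
#SBATCH --partition=express

# Report where and how the job is running
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) started from $SLURM_SUBMIT_DIR"
echo "Running $SLURM_NTASKS task(s) on $SLURM_NNODES node(s): $SLURM_NODELIST"

# A per-job work directory keeps runs from overwriting each other (illustrative pattern)
mkdir -p "$USER_SCRATCH/$SLURM_JOB_ID"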
SGE - Slurm
This section shows Slurm and SGE equivalent commands
SGE        Slurm
---------  ------------
qsub       sbatch
qlogin     sinteractive
qdel       scancel
qstat      squeue
To get more information about individual commands, run man SLURM_COMMAND. For an extensive list of Slurm-SGE equivalent commands, go here or see Slurm's official documentation.