4.1. Cluster queues, resources and limits

Cluster queues

A cluster queue is a resource that can handle and execute user jobs. Depending on a job's demands, it will be executed on one queue or another. Every queue has its own limits, behavior and default values. Currently, the Snow cluster has six queues, shown in the following table:

Queue name  Allowed use            Comment
short.q     Batch processing       Intended for short, low-CPU jobs that must be dispatched and processed quickly.
default.q   Batch processing       Intended for long-running, CPU-intensive, memory-intensive jobs on AMD processors.
intel.q     Batch processing       Intended for long-running, CPU-intensive, memory-intensive jobs on Intel processors.
cuda.q      Batch processing       Intended for computationally intensive and data-intensive problems that use multicore processors and GPUs.
inter.q     Interactive sessions   Intended for managing interactive sessions on cluster nodes. Limited resources.
all.q       Batch and interactive  For testing and administration purposes only.

All queues are defined with some common parameters. Unless specified otherwise, these parameters are inherited by all the jobs that run on these queues. They impose limits on the jobs that run inside a given queue, for example on run time or consumed resources. As an example, this is the configuration of the short.q queue:

ijimenez@login:~$ qconf -sq short.q
qname                 short.q
hostlist              @allhosts
seq_no                1
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               make smp ompi matlab
rerun                 FALSE
slots                 1,[@abudhabi=64]
tmpdir                /scratch
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
[ ... ]
terminate_method      NONE
notify                00:00:60
[ ... ]
initial_state         default
s_rt                  01:55:00
h_rt                  02:00:00
s_cpu                 127:50:00
h_cpu                 128:00:00
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
ijimenez@login:~$ 

The parameters s_rt, h_rt, s_cpu and h_cpu impose the corresponding wall time and CPU time limits on every job submitted to this queue. The s_ values are soft limits and the h_ values are hard limits.
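To work with these limits in a script, the listing can be read into a dictionary and the time values converted to seconds. The following is a minimal sketch, not part of the cluster tooling: the helper names parse_hms and parse_queue_config are hypothetical, and the sample text is an excerpt of the short.q listing above rather than live qconf output.

```python
def parse_hms(value):
    """Convert an HH:MM:SS limit to seconds; return None for INFINITY."""
    if value.upper() == "INFINITY":
        return None
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_queue_config(text):
    """Turn 'qconf -sq' style output into a {parameter: value} dictionary."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # parameter name, then the rest of the line
        if len(parts) == 2:
            config[parts[0]] = parts[1].strip()
    return config


# Excerpt of the short.q listing shown above (hard-coded for illustration).
SAMPLE = """\
qname                 short.q
s_rt                  01:55:00
h_rt                  02:00:00
s_cpu                 127:50:00
h_cpu                 128:00:00
"""

config = parse_queue_config(SAMPLE)
for name in ("s_rt", "h_rt", "s_cpu", "h_cpu"):
    print(name, parse_hms(config[name]))
```

With the excerpt above, h_rt becomes 7200 seconds (2 hours) and h_cpu becomes 460800 seconds (128 hours), which are the wall time and CPU time limits listed for short.q.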

Cluster limits

When a user registers a job with the scheduler, limits are applied. If the job's requirements are higher than the currently available resources, the job will wait in the queue until enough resources are free. But if the job's requirements are higher than the limits, the job cannot be registered. The limits are set up at three different levels: user, research group and queue.
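The rule described in this paragraph can be sketched as a small decision function. This is only an illustration of the stated behavior, not the scheduler's actual code: the function name classify_request and the example numbers for free slots are hypothetical, while the limit of 32 slots is the per-user limit of short.q.

```python
def classify_request(requested_slots, slot_limit, free_slots):
    """Return what happens to a job request under the rule described above."""
    if requested_slots > slot_limit:
        # Above the configured limit: the job cannot be registered.
        return "rejected"
    if requested_slots > free_slots:
        # Within the limit, but not enough free resources right now.
        return "waiting"
    return "runnable"


USER_LIMIT = 32   # per-user slot limit of short.q
FREE_SLOTS = 16   # hypothetical number of slots free at submission time

for requested in (8, 24, 64):
    print(requested, classify_request(requested, USER_LIMIT, FREE_SLOTS))
```

In this example, a request for 8 slots can run immediately, a request for 24 slots waits until more slots are free, and a request for 64 slots is refused because it exceeds the limit.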

Cluster limits are defined as resource quotas, and are explained in the following tables:

Table 1. short.q limits

Item                                      Limit             Comment
Queue type                                Batch processing  No interactive user sessions allowed
Wall time                                 2 hours           Every job can run for up to two hours of wall time in the cluster, no matter how many CPUs it uses
CPU time                                  128 hours         Every job can use up to 128 hours of CPU time in total. That is, a job may stay on the system for 2 hours using all queue resources: 2 hours * 64 cores = 128 hours of calculation
Maximum user allocatable slots            32 cores          A single user can allocate up to 32 slots, so the maximum parallelism allowed in this queue is 32
Maximum research group allocatable slots  256 cores         Researchers in the same research group can allocate up to 256 processors on this queue
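The CPU time figure in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes, as the table's comment does, that CPU time accumulates as wall time multiplied by the number of busy slots; the function name cpu_hours is hypothetical.

```python
from datetime import timedelta

# Values taken from Table 1 (short.q).
WALL_TIME_LIMIT = timedelta(hours=2)
QUEUE_CORES = 64  # cores used in the table's own calculation


def cpu_hours(wall_time, slots):
    """CPU hours consumed, assuming every slot is busy for the whole run."""
    return wall_time.total_seconds() / 3600 * slots


# Two hours on all 64 cores gives the CPU time limit of the queue.
print(cpu_hours(WALL_TIME_LIMIT, QUEUE_CORES))  # 128.0

# A job at the per-user maximum of 32 slots uses at most 64 CPU hours in
# two hours, so it reaches the wall time limit before the CPU time limit.
print(cpu_hours(WALL_TIME_LIMIT, 32))  # 64.0
```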

Table 2. default.q limits

Item                                      Limit             Comment
Queue type                                Batch processing  No interactive user sessions allowed
Wall time                                 Not set           No time limit is set
CPU time                                  Not set           No time limit is set
Maximum user allocatable slots            320 cores         A single user can allocate up to 320 slots, so the maximum parallelism allowed in this queue is 320
Maximum research group allocatable slots  576 cores         Researchers in the same research group can allocate up to 576 processors on this queue

Table 3. inter.q limits

Item                                      Limit             Comment
Queue type                                Interactive       User interaction managed by the scheduler
Wall time                                 4 hours           Every user can have up to 4 hours of interactive session on the node
CPU time                                  64 hours          Every user can have up to 64 CPU hours of interactive session on the node
Minimum slots available                   16 cores          In case of heavy cluster usage, at least 16 processors are always reserved for this queue
Maximum user allocatable slots            4 cores           A single user can allocate up to 4 processors in a single interactive session
Maximum research group allocatable slots  8 cores           Researchers in the same research group can allocate up to 8 processors on this queue


The slot limits are modeled as a resource quota set, as shown below:

{
   name         maxslots
   description  "Max slots per user"
   enabled      TRUE
   limit        users {*} queues short.q to slots=32
   limit        projects {*} queues short.q to slots=256
   limit        users {*} queues default.q to slots=256
   limit        projects {*} queues default.q to slots=576
   limit        queues !inter.q to slots=688
}
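To look up the slot limit that applies to a user or a research group (a project, in the scheduler's terms), the rules of the quota set can be parsed with a short script. This is a sketch under stated assumptions: the names parse_rules and slot_limit are hypothetical, the regular expression only covers the rule shapes that appear in the listing above, and the scope label "all" is our own label for rules that name no users or projects. In resource quota syntax, {*} applies the limit to each user or project separately, and a leading ! excludes the named queue.

```python
import re

# Matches only the rule shapes used in the listing above.
RULE = re.compile(
    r"limit\s+(?:(users|projects)\s+\{\*\}\s+)?queues\s+(\S+)\s+to\s+slots=(\d+)"
)

QUOTA_SET = """
{
   name         maxslots
   description  "Max slots per user"
   enabled      TRUE
   limit        users {*} queues short.q to slots=32
   limit        projects {*} queues short.q to slots=256
   limit        users {*} queues default.q to slots=256
   limit        projects {*} queues default.q to slots=576
   limit        queues !inter.q to slots=688
}
"""


def parse_rules(text):
    """Return a list of (scope, queue, slots) tuples, one per limit rule."""
    rules = []
    for line in text.splitlines():
        match = RULE.search(line)
        if match:
            scope, queue, slots = match.groups()
            # Rules without users/projects get the hypothetical scope "all".
            rules.append((scope or "all", queue, int(slots)))
    return rules


def slot_limit(rules, scope, queue):
    """Return the slot limit for a scope and queue, or None if no rule matches."""
    for rule_scope, rule_queue, slots in rules:
        if rule_scope == scope and rule_queue == queue:
            return slots
    return None


rules = parse_rules(QUOTA_SET)
print(slot_limit(rules, "users", "short.q"))       # 32
print(slot_limit(rules, "projects", "default.q"))  # 576
```

The last rule is stored with the queue written as !inter.q, which means every queue except inter.q. A complete lookup would need to expand that exclusion, which this sketch does not attempt.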