4.2. Submitting basic jobs

The basic command to submit a job is qsub. Unless you specify otherwise, the scheduler does its best to guess the queue in which the job should be registered.

ijimenez@login:~$ qsub sleeper.sh 
Your job 153624 ("Sleeper") has been submitted
ijimenez@login:~$ 

However, there is an exception when submitting jobs to the Intel processor nodes, because they only support the intel.q queue. It is therefore compulsory to request this queue with the -q option of qsub.

ijimenez@login:~$ qsub -q intel.q Tasa.sh 
Your job 673725 ("Tasa") has been submitted
ijimenez@login:~$

Once the job is registered, the scheduler reports its job ID. Keep it handy: if there are problems, you will need this value to debug what happened.
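
If you submit jobs from a script, you can capture the job ID instead of copying it by hand. The following is a minimal sketch that parses the message format shown above; the function name extract_job_id is made up for this example, and array jobs print a slightly different message that this pattern does not handle. Many Grid Engine versions also accept "qsub -terse", which prints only the job ID, but check "man qsub" on the cluster before relying on it.

```shell
#!/bin/bash
# Extract the numeric job ID from the message qsub prints on success.
# (extract_job_id is a hypothetical helper, not part of Grid Engine.)
extract_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Demonstration on the message shown in the example above.
message='Your job 153624 ("Sleeper") has been submitted'
job_id="$(printf '%s\n' "$message" | extract_job_id)"
echo "$job_id"
```

On the cluster, the same helper would be used as: job_id="$(qsub sleeper.sh | extract_job_id)"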

Although you can specify the options when calling qsub, when submitting a job execution request it is strongly advised that all options be provided within a job definition file (in these examples the file will be called "job.sge"). This file will contain the command you wish to execute and any Grid Engine resource request options that you need. 

ijimenez@login:~$ vi job.sge

All Grid Engine options are preceded by the string "#$ ", and all other lines in the file are executed by the shell (the default shell is /bin/bash):

#!/bin/bash
#$ -N Test
#$ -q short.q
#$ -cwd
uname -a

The "-N" option sets the name of the job. This name is used to create the output log files for the job. We recommend starting the job name with a capital letter in order to distinguish these log files from the other files in your working directory. This makes it easier to delete the log files later.

The "-q" option requests the queue in which the job should run. The "-cwd" option instructs Grid Engine to execute your job from the directory in which you submit it (using qsub). This is recommended because it eliminates the need to "cd" to the correct sub-directory for job execution, and it ensures that your log files will be located in the same directory as your job definition file.

Remember to point the "-q" parameter at intel.q to use the Intel processor nodes.

#!/bin/bash
#$ -N Test
#$ -q intel.q
#$ -cwd
uname -a
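
Before submitting, it can help to double-check which options a job definition file actually carries. A minimal sketch, assuming the "#$ " prefix described above; show_job_options is a name invented for this example, and the file is a throwaway copy of the job definition shown above:

```shell
#!/bin/bash
# Print the Grid Engine options embedded in a job definition file,
# one per line, without the leading "#$ " marker.
# (show_job_options is a hypothetical helper.)
show_job_options() {
    sed -n 's/^#\$ //p' "$1"
}

# Demonstration on a temporary copy of the job file.
demo_file="$(mktemp)"
cat > "$demo_file" <<'EOF'
#!/bin/bash
#$ -N Test
#$ -q intel.q
#$ -cwd
uname -a
EOF
show_job_options "$demo_file"
```

Lines that do not start with "#$ ", such as the "#!/bin/bash" line and the command itself, are ignored.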

We can monitor how our job is doing with the qstat command:

ijimenez@login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
153646 0.00000 Test       ijimenez     qw    12/18/2013 17:12:00                                    1       
ijimenez@login:~$ 

ijimenez@login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 673734 0.62465 Tasa       ijimenez         r     04/15/2015 17:45:36 intel.q@node14                     1    

The qstat output shows the ID of our job (job-ID), its priority (prior), its name (name), who launched it (user), the state of the job (state), when it was submitted or started (submit/start at), where it has been registered (queue), and how many slots it is using (slots). The most informative column is state, as it shows what the job is actually doing.

Most common states are described in the following table:

Fig 1. Description of job states

Abbreviation  State          Comment
qw            Queue waiting  The job is waiting to be assigned to a queue.
t             Transferring   The job has been assigned to a queue and is being transferred to one or more execution hosts.
r             Running        The job is running on the execution host.
E             Error          The job has failed for some reason and remains registered in an error state. Error output is sent to the file specified with the -e flag, or to the default error file otherwise.
h             Hold           The job is on hold for some reason. The most common one is that it is waiting for another job to finish.
R             Restarted      The job has been restarted for some reason. The most common cause is an error on the execution host, after which the job is sent to another execution host to be processed again.
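
When you have many jobs registered, a per-state summary is easier to read than the full listing. A minimal sketch, assuming the qstat layout shown above (two header lines, state in the fifth column); count_job_states and sample_qstat are names invented for this example, and the sample rows are the ones from the listings above:

```shell
#!/bin/bash
# Count jobs per state from qstat output read on standard input.
# Skips the two header lines and takes the state from column 5.
count_job_states() {
    awk 'NR > 2 && NF > 0 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the qstat listings above.
sample_qstat() {
    cat <<'EOF'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 153646 0.00000 Test       ijimenez     qw    12/18/2013 17:12:00                                    1
 673734 0.62465 Tasa       ijimenez     r     04/15/2015 17:45:36 intel.q@node14                     1
EOF
}

sample_qstat | count_job_states
```

On the cluster, the equivalent call would be: qstat | count_job_states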

Job run time

Each queue has specific policies that enforce how long jobs are allowed to execute: short* queues allow up to 2 hours, medium* queues allow up to 24 hours, and long* queues have no runtime limit. When you submit a job, you are either implicitly or explicitly indicating how long the job is expected to run. One way is to indicate the maximum runtime explicitly, using the following format:

#!/bin/bash
#$ -N Test
#$ -cwd
#$ -l s_rt=04:30:00
#$ -l h_rt=05:00:00
uname -a

This way, the scheduler checks which of our queues match the job requirements and sends the job to the most appropriate one. In this case, the job is processed by default.q, because we are requesting a soft limit of 4 hours 30 minutes and a hard limit of 5 hours, which exceeds the short.q threshold:

ijimenez@login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                slots ja-task-ID
--------------------------------------------------------------------------------------------------------
153662 0.50000 Test       ijimenez     r     12/18/2013 17:51:08 default.q@node10        1  
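
The gap between the soft and the hard limit gives the job a chance to save its work before it is killed. A minimal sketch, assuming the soft-limit warning arrives as SIGUSR1 (the usual Grid Engine behaviour for s_rt, but worth confirming in the queue_conf man page on the cluster); the checkpoint file name, the handler name and the three-step loop are placeholders invented for this example:

```shell
#!/bin/bash
#$ -N Test
#$ -cwd
#$ -l s_rt=04:30:00
#$ -l h_rt=05:00:00

# Hypothetical checkpoint file name, chosen for this sketch only.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"
stop_requested=0

# Runs when the soft-limit signal arrives: record the progress and ask
# the main loop to stop before the hard limit ends the job.
on_soft_limit() {
    echo "soft limit reached at step ${step:-0}" > "$CHECKPOINT_FILE"
    stop_requested=1
}
trap on_soft_limit USR1   # assumption: s_rt is signalled with SIGUSR1

for step in 1 2 3; do
    [ "$stop_requested" -eq 1 ] && break
    echo "working on step $step"   # placeholder for the real work
done
```

Note that bash runs a trap handler only after the command currently in the foreground has finished, so a single very long command leaves little room to react.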

The main flags for controlling job execution time are shown below:

Fig 2. Parameters to control execution time

Flag               Request                                 Comment
-l s_rt=hh:mm:ss   Sets a soft walltime limit of hh:mm:ss  The system sends the job a soft (warning) signal if the execution time limit of hh:mm:ss is exceeded.
-l h_rt=hh:mm:ss   Sets a hard walltime limit of hh:mm:ss  The execution time limit for the job is set to hh:mm:ss. If it is exceeded, the job fails and is killed.
-l s_cpu=hh:mm:ss  Sets a soft CPU time limit of hh:mm:ss  The system sends the job a soft (warning) signal if the CPU time limit of hh:mm:ss is exceeded.
-l h_cpu=hh:mm:ss  Sets a hard CPU time limit of hh:mm:ss  The CPU time limit for the job is set to hh:mm:ss. If it is exceeded, the job fails and is killed.
-q <queue_name>    Requests a specific queue for the job   Forces the scheduler to register the job in the given queue. If the job requirements do not fit the queue, the job stays in the qw state indefinitely unless it is deleted.
  • Walltime: the 'real time' (elapsed time) during which a job is running.
  • CPU time: the time the job actually spends executing on the CPUs.
  • If a job is not parallelized, the walltime and the CPU time are roughly the same; but if a given program launches N threads, the CPU time is the real time spent in execution multiplied by the N threads launched.
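
The relation between walltime and CPU time can be turned into a quick calculation when choosing a CPU limit for a threaded job. A minimal sketch, assuming every thread is busy for the whole run (so the result is an upper bound); all three function names are invented for this example:

```shell
#!/bin/bash
# Convert hh:mm:ss to seconds (10# avoids octal parsing of "08" and "09").
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Convert seconds back to hh:mm:ss.
seconds_to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# CPU time consumed by a job that keeps N threads busy for a given walltime.
cpu_time_for() {
    local walltime="$1" threads="$2"
    seconds_to_hms $(( $(hms_to_seconds "$walltime") * threads ))
}

# A 5-hour run with 4 threads can consume up to 20 hours of CPU time.
cpu_time_for 05:00:00 4
```

With these figures, a job limited to h_rt=05:00:00 that runs 4 threads would need an h_cpu of about 20:00:00 to avoid being stopped by the CPU limit first.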

Redirecting output and error files

By default, unless specified otherwise, the scheduler redirects the output of any job you launch to two files placed in your $HOME, called <job_name>.e<job_id> and <job_name>.o<job_id>. After a few executions and tests, your $HOME will probably look like this:

ijimenez@login:~$ ls
examples             hostname.sh.o153601  Program settings  Sleeper.e153624  Sleeper.o153624  Test.e153646  Test.e153662  Test.o153648
__hostname.err       job.sge              scripts           Sleeper.e153625  Sleeper.o153625  Test.e153647  Test.o153627  Test.o153649
__hostname.out       Matlab               Sleeper.e153622   Sleeper.o153622  sleeper.sh       Test.e153648  Test.o153646  Test.o153662
hostname.sh.e153601  programari           Sleeper.e153623   Sleeper.o153623  Test.e153627     Test.e153649  Test.o153647  user-scripts
ijimenez@login:~$ 
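
Because the default log files follow the <job_name>.e<job_id> and <job_name>.o<job_id> pattern, they are easy to pick out with a shell glob, which is where a capitalized job name pays off. A minimal sketch; list_job_logs is a name invented for this example, and the demonstration runs on a temporary directory rather than on a real $HOME:

```shell
#!/bin/bash
# Print the default log files of the jobs named "$1" found in the
# directory "$2" (defaults to $HOME). (Hypothetical helper.)
list_job_logs() {
    local name="$1" dir="${2:-$HOME}"
    local f
    for f in "$dir/$name".[eo][0-9]*; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Demonstration on a throwaway directory with a few of the files above.
demo_dir="$(mktemp -d)"
touch "$demo_dir/Test.o153646" "$demo_dir/Test.e153646" "$demo_dir/sleeper.sh"
list_job_logs Test "$demo_dir"
```

Review the list before deleting anything; once you are satisfied, the same pattern can be passed to rm.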

As a general rule, you are advised to use the following flags to redirect the output and error files:

Fig 3. Redirecting output and error files

Flag                  Request                                         Comment
-e <path>/<filename>  Redirect error file                             The system creates the given file in the specified path and redirects the job's error output to it. If no name is specified, the default name applies.
-o <path>/<filename>  Redirect output file                            The system creates the given file in the specified path and redirects the job's output to it. If no name is specified, the default name applies.
-cwd                  Send error and output to the working directory  The output file and the error file are placed in the directory from which 'qsub' is called.

Of course, we can place these options in our job definition file:

ijimenez@login:~/examples$ vi job.sge 
#!/bin/bash
#$ -N Test
#$ -cwd
#$ -l s_rt=04:30:00
#$ -l h_rt=05:00:00
#$ -e $HOME/examples/uname/error
#$ -o $HOME/examples/uname/output
uname -a

When the job runs, we will see the output file created in $HOME/examples/uname/output. If no error is reported, an empty error file is created.

ijimenez@login:~/examples/uname$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
153670 0.50000 Test       ijimenez     r     12/18/2013 18:19:08 default.q@node10                   1 
ijimenez@login:~/examples/uname$ cd output/
ijimenez@login:~/examples/uname/output$ cat Test.o153670
Linux node10 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux
ijimenez@login:~/examples/uname/output$ 
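
The example above relies on the output and error directories already being there. A minimal sketch for creating them before submitting, on the assumption that Grid Engine does not create missing log directories by itself (verify this on the cluster; a job whose log directory is missing typically ends up in the error state); prepare_log_dirs is a name invented for this example:

```shell
#!/bin/bash
# Create the "output" and "error" directories used by the -o and -e
# options under the given base directory. (Hypothetical helper.)
prepare_log_dirs() {
    local base="$1"
    mkdir -p "$base/output" "$base/error"
}
```

For the job definition above, this would be called as: prepare_log_dirs "$HOME/examples/uname" && qsub job.sge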

Sending notifications

These examples run in seconds, but you may be running jobs and simulations for hours, days, or even weeks. The Snow cluster is equipped with an exim4 email system, and you can configure your jobs to send you notifications when they finish or if something goes wrong:

Fig 4. Sending notifications

Flag              Request                                   Comment
-M <valid_email>  Enable mail and send it to <valid_email>  The system sends an email to the given address when any of the events selected with -m occurs.
-m b              Begin                                     Mail is sent at the beginning of the job, once the job enters the 'r' state.
-m e              End                                       Mail is sent at the end of the job, when the job unregisters from the scheduler and no error is reported (no 'E' state).
-m a              Aborted or rescheduled                    Mail is sent if the job is aborted or rescheduled (the 'E' state appears).
-m s              Suspended                                 Mail is sent if the job is suspended (usually by a user with higher privileges; the 's' state appears).

Of course, we can add these options to our job definition and submit it:

#!/bin/bash
#$ -N Test
#$ -cwd
#$ -l s_rt=04:30:00
#$ -l h_rt=05:00:00
#$ -e $HOME/examples/uname/error
#$ -o $HOME/examples/uname/output
#$ -M ivan.jimenez@upf.edu
#$ -m bea
uname -a

ijimenez@login:~/examples/uname$ qsub job.sge
Your job 153676 ("Test") has been submitted

Deleting and modifying jobs

We can modify the requirements of a job while it's waiting to be processed. Once it's

Fig 5. Deleting a job

Command        Request         Comment
qdel <job_id>  Delete the job  The system removes the job and all its dependencies from the queues and the execution hosts.
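
When several jobs have to be removed, a small wrapper avoids typing qdel once per job and guards against passing something that is not a job ID. A minimal sketch; delete_jobs and the DRY_RUN variable are invented for this example and are not part of Grid Engine:

```shell
#!/bin/bash
# Delete every job ID given as an argument. With DRY_RUN=1 the qdel
# commands are only printed, which is useful for checking the list first.
delete_jobs() {
    local id
    for id in "$@"; do
        case "$id" in
            ''|*[!0-9]*)
                echo "skipping invalid job ID: $id" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "qdel $id"
        else
            qdel "$id"
        fi
    done
}

# Dry run with the job ID from the last example.
DRY_RUN=1 delete_jobs 153676
```

Without DRY_RUN=1 the function calls qdel for each valid ID, so it only works on a host where the Grid Engine commands are available.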