4.5. Monitoring

Monitoring jobs

We can monitor our jobs with the qstat command. If we call qstat without arguments, it will show the state of the jobs for the current user:

ijimenez@login:~/Matlab/Mandelbrot$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 154084 0.47598 Mandelbrot ijimenez     r     01/07/2014 15:37:59 short.q@node11                     6        
ijimenez@login:~/Matlab/Mandelbrot$ 
 
The following table describes the fields in the qstat output:
 

Fig 1. qstat information

Field | Description | Comment
------------------------------------------------------------
job-ID | Job ID | Numerical identifier of the job.
prior | Priority | Priority of the job as computed by the scheduler, shown as a normalized value (0.47598 in the example above); jobs with higher values are dispatched first. Cluster administrators use it to prioritize jobs on a given queue, much as the Linux 'nice' command prioritizes processes.
name | Job name | Name given to the job when it was submitted.
user | Job owner | Name of the user who owns the job.
state | Job status | Current state of the job. The available states are listed in Fig 2.
submit/start at | Submit or start time | Date and time at which the job was submitted (pending jobs) or started (running jobs). Registering a job does not mean executing it: once the job is registered, it must wait for the requested resources to be free.
queue | Queue | Queue where the job runs.
slots | Slots | Number of processors used by the job.
ja-task-ID | Array job task ID | Task ID. Only shown for array jobs.
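
As an illustration of how these fields can be used, the following is a minimal sketch that reads the default qstat output and prints only the job ID, name and state of each job. The function name list_jobs is hypothetical, and the sketch assumes the column order shown above and job names without spaces.

```shell
#!/bin/sh
# list_jobs (hypothetical helper): read default qstat output on stdin
# and print "job-ID name state" for every job line.
# Assumption: columns appear in the order job-ID, prior, name, user, state.
list_jobs() {
    # Job lines are the only ones whose first field is purely numeric,
    # so the column titles and the dashed rule are skipped automatically.
    awk '$1 ~ /^[0-9]+$/ { print $1, $3, $5 }'
}
```

It would typically be called as "qstat | list_jobs". Fields after the state are not parsed here because the queue column is empty for pending jobs, which shifts the positions of the later columns.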
 
 

Fig 2. Description of job states

Abbreviation | State | Comment
------------------------------------------------------------
qw | Queue waiting | The job is waiting to be assigned to a queue.
t | Transferring | The job has been assigned to a queue and is being transferred to one or more execution hosts.
r | Running | The job is running on the execution host.
E | Error | The job has failed for some reason. Error output is sent to the file specified with the -e flag, or to the default error file otherwise.
h | Hold | The job is being held for some reason. The most common reason is that it is waiting for another job to finish.
R | Restarted | The job has been restarted for some reason. The most common reason is an error on the execution host, after which the job is sent to another host.
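
The state abbreviations can be translated back into readable names in scripts. The following is a minimal sketch of such a lookup; the function name describe_state is hypothetical, and only the states listed above are covered. qstat may also print combined states (for example Eqw), which this sketch simply reports as unrecognized.

```shell
#!/bin/sh
# describe_state (hypothetical helper): print the state name for one
# of the abbreviations listed in the table above.
describe_state() {
    case "$1" in
        qw) echo "Queue waiting" ;;
        t)  echo "Transferring" ;;
        r)  echo "Running" ;;
        E)  echo "Error" ;;
        h)  echo "Hold" ;;
        R)  echo "Restarted" ;;
        # Anything else, including combined states, is not decoded here.
        *)  echo "Unrecognized state: $1" ;;
    esac
}
```

For example, "describe_state qw" prints "Queue waiting".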
 
The qstat command accepts several arguments:
 
Fig 3. qstat command options
 
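Option | Description | Comment
------------------------------------------------------------
-f | Show full output | Shows full information on all queues and the jobs using their slots.
-F | Show full output with resources | Like -f, and additionally shows resource availability information for each queue.
-u <user> | Show jobs for a given user | Shows the jobs of the given user.
-j <job_id> | Show scheduling information | Shows scheduling information for the given job_id.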
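
As an example of combining these options in a script, the following minimal sketch uses "qstat -u" to look up the state of a single job. The function name job_state is hypothetical, and the sketch assumes the default column order, where the state is the fifth field.

```shell
#!/bin/sh
# job_state (hypothetical helper): print the state of one job.
# Usage: job_state <job_id> [user]
# The user defaults to the current user, as reported by "id -un".
job_state() {
    job_id=$1
    user=${2:-$(id -un)}
    # Ask qstat only for this user's jobs and keep the first line whose
    # job-ID matches. Nothing is printed if the job is not listed.
    qstat -u "$user" | awk -v id="$job_id" '$1 == id { print $5; exit }'
}
```

With the job shown at the beginning of this section, "job_state 154084 ijimenez" would print "r" while the job is running. For an array job, several lines share the same job-ID and only the first one is reported. To see why a pending job has not started yet, "qstat -j 154084" shows its scheduling information.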