Job Control

Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; for their documentation, please visit https://docs.computecanada.ca.


All production-class jobs must be submitted via the scheduler, which manages the available computing resources and assigns them to the waiting jobs. The scheduler used on all ACENET clusters is Sun Grid Engine (SGE).

Main commands

The three most important SGE commands are:

qsub
Submits a batch job
qstat
Shows the status of jobs and queues
qdel
Deletes or kills a job

Submitting a simple job

Write a script that describes your job. Here's a trivial example:

#$ -cwd
#$ -j y
#$ -l h_rt=00:03:00

echo Hello from inside a Grid Engine job running on `hostname`
echo Job beginning at `date`
sleep 120
echo Job ending at `date`

Note that this is just a shell script: a list of commands (echo, sleep) to be executed in order. The comment lines beginning with #$ provide extra information to the job scheduler; they are described under Parameters below. The default execution shell is bash unless specified otherwise with the -S option.

Save the script with some name, like trivial.sh. Then submit the script to the scheduler by typing

 $ qsub trivial.sh

The system will reply something like

Your job 7635 ("trivial.sh") has been submitted

and queue up your job to wait its turn. The number is called the JOB_ID and will be different for every job.
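
Scripts that submit jobs often need the JOB_ID for later qstat or qdel calls. The following is a minimal sketch, assuming the reply has exactly the form shown above; the function name extract_job_id is hypothetical and not part of Grid Engine.

```shell
# Hypothetical helper: pull the numeric JOB_ID out of a qsub reply.
# Assumes the reply looks like: Your job 7635 ("trivial.sh") has been submitted
extract_job_id() {
    # Print the digits that follow "Your job "; print nothing if the text does not match.
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

# Example using the sample reply from this page (no scheduler needed):
reply='Your job 7635 ("trivial.sh") has been submitted'
extract_job_id "$reply"   # prints: 7635
```

On a cluster, the same helper could be fed the real reply, for example reply=$(qsub trivial.sh) followed by job_id=$(extract_job_id "$reply"). Replies in any other format produce an empty result, which the calling script should check for.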

If you are looking for a parallel job script example, refer to the Parallel Jobs page.

Monitoring jobs

How can you tell if your job has run? The usual way is with

$ qstat

The output from qstat is very wide and looks something like this:

job-ID  prior   name     user  state   submit/start at     queue          slots  ja-task-ID
-------------------------------------------------------------------------------------------
  7635  0.5404  trivial  jdoe    r     11/18/2011 23:16:10 short.q@cl061     1

While your job is waiting to run there will be "qw" in the state column. When it starts to run it changes to "r". When the job ends, whether it finished or crashed, it disappears from the list, and qstat will return nothing at all unless you have other jobs submitted.
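
The state of a single job can be picked out of the qstat listing with a short filter. This is a sketch only, assuming the column layout shown above (state in the fifth field); the function name job_state and the word "gone" are hypothetical choices, not Grid Engine output.

```shell
# Hypothetical helper: report the state column for one job from qstat-style text.
# Assumes the column layout shown above, where the fifth field is the state.
job_state() {
    # $1 = JOB_ID; the qstat output is read from standard input.
    awk -v id="$1" '$1 == id { print $5; found = 1 } END { if (!found) print "gone" }'
}

# Sample listing copied from this page, so the example runs without a scheduler.
sample='job-ID  prior   name     user  state   submit/start at     queue          slots  ja-task-ID
-------------------------------------------------------------------------------------------
  7635  0.5404  trivial  jdoe    r     11/18/2011 23:16:10 short.q@cl061     1'

printf '%s\n' "$sample" | job_state 7635   # prints: r
printf '%s\n' "$sample" | job_state 9999   # prints: gone
```

On a cluster the input would come from qstat itself, as in qstat | job_state 7635. A result of "gone" means only that the job is no longer listed; it does not say whether the job finished or crashed.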

When the job is done, the output appears in a file with a name like trivial.sh.o7635. The components of the output file name are the job name (trivial.sh in our example), the job id (7635), and between them ".o" for "output".

There are other utilities that can show you some useful information about your jobs:

  • qsum, for a simplified view of the entire system load
  • showq, for insight into when your job might begin running
  • More on qstat, including the meaning of job status codes
  • qacct, for data about a finished job (memory used, run time, error codes, etc.)

Deleting jobs

If you want to remove your job from the queue, whether it's running or waiting, you can use the qdel command. You can delete one or more jobs by specifying their names or job IDs like so:

$ qdel job_name1 job_id2

If you delete a job by name and several of your jobs share that name, all of them will be deleted. To target a single job, use its job ID, which is unique.

You can delete all of your jobs by using a wildcard like so:

$ qdel "*"

If you are running an array job and want to delete only one task, then you need to specify a job ID as well as a task ID separated with a full stop, like so:

$ qdel job_id.task_id

You can also delete a range of tasks like so:

$ qdel job_id.task_id1-task_id2

You can find a task ID in the qstat output.
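
The job_id.task_id notation above is easy to mistype in scripts, so it can help to build it in one place. A minimal sketch follows; the function name array_task_target and the job and task numbers are hypothetical, and the function only prints the argument without calling qdel.

```shell
# Hypothetical helper: build the argument that selects array tasks for qdel.
# Usage: array_task_target JOB_ID FIRST_TASK [LAST_TASK]
array_task_target() (
    job_id=$1
    first=$2
    last=${3:-}
    if [ -n "$last" ]; then
        # A range of tasks: job_id.task_id1-task_id2
        printf '%s.%s-%s\n' "$job_id" "$first" "$last"
    else
        # A single task: job_id.task_id
        printf '%s.%s\n' "$job_id" "$first"
    fi
)

array_task_target 7635 4      # prints: 7635.4
array_task_target 7635 4 9    # prints: 7635.4-9
```

The result would then be passed on, for example qdel "$(array_task_target 7635 4 9)".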

Finally, if you cannot delete your job and it has been stuck in the d state for a long time, you can force its deletion by providing the -f option to qdel. However, before doing so, please read the relevant section in our FAQ.

Parameters

Complete parameter list: Grid Engine

Here are the most commonly-used job parameters:

Option         Description
-------------  -----------
-l h_rt=time   Run time limit either in seconds or in hh:mm:ss format
-l h_vmem=mem  Hard virtual memory limit; mem specifier may include k, K, m, M, g, G; details at man queue_conf
-cwd           Start the job script in the same directory it was submitted from, the "current working directory". If absent, job will start in your home directory.
-j y           Join the stderr output stream to the stdout stream. Error messages will be mixed in with the job script standard output. If absent then standard error will go into job_name.ejob_id
-N name        Assigns a name to the job other than the name of the job script
-o file        Redirects the standard output to the named file
-S shell       Shell to interpret the job script: /bin/bash (default) or /bin/csh
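
As an illustration, a job script combining several of these options might look like the sketch below. The job name hello_job, the log file name, and the time and memory values are arbitrary placeholders rather than recommendations. JOB_NAME and JOB_ID are environment variables that Grid Engine sets inside a running job; the fallback values only matter when the script is run by hand for testing.

```shell
#$ -cwd
#$ -j y
#$ -N hello_job
#$ -o hello_job.log
#$ -l h_rt=00:05:00
#$ -l h_vmem=1G
#$ -S /bin/bash

# Build a one-line banner identifying the job and the host it runs on.
banner="Job ${JOB_NAME:-hello_job} (id ${JOB_ID:-none}) running on $(hostname)"
echo "$banner"
echo "Started at $(date)"
```

Because -j y and -o are both given, standard output and error messages would end up together in hello_job.log instead of the default job_name.ojob_id file.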

Every job must be submitted with a run time limit, h_rt. This is a hard limit, which means your job will be killed after it has been running for that length of time, so you should give yourself a margin of error. If you really don't know what run time to set, 48 hours is an acceptable choice. All other parameters are optional.
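
One way to pick a margin of error is to pad an estimated run time by a fixed percentage and convert the result to the hh:mm:ss form. The sketch below does only that arithmetic; the function name h_rt_with_margin and the default margin of 20 percent are assumptions for illustration, not ACENET policy.

```shell
# Hypothetical helper: pad an estimated run time and print it as h:mm:ss.
# Usage: h_rt_with_margin ESTIMATED_SECONDS [MARGIN_PERCENT]
h_rt_with_margin() (
    est=$1
    pct=${2:-20}    # default margin of 20 percent is an arbitrary assumption
    # Add the margin, rounding it up to the next whole second.
    total=$(( est + (est * pct + 99) / 100 ))
    printf '%d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
)

h_rt_with_margin 3600 25     # prints: 1:15:00
h_rt_with_margin 172800 0    # prints: 48:00:00
```

The printed value can be used directly in a directive or on the command line, for example qsub -l h_rt="$(h_rt_with_margin 3600 25)" trivial.sh.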

There are three ways to set a parameter or supply an option to a job:

  1. With #$ directives inside the job script, as shown above
  2. With flags to qsub when the job is submitted
  3. With flags to qalter while the job is waiting to run

The second method follows this pattern:

$ qsub -l h_rt=0:1:0 trivial.sh

Options to the qsub command override any conflicting options set with directives inside the job script trivial.sh. So the job in this example will initially have a run-time limit of one minute (0:1:0) regardless of what is given inside the script. Note that when using qsub, the script name (and any arguments to the script) must appear after all the Grid Engine flags.

The third method follows this pattern:

$ qalter -l h_rt=0:2:0 job_id

After the qalter command the run time limit will change to 2 minutes, but this will only have an effect if the job has not yet started. Please note that

  • Changing a parameter on a job that is already executing, for example to give it more time or more memory, has no effect.
  • You must re-supply the h_rt and any other arguments to -l when you use qalter.
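
To make the second point concrete: if a waiting job was submitted with both h_rt and h_vmem and only one of them needs to change, both still have to be given to qalter. The sketch below is a dry run that prints such a command without running it. The function name qalter_command and the resource values are hypothetical, and it assumes the comma-separated form of the -l resource list described in the Grid Engine manual pages.

```shell
# Hypothetical dry-run helper: print a qalter command that re-supplies every -l resource.
# Usage: qalter_command JOB_ID RESOURCE=VALUE [RESOURCE=VALUE ...]
qalter_command() (
    job_id=$1
    shift
    # Join the remaining arguments (resource=value pairs) with commas.
    resources=$(IFS=,; printf '%s' "$*")
    printf 'qalter -l %s %s\n' "$resources" "$job_id"
)

qalter_command 7635 h_rt=0:2:0 h_vmem=2G   # prints: qalter -l h_rt=0:2:0,h_vmem=2G 7635
```

Printing the command first makes it easy to check that no resource was left out before running it against a waiting job.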

Further reading