The AVIDD-B Cluster

PBS Professional

PBS Professional is the resource manager on AVIDD-B. Tools for job submission and management are available in /usr/pbs/bin, and most of them have associated man pages. Documentation is also available in PDF format.

The PBS paradigm, in a nutshell, is as follows: users submit jobs to various queues on the system, each separate queue representing a group of resources with attributes necessary for the queue's jobs. Commonly used PBS tools include qsub, for job submission; qstat, for monitoring the status of jobs; and qdel, to terminate jobs prior to completion. More detailed information regarding these commands and others is available below, or in the documentation described earlier.

Policy

Interactive jobs, roughly defined as jobs that are not managed by PBS and that therefore run on the user nodes, are limited to 20 minutes of CPU time. Any job requiring more than 20 minutes should be submitted to a PBS queue and run on the cluster compute nodes. Monitoring scripts on the user nodes will kill processes exceeding 20 minutes of wallclock time.

Queues

AVIDD-B compute node resources are divided into three queues:

  • bg - the "background queue", AVIDD-B compute nodes bc03-bc96
  • fastq - "fast queue", AVIDD-B compute nodes bc01 and bc02

Jobs submitted from the AVIDD-B login nodes (bh1, bh2, or bh3) are automatically sent to bg. The name bg, short for "background queue", is a remnant of an early AVIDD queue configuration; the queue might as easily be thought of as the "AVIDD-B queue". The fastq queue is intended for test jobs requiring 30 minutes of walltime or less.

Jobs

Scripts

PBS most commonly handles job scripts, although interactive jobs are also supported. A job script may be as simple as a bash or tcsh shell script, but it may also include a number of PBS job directives. Because PBS executes a job script under your preferred login shell by default, it is always recommended that the script begin with a "sha-bang" line specifying which command interpreter it should run under. For example:

#!/bin/bash

PBS directives, which are lines beginning with the string #PBS, include switches for specifying such useful information as walltime required to complete the job, number of nodes and processors necessary, and filenames for job output and error. An example PBS job script might look like this:

#!/bin/bash
#PBS -k o
#PBS -l nodes=4:ppn=2,cput=4:00:00,walltime=30:00
#PBS -M username@indiana.edu
#PBS -m abe
#PBS -N JobName
#PBS -j oe
mpirun -np 8 -machinefile $PBS_NODEFILE ~/bin/binaryname

Line by line, this script says:
  1. use bash as the command interpreter for this script
  2. the job's standard output should be kept on the execution host
  3. this job requires 4 nodes, 2 processors per node, 4 hours of CPU time and 30 minutes of wall clock time
  4. send job-related email to username@indiana.edu
  5. send email if the job is aborted (a), begins (b) and ends (e)
  6. the job name is JobName
  7. standard output and standard error should be joined
  8. execute ~/bin/binaryname on 8 processors from the machines in $PBS_NODEFILE using mpirun
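
The -np value in the example above is typed by hand to match nodes=4:ppn=2. As a minimal sketch, the count could instead be derived from the node file. This assumes, which the text above does not state, that $PBS_NODEFILE contains one line per allocated processor; count_procs is a hypothetical helper name.

```shell
#!/bin/bash
# count_procs FILE: print the number of non-empty lines in FILE.
# Assumption: the PBS node file has one line per allocated processor,
# so this count is the value to pass to mpirun -np.
count_procs() {
    grep -c . "$1"
}

# Inside a job script this might be used as (not run here):
#   NP=$(count_procs "$PBS_NODEFILE")
#   mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ~/bin/binaryname
```

If the node file on a given system lists each node only once, the count would have to be multiplied by the ppn value instead.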

Submission

Submit jobs with the qsub command. If the command exits successfully, a job id will be returned to standard output. For example:
[jdoe@bh2 AVIDD]$ qsub job.script
123456.aviss.avidd.iu.edu
[jdoe@bh2 AVIDD]$

If you require attribute values different from the defaults, but less than the maximum allowed, specify these either in the job script with PBS directives, or on the command line with the -l switch. For example, to submit a job that needs more than the default two hours of walltime on AVIDD-B:

qsub -l walltime=10:00:00 job.script

There are two things to note here. First, command line arguments override directives in the job script. Second, you may specify many attributes on the command line, either as comma-separated options following a single -l switch, or each with its own -l switch. The following two commands are equivalent:

qsub -l cput=01:30:00,ncpus=16,mem=1024mb job.script

qsub -l cput=01:30:00 -l ncpus=16 -l mem=1024mb job.script
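
When qsub is called from a wrapper script, the comma-separated form and the job id printed on standard output are both easy to handle with plain bash. The following is a sketch only; join_resources, job_number, and submit_job are hypothetical helper names, not part of PBS.

```shell
#!/bin/bash
# join_resources ARG...: join resource requests with commas,
# e.g. "cput=01:30:00" "ncpus=16" -> "cput=01:30:00,ncpus=16".
join_resources() {
    local IFS=,
    echo "$*"
}

# job_number JOBID: strip the server suffix from a full job id,
# e.g. "123456.aviss.avidd.iu.edu" -> "123456".
job_number() {
    echo "${1%%.*}"
}

# submit_job SCRIPT RESOURCE...: submit SCRIPT with a single -l list
# and pass through the job id that qsub writes to standard output.
# At least one RESOURCE is expected.
submit_job() {
    local script=$1
    shift
    qsub -l "$(join_resources "$@")" "$script"
}

# Example (needs a real PBS installation, so it is left commented out):
#   jobid=$(submit_job job.script cput=01:30:00 ncpus=16 mem=1024mb) || exit 1
#   echo "submitted job $(job_number "$jobid")"
```

Keeping the full job id in a variable is usually the safer choice, since qstat and qdel accept it as returned.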

Useful qsub switches include:

  • -q queue name (to specify non-default queues)
  • -r y|n (whether the job is rerunnable)
  • -a date_time (only execute the job after date_time)
  • -V (export environment variables in qsub command's environment to the job)
  • -I (run interactively, usually for testing purposes)

See the qsub man page or PDF documentation for more information.
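
The date_time argument to -a is written as [[[[CC]YY]MM]DD]hhmm[.SS]. As a small sketch, a more readable date can be converted to that form with GNU date; pbs_datetime is a hypothetical helper, and the availability of GNU date on the login nodes is an assumption.

```shell
#!/bin/bash
# pbs_datetime WHEN: convert a date string understood by GNU date
# into the CCYYMMDDhhmm form accepted by qsub -a.
# Assumption: GNU date (for the -d option) is available.
pbs_datetime() {
    date -d "$1" +%Y%m%d%H%M
}

# Example (left commented out because it needs PBS):
#   qsub -a "$(pbs_datetime 'tomorrow 02:00')" job.script
```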

Monitoring

The qstat command is useful for monitoring the status of a queued or running job. Switches include:
  • -u user_list (display jobs for users in user_list)
  • -a (display all jobs)
  • -r (display running jobs)
  • -f (display full listing of jobs, excessive detail)
  • -n (display nodes allocated to jobs)

For example, to see all the running jobs in the AVIDD-B bg queue, type this at an AVIDD shell prompt:

qstat -r bg | less
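
For scripted monitoring, qstat can also be polled for a single job id. The sketch below assumes that qstat exits with a non-zero status once the job is no longer known to the server, which may depend on the local configuration; wait_for_job is a hypothetical helper name.

```shell
#!/bin/bash
# wait_for_job JOBID [INTERVAL]: poll qstat until it no longer reports
# JOBID, sleeping INTERVAL seconds (default 60) between checks.
# Assumption: qstat exits non-zero once the job has left the queue.
wait_for_job() {
    local jobid=$1 interval=${2:-60}
    while qstat "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
}

# Example (needs PBS, so it is left commented out):
#   wait_for_job 123456.aviss.avidd.iu.edu 120
```

A generous polling interval keeps the load on the PBS server low.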

Deleting

You may delete queued or running jobs with the qdel command. Occasionally, a node will become unresponsive to the point that it cannot respond to the PBS server's requests that a job be killed. In that case, try adding the -W force option to qdel. Otherwise, contact High Performance Systems, hps-admin@iu.edu, for assistance.
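
The two-step approach described above can be sketched as a small wrapper. It assumes that qdel returns a non-zero exit status when the ordinary delete request fails, which is not confirmed here; delete_job is a hypothetical helper name.

```shell
#!/bin/bash
# delete_job JOBID: try a normal qdel first; if that fails, retry
# with -W force, as suggested for unresponsive nodes.
# Assumption: qdel exits non-zero when the delete request fails.
delete_job() {
    local jobid=$1
    if qdel "$jobid"; then
        return 0
    fi
    echo "qdel failed for $jobid; retrying with -W force" >&2
    qdel -W force "$jobid"
}

# Example (needs PBS, so it is left commented out):
#   delete_job 123456.aviss.avidd.iu.edu
```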

Error Codes

A PBS job that runs successfully exits with status 0. When a job exits with any other status, the list of PBS Error Codes may provide some clues as to why it did not complete successfully.
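
To make a failure easier to trace, a job script can log the exit status of its main command and pass it on as the script's own exit status. This is a sketch only; run_and_report is a hypothetical helper, and it assumes that the exit status of the job script is what PBS reports for the job.

```shell
#!/bin/bash
# run_and_report CMD [ARG...]: run a command, log its exit status to
# standard error, and return that same status to the caller.
run_and_report() {
    "$@"
    local status=$?
    echo "command '$*' exited with status $status" >&2
    return "$status"
}

# In a job script this might end with (not run here):
#   run_and_report mpirun -np 8 -machinefile "$PBS_NODEFILE" ~/bin/binaryname
#   exit $?
```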

Default and Maximum Configurations

Server-wide

  • Default Walltime: 2 hours
  • Default CPUs per job: 1
  • Default Nodes per job: 1

Queue-specific

iq
  • Maximum jobs: 1500
  • Maximum Walltime: 1080 hours (45 days)
  • Maximum jobs per user: 24
bg
  • Maximum jobs: 1500
  • Maximum Walltime: 816 hours (34 days)
  • Maximum jobs per user: 64
fastq
  • Maximum CPUs per job: 4
  • Maximum Nodes per job: 2
  • Maximum Walltime: 30 minutes
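
The fastq limits listed above can be checked before submission. The following sketch uses two hypothetical helpers, fits_fastq and choose_queue; picking fastq automatically is only an illustration, since that queue is intended for test jobs.

```shell
#!/bin/bash
# fits_fastq CPUS NODES MINUTES: succeed if the request is within the
# fastq limits listed above (4 CPUs, 2 nodes, 30 minutes of walltime).
fits_fastq() {
    local cpus=$1 nodes=$2 minutes=$3
    [ "$cpus" -le 4 ] && [ "$nodes" -le 2 ] && [ "$minutes" -le 30 ]
}

# choose_queue CPUS NODES MINUTES: print "fastq" for requests that fit
# its limits and "bg" otherwise.
choose_queue() {
    if fits_fastq "$@"; then
        echo fastq
    else
        echo bg
    fi
}

# Example (needs PBS, so it is left commented out):
#   qsub -q "$(choose_queue 2 1 20)" -l walltime=00:20:00 job.script
```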

PBS-related Tools

The local AVIDD tools allpbsnodes and whosusing generate reports on node allocation and on per-user job and CPU usage.

allpbsnodes

screenshot of allpbsnodes output

whosusing

screenshot of whosusing output

See the man pages for more information on these commands.