Using Sun Grid Engine (SGE) on the CGL Socrates Cluster

Sun Grid Engine (SGE) is an open source batch queueing system. SGE enables both the users and the computational cluster to get the most out of available resources. Users can treat the collection of compute nodes like a single system, and submit jobs from any of the nodes. SGE will automatically run jobs on less loaded nodes, regardless of which node the jobs were submitted on; SGE will also queue jobs for later execution to avoid overloading available resources and causing the entire system to operate poorly for everyone.

From the user perspective, there are two major components to a batch queue: the node hardware and the job execution queues. The hardware determines the available computational capacity. The queues define the system job-execution policy, e.g., how many jobs can run simultaneously and how much CPU time a job can consume. Typically there are multiple queues, each defining a different policy, and users can choose which queue to use for a particular job.

CGL Socrates Cluster Hardware

The CGL socrates cluster consists of five nodes: guanine, adenine, cytosine, thymine and uracil. Guanine is an Alpha GS1280 node with 32 faster CPUs and 64GB of memory; the other four are Alpha ES45 nodes with four CPUs and 16GB of memory. Guanine serves as the interactive node, which executes interactive commands from logged-in users. All five nodes are available for executing long-running jobs.

SGE Batch Queues

For the CGL socrates cluster, we have created a single queue for running jobs. SGE imposes no time or memory limits on jobs in this queue. There are also no user-specific limits on the number of jobs, either active or submitted. However, we expect users to exercise discretion when using SGE. Running a few active jobs at once when usage is light is probably fine; submitting 50 jobs at once and locking out other users will result in queue reconfiguration, such as defining per-user limits.

Setting up the SGE Environment

Before submitting a job, you must set up the proper SGE environment. SGE provides a script for this purpose; the command to invoke it depends on the type of shell you are using. To check which shell you are using, give the command:
> echo $SHELL
If the last component of the output is "sh", "ksh" or "bash", you are using a Bourne-shell compatible shell; if it is "csh" or "tcsh", you are using a C-shell compatible shell. (For the remainder of the document, "Bourne shell" refers to any of the Bourne-shell compatible shells, and "C shell" refers to any of the C-shell compatible shells.)
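
The same check can be scripted. The following is only a minimal sketch: "shell_family" is a hypothetical helper name, not an SGE command. It matches on the last path component of the shell, since "csh" and "tcsh" also end in "sh".

```shell
#!/bin/sh
# shell_family: hypothetical helper (not part of SGE).
# Prints which family a shell path belongs to, based on its last component.
shell_family() {
    case "$(basename "$1")" in
        csh|tcsh)    echo "C shell" ;;
        sh|ksh|bash) echo "Bourne shell" ;;
        *)           echo "unknown" ;;
    esac
}

# Classify the current login shell (empty if SHELL is not set).
shell_family "${SHELL:-}"
```

A result of "unknown" means the shell is not one of the five named in this document, and you will need to decide for yourself which settings script applies.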

If you are using Bourne shell, give the command:

> . /usr/local/sge/CGL/common/settings.sh
If you are using C shell, give the command:
> source /usr/local/sge/CGL/common/settings.csh
Neither command generates any output. All they do is make the SGE commands available without requiring you to prefix them with "/usr/local/sge/bin" each time. They also set up default SGE job submission parameters, which may be overridden when you submit jobs. If you use SGE frequently, you can put the command in the ".profile" or ".cshrc" (for Bourne shell and C shell respectively) in your home directory, and it will be executed automatically when you log in.
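
For Bourne shell users, one way to add the command to ".profile" is sketched below. The readability test is an addition for illustration, not something SGE requires; it simply keeps the login from printing an error on a machine where the settings file is not present. "SGE_SETTINGS" is a variable name chosen for this example.

```shell
# Fragment for ~/.profile (Bourne shell syntax).
# SGE_SETTINGS is a name chosen for this example; the path is the one given above.
SGE_SETTINGS=/usr/local/sge/CGL/common/settings.sh

# Source the settings script only if it exists and is readable.
if [ -r "$SGE_SETTINGS" ]; then
    . "$SGE_SETTINGS"
fi
```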

Submitting a Job

Once you've set up your SGE environment, you need to create a shell script file that contains the commands to run your job. Here is an example job script file that simply echoes a string to the standard output and then prints the current working directory:

#!/bin/sh
## The next line is an instruction to SGE: it tells SGE to email
## you when your job "b"egins, "a"borts, and "e"nds.
#$ -m bae
echo "Hello world"
pwd
## end of batch script
You can use either Bourne or C shell syntax, but you must select the shell to use on the first line; in this case, we are using Bourne (sh) shell syntax. The shell script may be named anything you wish. For this example, let's call the file "script.sge". To submit this script for execution in SGE, run the following command (the "> " represents the shell prompt; you do not need to type it):
> qsub script.sge
If you create the file "script.sge" with the contents above, and "qsub" it, you will almost immediately get an email message saying that the job has been started, and almost immediately after, that the job has completed. If a lot of other people have jobs running, your job might not start immediately. You can check on the status of your job by using the "qstat" command:
> qstat

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    240 0.25000 script.sge conrad       r     01/30/2006 10:12:49 all.q@adenine.cgl.ucsf.edu         1

In this case, only one job is running: its id is "240"; the command script is "script.sge"; and it's being run for user "conrad" (this, of course, will be your login name instead of mine when you try this). The "state" of the job is "r" which is short for "running"; it started running on January 30 around 10AM; and it is running in queue "all.q" on node "adenine.cgl.ucsf.edu". If the job has not yet started running, the state will be "qw", short for "queued and waiting".
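
If you only want the job id and state, the qstat output can be filtered. The sketch below assumes the column layout shown in the sample above and a job name without spaces; "job_states" is a hypothetical helper name, not an SGE command. Here it is fed the sample output from this page rather than a live qstat.

```shell
#!/bin/sh
# job_states: hypothetical helper (not part of SGE).
# Reads qstat-style output on standard input and prints "job-ID state"
# for every line whose first field is a number, which skips the header
# and the separator line.
job_states() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $5 }'
}

# Demonstration with the sample output shown above.
# On the cluster you would run:  qstat | job_states
job_states <<'EOF'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    240 0.25000 script.sge conrad       r     01/30/2006 10:12:49 all.q@adenine.cgl.ucsf.edu         1
EOF
```

For the sample above this prints "240 r". A job that is still waiting would show "qw" in place of "r".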

Changing Output File Location

The output of SGE jobs will be found in your home directory, in files "script.sge.oXXX" and "script.sge.eXXX". The "XXX" will match the job id; "script.sge.oXXX" will contain the standard output from your script, and "script.sge.eXXX" will contain the standard error. In fact, by default, any file names used in SGE jobs will be relative to your home directory. So if your job creates a file "output", it will also appear in your home directory. This is because SGE was designed to handle a loosely coupled cluster of nodes where not all directories are shared across all nodes. When a job is submitted on one node but executed on another, the execution node may not have access to the submission directory. Thus, for lack of a better place, SGE sets the current working directory to the user's home directory on the execution node when running a job.
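
As an illustration of the naming rule, the sketch below prints the two file names to look for, given a script name and a job id. It assumes the default behavior described above (output in your home directory); "sge_output_files" is a hypothetical helper name, not an SGE command.

```shell
#!/bin/sh
# sge_output_files: hypothetical helper (not part of SGE).
# Usage: sge_output_files SCRIPT_NAME JOB_ID
# Prints the default standard output and standard error file names.
sge_output_files() {
    echo "$HOME/$1.o$2"    # standard output of the job
    echo "$HOME/$1.e$2"    # standard error of the job
}

# Example: the job from the qstat listing (script.sge, job id 240).
sge_output_files script.sge 240
```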

It is, however, possible to override this behavior on the socrates cluster because all directories are shared across all nodes. So, instead of running jobs in your home directory, you can ask SGE to run the job in the directory where you submit your job. To do this, you can add the following line to the beginning of the script:

#$ -cwd
and our example script will look like:
#!/bin/sh
#$ -cwd
## The next line is an instruction to SGE: it tells SGE to email
## you when your job "b"egins, "a"borts, and "e"nds.
#$ -m bae
echo "Hello world"
pwd
## end of batch script
This will cause the script to execute in the directory from which the job was submitted. The output and error files will also be deposited in the job-submission directory.
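
Because the "#$" lines are ordinary comments as far as the shell is concerned, a job script can be tried out directly before it is submitted. The sketch below is one way to do that: it writes the example script into a scratch directory (created with mktemp, an assumption of this example) and runs it with sh. This only checks the commands in the script; the email and working-directory instructions take effect only under SGE.

```shell
#!/bin/sh
# Create a scratch directory and write the example job script into it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > script.sge <<'EOF'
#!/bin/sh
#$ -cwd
#$ -m bae
echo "Hello world"
pwd
EOF

# Local dry run: sh ignores the "#$" lines because they are comments.
output=$(sh script.sge)
echo "$output"

# When the dry run looks right, submit from this directory on the cluster:
#   qsub script.sge
```

Run this way, "pwd" prints the scratch directory, which is also the directory the job would run in when submitted from there with the "-cwd" instruction in place.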

Using SACS Tools with SGE

When your jobs run under SGE, they do not inherit the same execution environment as when you log in. In particular, they do not read the ".profile" or ".cshrc" (Bourne and C shell, respectively) start-up script. If you are a SACS user, this means your SACS environment is not set up in SGE jobs. The simple solution is to insert the following statements at the beginning of your command script to set up the SACS environment:
source /usr/local/lib/seq/seqpaths
source /usr/local/lib/seq/seqenvirons
You also need to make sure that you are using the SACS version of C shell, which sets up access permissions to SACS tools and databases. This is done by making the first line of your script:
#!/sacs/shells/csh
The remainder of your script should then be able to use SACS commands the same way as if you were logged in. (We've only given the C shell solution for using SACS tools because all SACS users use C shell by default. If you are a SACS user using Bourne shell and are having difficulties, please contact a SACS staff member for help.)

SGE Documentation

To get a detailed description of all the available SGE directives and how to use them, read the qsub manual page using the following command:
man submit
It provides information on what environment variables to use, what options are available, how input and output streams are handled, and much more.

For even more SGE documentation, visit the RBVI Sun Grid Engine page.


Last updated January 31, 2006
Please send corrections and suggestions to Conrad Huang

