TORQUE/Maui Cluster

TORQUE stands for Terascale Open-Source Resource and QUEue Manager. It is an open-source distributed resource manager originally based on OpenPBS, the Portable Batch System (PBS). Our installation is used for running parallel jobs and for making use of dedicated reservations. We use a separate program called Maui for scheduling jobs in TORQUE.

TORQUE and Maui are installed at /opt/UMtorque and /opt/UMmaui respectively. Please make sure /opt/UMtorque/bin and /opt/UMmaui/bin are added to your PATH environment variable.
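For example, you could add something like the following to your shell startup file (a sketch; bash users would put it in ~/.bashrc or ~/.bash_profile, csh/tcsh users in ~/.cshrc):

# bash
export PATH=/opt/UMtorque/bin:/opt/UMmaui/bin:$PATH
# csh/tcsh
setenv PATH /opt/UMtorque/bin:/opt/UMmaui/bin:$PATH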
The cluster is composed of a frontend (submit nodes) and compute nodes. The frontend is to be used for editing your code, compiling, and submitting. To run any processing or testing of your code, you must submit it through the scheduler. The scheduler takes care of assigning compute nodes to jobs. Basically, when you get assigned a node, you will be the only person on it for the duration of your job. After your time limit is up or your process ends, the node will be cleaned and locked down for the next submission.
Logging in

To gain access to any of the nodes on a cluster, you will first need to log into one of the submit nodes using ssh. This machine acts as a gateway to the rest of the cluster. No intensive processing is to be run on the submit nodes. Submit nodes are shared with everyone else on the cluster, across various research projects throughout the institute. If you run an intensive process on a submit node, it will be killed so that other research is not affected.

More information about UMIACS cluster submit nodes and compute nodes is available here.
Setting up your environment

After you are logged in, you will have to set up your account to allow PBS access from any of the compute nodes. This is required because PBS will write stdout and stderr to files in your account. Use ssh-keygen with no password to create keypairs that can be used to grant access for your jobs. These can be generated by running the following (to test your keys afterwards, you should be able to 'ssh submitnode' and be returned to a prompt without a password):
cd $HOME
ssh-keygen -t rsa1 -N ""  -f $HOME/.ssh/identity
ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
ssh-keygen -t dsa -N "" -f $HOME/.ssh/id_dsa
cd .ssh
touch authorized_keys authorized_keys2
cat identity.pub >> authorized_keys
cat id_rsa.pub id_dsa.pub >> authorized_keys2
chmod 640 authorized_keys authorized_keys2
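For example, from a submit node you should now be able to run something like the following without being prompted for a password (opensub00 is just the submit node name used in the examples below; use whichever submit node you are logged into):

ssh opensub00 hostname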
Requesting interactive usage

Sometimes you will want to test an intensive program without preparing a submission script and going through the hassle of the scheduler. You can run '/opt/UMtorque/bin/qsub -I' to request interactive usage of a node. After running qsub -I, your shell will hang until a resource can be allocated to you. When the resource has been allocated, it will open up a new shell on the allocated node. You can now ssh into the node for the duration of the allocated shell. When you log out of the initial shell, or your time limit is up, the node will again be locked down and you will have to ask the scheduler for access again. Below is an example of getting an interactive session:
 [xhe@opensub01 24] qsub -I
qsub: waiting for job 152.opensrv.umiacs.umd.edu to start
qsub: job 152.opensrv.umiacs.umd.edu ready
 [xhe@openlab00 21] echo hello
hello
 [xhe@openlab00 22] exit
logout
qsub: job 152.opensrv.umiacs.umd.edu completed
 [xhe@opensub01 25]
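As with batch jobs, you can combine -I with resource requests (see the -l option under Commands below). For example, to ask for a one-hour interactive session spanning two nodes:

qsub -I -l nodes=2,walltime=01:00:00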
Running your first job

We will walk through a simple 'hello world' submission script to help you understand how submitting jobs works.

Create a submission file

In your home directory on a submit node, create a file called test.sh that contains the following:
 
  
                            
#!/bin/bash
#PBS -lwalltime=10:00
#PBS -lnodes=3
echo hello world
hostname
echo finding each node I have access to
for node in `cat ${PBS_NODEFILE}` ; do
 echo ----------
 /usr/bin/ssh $node hostname
 echo ---------- 
done
The script is a normal shell script except that it includes extra #PBS directives. These directives control how you request resources on the cluster. In this case we are requesting 10 minutes of total node time split across 3 nodes, so each node will be given roughly 3 1/3 minutes of access. People often forget to specify walltime for jobs that use more than 2 nodes. The default walltime is 48 hrs/node, so requesting 3 nodes will try to schedule 144 hours of cluster time, which exceeds the maximum allowed.

Submit the job to the scheduler using /opt/UMtorque/bin/qsub:
                            [xhe@opensub00 28]$ /opt/UMtorque/bin/qsub test.sh
123.opensrv.umiacs.umd.edu
You can check the status of your job by running /opt/UMtorque/bin/qstat:

[xhe@opensub00 29]$ /opt/UMtorque/bin/qstat -n
opensrv.umiacs.umd.edu: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
123.opensrv.umiacs.u xhe      dque     test.sh       --      3  --    --  48:00 R   -- 
   openlab00/0+openlab01/0+openlab02/0
 [opensub00 30]
This shows us that the job is running ('R') and is using nodes openlab00, openlab01 and openlab02. A 'Q' for status means that your job is waiting in line for resources to free up. If you requested too many resources, your job will sit in the queue until the end of time.

Check output

When your job is finished, you will have two files in the directory you submitted the job from. They contain stdout (.oJOBID) and stderr (.eJOBID). The job we submitted above generated an empty error file, test.sh.e123, and the following stdout file:

[xhe@opensub00 30]$ cat test.sh.o123
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
hello world
openlab00.umiacs.umd.edu
finding each node I have access to
----------
openlab00.umiacs.umd.edu
----------
----------
openlab01.umiacs.umd.edu
----------
----------
openlab02.umiacs.umd.edu
----------
 [xhe@opensub00 31]
The first three lines in your output are a standard part of how we have our cluster configured and do not affect how your program runs.

Running MPI programs as batch jobs

At UMIACS, we have LAM, Open MPI, MPICH1, and MPICH2 installed. LAM is installed at /usr/local/stow/lam-version; MPICH1 is available in /usr/local/stow/mpich1-version; MPICH2 is available in /usr/local/stow/mpich2-version; and Open MPI is available in /usr/local/stow/openmpi-version.
First, you need to have an MPI-based program written. Here's a simple one: alltoall.c

Please note that if you compile your program with MPICH, LAM, or Open MPI, you MUST execute it in the same environment. If you compile the program using mpicc from LAM and then attempt to run it using MPICH's mpiexec, it will fail and you will get an error message similar to the following:

It seems that there is no lamd running on the host openlab02.umiacs.umd.edu.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tried to invoke the "MPI_Init" function).

Please run the "lamboot" command to start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.

LAM

To compile this program and execute it using LAM, make sure /usr/local/stow/lam-7.1.4/bin is in your PATH environment variable. It can be compiled by doing:

mpicc alltoall.c -o alltoall-lam

The submission file lamsub.sh can be submitted to run your program:
#!/bin/bash
#PBS -l nodes=8
#PBS -l walltime=0:10:0
cd ~/torquejobs/lamtest
mpiexec -machinefile ${PBS_NODEFILE} alltoall-lam 
Here is what it looks like on your terminal after you submit the job:
[opensub00 142] qsub lamsub.sh
127.opensrv.umiacs.umd.edu
 [opensub00 143] qstat -n
opensrv.umiacs.umd.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
127.opensrv.umiacs.u xhe      dque     lamsub.sh     --      8  --    --  48:00 R   --
   openlab00/0+openlab01/0+openlab02/0+openlab03/0+openlab04/0+openlab05/0
   +openlab06/0+openlab07/0
 [opensub00 144]
Output files for this job: lamsub.sh.o127 and lamsub.sh.e127 (empty).

The submission file lamsub2.sh uses mpirun instead of mpiexec, which requires you to set up the MPI environment yourself by starting lamboot and then running lamhalt to stop it afterwards. We recommend that you use mpiexec, since it will set up the MPI runtime environment for your jobs.
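For reference, here is a minimal sketch of what an mpirun-based script such as lamsub2.sh might look like; the exact contents of lamsub2.sh are not reproduced here, and the lamboot/lamhalt calls reflect our assumption about how the LAM runtime would be managed:

#!/bin/bash
#PBS -l nodes=8
#PBS -l walltime=0:10:0
cd ~/torquejobs/lamtest
# start the LAM runtime on the nodes allocated to this job
lamboot $PBS_NODEFILE
# run the program with LAM's mpirun
mpirun -np 8 ./alltoall-lam
# shut the LAM runtime down again
lamhalt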
Open MPI

To compile and run this program using Open MPI, you need to include /usr/local/stow/openmpi-version in your PATH. The following example uses /usr/local/stow/openmpi-1.2.6. The following csh script will set up your path variables:
 setenv PATH /usr/local/stow/openmpi-1.2.6/bin:$PATH
 if ( $?LD_LIBRARY_PATH ) then
      setenv LD_LIBRARY_PATH /usr/local/stow/openmpi-1.2.6/lib:$LD_LIBRARY_PATH
 else
      setenv  LD_LIBRARY_PATH /usr/local/stow/openmpi-1.2.6/lib
 endif

The sample C code can be compiled by doing:

mpicc alltoall.c -o alltoall-openmpi

(we changed our environment to point to Open MPI's mpicc). The following is the submission file openmpisub.sh:
#!/bin/bash
# Special PBS control comments
#PBS -l nodes=8,walltime=0:10:0
# Set up the path
export PATH=/usr/local/stow/openmpi-1.2.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/stow/openmpi-1.2.6/lib:$LD_LIBRARY_PATH
cd ~/torquejobs/openmpitest
echo starting
mpiexec -mca btl tcp,self -n 8 ./alltoall-openmpi
echo ending
Here is what it looks like when you submit the script at your prompt:
 [xhe@opensub00 91] qsub openmpisub.sh
133.opensrv.umiacs.umd.edu
 [xhe@opensub00 92] qstat -n
opensrv.umiacs.umd.edu: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
133.opensrv.umiacs.u xhe      dque     openmpisub  16472     8  --    --  48:00 R   -- 
   openlab00/0+openlab01/0+openlab02/0+openlab03/0+openlab04/0+openlab05/0
   +openlab06/0+openlab07/0
 [xhe@opensub00 93]
Output files for this job: openmpisub.sh.o133 and openmpisub.sh.e133.

MPICH1

To compile and run this program under MPICH1, you need to set up your environment. The following csh script will set the appropriate environment:
setenv MPI_ROOT /usr/local/stow/mpich-version
setenv MPI_LIB  $MPI_ROOT/lib
setenv MPI_INC  $MPI_ROOT/include
setenv MPI_BIN $MPI_ROOT/bin
# add MPICH commands to your path (includes mpirun and mpicc)
set path=($MPI_BIN $path)
# add the MPICH libraries to your LD_LIBRARY_PATH
if ( $?LD_LIBRARY_PATH ) then
     setenv LD_LIBRARY_PATH  $MPI_LIB:$LD_LIBRARY_PATH
else
     setenv LD_LIBRARY_PATH  $MPI_LIB
endif
It can be compiled by doing:

mpicc alltoall.c -o alltoall-mpich1

(remember, we changed our environment to point to MPICH's mpicc). The submission file mpich1sub.sh is almost the same as before, except you need to call mpirun instead of mpiexec:
#!/bin/bash
# Special PBS control comments
#PBS -l nodes=8,walltime=60
# Set up the path
export PATH=/usr/local/stow/mpichgm-1.2.7p1-20/bin:$PATH
cd ~/mpich1test/
echo $PBS_NODEFILE
# Run the program
mpirun -np $( wc -l < $PBS_NODEFILE ) ./alltoall-mpich1
Here is what it looks like when you submit the job from a submit machine at your prompt:
[xhe@brood00 ~/mpich1test]$ qsub mpich1sub.sh
167.queen.umiacs.umd.edu
[xhe@brood00 ~/mpich1test]$ qstat -n
queen.umiacs.umd.edu: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
167.queen.umiacs.umd xhe      dque     mpich1sub.    --      8  --    --  04:00 R   -- 
   bug00/0+bug01/0+bug02/0+bug03/0+bug04/0+bug05/0+bug06/0+bug07/0
[xhe@brood00 ~/mpich1test]$ 
Output files for this job: mpich1sub.sh.o167 and mpich1sub.sh.e167 (empty).

MPICH2

To compile using MPICH2, you need to set up your environment for it: set up your path variables to include the MPICH2 version you want to use. You will make two changes: one to the PATH variable and the other to LD_LIBRARY_PATH. The following example uses MPICH2-1.0.7.

For a bash shell user, append the following to your .bash_profile:
    export MPICH2_HOME=/usr/local/stow/mpich2-1.0.7
    export PATH=$MPICH2_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$MPICH2_HOME/lib:$LD_LIBRARY_PATH

For a C shell user, append the following to your .cshrc:
    setenv MPICH2_HOME /usr/local/stow/mpich2-1.0.7
    setenv PATH $MPICH2_HOME/bin:$PATH
    setenv LD_LIBRARY_PATH $MPICH2_HOME/lib:$LD_LIBRARY_PATH
The sample C code can be compiled by doing:

mpicc alltoall.c -o alltoall-mpich2

(we use MPICH2's mpicc). Here is a sample submission file for MPICH2, mpich2sub.sh:
#!/bin/bash
#PBS -lwalltime=0:10:0
#PBS -lnodes=8
# Set up the path
export MPICH2_HOME=/usr/local/stow/mpich2-1.0.7
export PATH=$MPICH2_HOME/bin:$PATH
echo starting
mpiexec -n 8   /nfshomes/xhe/torquejobs/mpich2test/alltoall-mpich2
echo ending
Before you submit your job to the cluster, you need to do the following to start the mpd daemon, which must run on each compute node to be used by your program.

Make sure you have a file .mpd.conf in your home directory, with a line like this:

secretword=your-favorite-word
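One quick way to create this file (mpd generally insists that the file be readable only by you, so the chmod matters):

echo "secretword=your-favorite-word" > ~/.mpd.conf
chmod 600 ~/.mpd.conf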
Next, create a hostfile for the mpd daemons in some directory that you can reference. It lists the compute nodes on which you want daemons to be started, one node name per line, as follows:

 openlab00
 openlab01
 openlab02
 ...
 openlab07

Then start the mpd daemons by typing mpdboot -n <number-of-nodes> -f path-to-hostfile/hostfile. (You will need to run mpdallexit later to shut the daemons down after your job has finished.) After the daemons have started, you can submit the mpich2sub.sh script above using qsub. Here is what it looks like when you submit from your prompt:
 
[xhe@opensub01 68] mpdboot -n 8 -f mpd.hostfile
[xhe@opensub01 69] qsub mpich2sub.sh
140.opensrv.umiacs.umd.edu
[xhe@opensub01 70] qstat -n
opensrv.umiacs.umd.edu:
                                                                  Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
140.opensrv.umiacs.u xhe      dque     mpich2sub.   3403     8  --    --  48:00 R   --
  openlab00/0+openlab01/0+openlab02/0+openlab03/0+openlab04/0+openlab05/0
  +openlab06/0+openlab07/0
[xhe@opensub01 71] qstat -n
[xhe@opensub01 72] mpdallexit
[xhe@opensub01 73] 
Here are the standard output and standard error for this job: mpich2sub.sh.o140 and mpich2sub.sh.e140 (empty).
                         
Commands
 Please make sure /opt/UMtorque/bin is in your PATH environment variable. 
 
   
   qsub
 
    
    Basic usage
 The qsub program is the mechanism for submitting a job.  A
    job is a shell script, taken either from standard input or as an
    argument on the command line.
 
 The basic syntax of qsub, that you will probably be using
 most of the time, is:
 
 qsub -l nodes=<nodes> <scriptname>
 
 where <nodes> is the number of machines you'd like to allocate.
 Then, when PBS runs your job, the name of the file with the nodes
 allocated to you will be in $PBS_NODEFILE, and PBS will begin
 running your job on one single node from that allocation.
 
 When you run qsub, you will get a message like:
 
 123.opensrv.umiacs.umd.edu
 
 This is your job id.  This is used for many things, and you should
 probably keep a record of it.
 
 When a job finishes, PBS deposits the standard output and standard
 error as <jobname>.o<number> and
 <jobname>.e<number>, where
 <jobname> is the name of the script you submitted (or
 STDIN if it came from qsub's standard in), and <number>
 is the leading number in the job id.
 
 
    -l option
 The -l option is used to specify resources used by a PBS job.
 Two important ones are nodes, which specifies the number of nodes
 used, and walltime, which specifies the maximum amount of
 wall clock time that the process will use.  The following invocation
 of qsub runs a job on 2 nodes for one hour:
 
 qsub -l nodes=2,walltime=01:00:00
 
 It is important that you specify walltime.  Without it, your
 job may be scheduled unfavorably (because your job takes less than the
 thirty minute default).  Even worse, your job may be terminated
 prematurely if you go over the thirty minute default.
 
 See pbs_resources(7) for more information.
 
 
 
     
     The nodes resource
 In addition to specifying the number of nodes in a job, you can also
 use the nodes resource to specify features required for your job.
 
 
     Submitting to specific nodes
 To submit to a specific set of nodes, you can specify those nodes,
 separated by a "+" character, in the nodes resources.  For
 instance:
 
 qsub -l nodes=openlab00+openlab01,walltime=60
 
 ... will submit a two node job on openlab00 and openlab01,
 with a maximum time of sixty seconds.
 
 In general, this should be avoided, since you are limited to those
 nodes that you specify.  However, if you have files that only
 reside on particular nodes, in the scratch space, you might want to
 use this option.
 
 
    -I option
 To submit an interactive job, use the -I option:
 
 qsub -l <resources> -I
 
 Then, instead of enqueuing a batch job and exiting, the qsub
 program will wait until your interactive job runs.  When it does, PBS
 will present you with a shell on one of the nodes that you have been
 allocated.  You can then use all nodes allocated, until your time
 allocation is consumed.
 
 
    Extended job descriptions
 The qsub program will let you put information about your job
 in your script, by including comments that begin the line with '#PBS',
 and include a single command line option.  For instance, if I always
 want my job to use two nodes, I could put the following at the
 beginning of my script:
 
 #PBS -l nodes=2
 
 The "EXTENDED DESCRIPTION" heading in qsub(1) has
 more information about using this feature.
 
 
   qstat
 This program tells you the status of your jobs and other people's jobs.
 The basic case of running qstat is very simple: you just run
 qstat, with no options.  If it gives no output, it means
 there are no jobs in the queue.
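 A few common invocations (the -n and -u options are standard qstat flags;
 $USER is your login name):
 
 qstat              # list all jobs in the queue
 qstat -n           # also show the nodes assigned to each running job
 qstat -u $USER     # show only your own jobs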
 
 
   qdel
 The qdel program is used to remove your job from the queue,
 and cancel it if it's running.  The syntax for qdel is "qdel <job id>",
 but you can abbreviate the job ID with just the leading number.
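 For example, to cancel the job submitted earlier in this document, either of
 the following would work:
 
 qdel 123
 qdel 123.opensrv.umiacs.umd.edu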
 
 
   pbsnodes
 The pbsnodes command is used to list nodes and their status.  You will
 probably only use this one way, with the "-a" argument:
 
 pbsnodes -a
 
 
   pbsdsh
 Run a shell command on all of the nodes allocated to your job.
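 For example, from inside a job (a batch script or an interactive session), the
 following prints the hostname of every node/slot allocated to the job; giving a
 full path is the safe choice, since pbsdsh may not search your PATH:
 
 pbsdsh /bin/hostname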
 
 
 For more information, see the man pages for the PBS commands.  If, for some reason,
 these can't be viewed with the default manpath, you can use: man -M /opt/UMtorque/man <topic>

Compilers
The following compilers and memory-analysis tools are available at UMIACS.

GNU compilers
Besides the default GNU C/C++ and Fortran compilers, UMIACS also has several versions of gcc installed in /usr/local/stow/gcc-version directories (see the PATH sketch at the end of this section).

PGI compiler
The Portland Group C and Fortran compilers are installed in the /opt/pgi directory.

Intel
The Intel C and Fortran compilers are installed at /opt/intel.

NAGWare
UMIACS has the NAGWare Fortran compiler; it is installed in the /opt/NAGWare_f95 directory.

Insure++
Parasoft Insure++ is a runtime analysis and memory error detection tool for C and C++. It is installed in the /opt/insure directory.

To find other software installed at UMIACS, please check the /usr/local/stow or /opt directories.
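As with the MPI installations above, you select a particular version installed under /usr/local/stow by putting its bin directory first in your PATH. For example (gcc-version here is a placeholder for whatever version directory actually exists under /usr/local/stow):

export PATH=/usr/local/stow/gcc-version/bin:$PATH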
UMIACS Condor pool information can be found at: condorintro.html