x86 Cluster Manual

  • Introduction

    • What is a cluster?

      Traditionally, supercomputers are single big expensive machines, with
      a shared memory and one or more processing units. A cluster, by
      contrast, is a collection of independent, usually cheaper, machines,
      used together as a supercomputer. This gets better performance per
      dollar for many classes of problems.

    • How do I use a cluster?

      Clusters are usually programmed using a message passing library, such
      as MPI or PVM. We have two different implementations of MPI and the
      latest version of PVM installed on the cluster. More information
      about MPI in general can be found at:

      http://www-unix.mcs.anl.gov/mpi/

      The cluster is managed through a batch queuing mechanism (PBS), used
      to arbitrate access to the computing resources. We use this to
      enforce certain policies on the cluster, and to ensure that only one
      user is ever using a single node.

    • How do I get an account on the cluster?

      Go to:

      https://intranet.umiacs.umd.edu/intranet/accountRequest/

      In the "Comment:" field, put down that you'd like an account
      added for the the "red/blue nodes" The specified faculty member in UMIACS will then approve the account.

    • How is the cluster configured?

      The front end to the cluster is redleader.umiacs.umd.edu. This is the host you will connect to when using the cluster. The other nodes are not accessible from the outside world (though they are accessible from the Computer Science and UMIACS domains).

      There are two communication mechanisms in the cluster. The one used by research processes is a gigabit ethernet on an Alcatel 5022 OmniCore switch. The other is the regular fast ethernet, which is mainly there for administrative and utility purposes.
      There are four classes of nodes:

      • red

        There are sixteen red nodes, named red01 through red16. These are the oldest nodes in the cluster, but we've gotten better performance out of their gigabit interfaces:
        • Dual Pentium II 450MHz processors
        • 1GB Memory
        • 18GB Disk
        • Gigabit ethernet interface (PacketEngines GNIC-II)

      • blue

        There are twelve blue nodes, named blue01 through blue12.

        • Dual Pentium III 550MHz/512kB cache processors
        • 1GB Memory
        • 36GB Disk
        • Gigabit ethernet interface (Intel EtherExpress/1000)

      • deathstar

        There is a single 8-way node called deathstar. It was purchased at the same time as the blue nodes. It is not managed through PBS - any user can log on at any time and run processes.
        • Eight Pentium III 550MHz/1MB cache processors
        • 4GB Memory
        • 36GB Disk
        • Gigabit ethernet interface (Intel EtherExpress/1000)

      • rogue

        There is a cluster of 50 rogue nodes not managed through PBS. Access is limited to a subset of the general cluster users. The interconnect is channel-bonded fast ethernet running at 100 Mbit/s.
        • One Pentium III 650MHz/1MB cache processor
        • 768MB Memory
        • 160-320GB Disk
        • Channel Bonded Fast Ethernet

    • Configuring your environment.

      After you are logged in, you will have to set your account up to allow PBS to access it from any of the processing nodes. This is required because PBS writes your job's standard output and standard error to files in your account. Use ssh-keygen with an empty passphrase to create keypairs that grant this access. These can be generated by running the following:
      
      cd $HOME
      # generate protocol 1 (RSA1) and protocol 2 (RSA and DSA) keys,
      # all with empty passphrases
      ssh-keygen -t rsa1 -N ""  -f $HOME/.ssh/identity
      ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
      ssh-keygen -t dsa -N "" -f $HOME/.ssh/id_dsa
      cd .ssh
      # authorize the new public keys for logins to your own account
      touch authorized_keys authorized_keys2
      cat identity.pub >> authorized_keys
      cat id_rsa.pub id_dsa.pub >> authorized_keys2
      chmod 640 authorized_keys authorized_keys2
      
      To test your keys, you should be able to run 'ssh hostname', where hostname is the name of the machine you are logged into (not localhost), and be returned to a prompt without being asked for a password or passphrase.
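
      For example, if you are logged into redleader, a quick check (just an illustration) would be:

      ssh redleader.umiacs.umd.edu hostname

      If the keys are set up correctly, this prints the hostname and exits without prompting for a password.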

  • PBS
    • Overview
      • Introduction to PBS

        The Linux cluster has shifted from a loosely organized set of timeshared machines to an orderly space-shared cluster of processing resources. This change has both benefits and potential problems, but is ultimately necessary as usage of the cluster increases. This document is intended to get you started with this system.

      • Why use a scheduler?

        • No competition for timeshared resources

          The old alpha cluster was presented to the user as simply a collection of multiuser machines. The problem with that was that people would sometimes compete with each other for cycles, for example by spawning many copies of their jobs on a single node. Sometimes this would drive the system load as high as 40. No one wins in that situation: everyone's jobs take longer to complete, even if you have 30 copies running.

        • More easily reproduced results

          Other than system processes, the only person with processes on a given node will be you. There won't be any editor or login sessions from uninvited users. Additionally, we've taken extra steps to help minimize the effect of system processes.

      • What kinds of problems is this going to cause?
        • Different way of doing things

          The new setup will require that you learn a few new tools and techniques. If you've had experience on the old SP, then using the job queuing mechanism should be fairly familiar to you.

          The normal way of starting a process on a UNIX machine is to log in, or start it remotely with rsh or ssh. The cluster uses a different mechanism to start processes. You'll need to go through the Job Queuing Mechanism, described below.

    • Basic Commands

      By default, the PBS commands should be in your path (in /usr/local/bin). If they aren't, you should add that directory to your path.

      Please make sure /opt/UMtorque/bin is in front of /usr/local/bin in your PATH environment variable. We will make this the default after we upgrade all of the cluster to use TORQUE.
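
      For example, assuming a bash login shell (other shells use different syntax and startup files), adding the following line to your ~/.bashrc would put both directories in the right order:

      export PATH=/opt/UMtorque/bin:/usr/local/bin:$PATH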


      • qsub
        • Basic usage

          The qsub program is the mechanism for submitting a job. A job is a shell script, taken either from standard input or as an argument on the command line.

          The basic syntax of qsub, which you will probably be using
          most of the time, is:

          qsub -l nodes=<nodes> <scriptname>

          where <nodes> is the number of machines you'd like to allocate.
          When PBS runs your job, the environment variable $PBS_NODEFILE
          will hold the name of a file listing the nodes allocated to you,
          and PBS will begin running your job on a single node from that
          allocation.

          When you run qsub, you will get a message like:

          187.rogueleader.umiacs.umd.edu

          This is your job id. This is used for many things, and you should
          probably keep a record of it.

          When a job finishes, PBS deposits the standard output and standard
          error as <jobname>.o<number> and
          <jobname>.e<number>, where
          <jobname> is the name of the script you submitted (or
          STDIN if it came from qsub's standard in), and <number>
          is the leading number in the job id.
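
          For example, a minimal job script that simply reports its
          allocated nodes might look like the following (the name
          myjob.sh is just an illustration):

          #!/bin/bash
          # myjob.sh - print the nodes PBS has allocated to this job
          echo "Allocated nodes:"
          cat $PBS_NODEFILE

          You would submit it on two nodes with:

          qsub -l nodes=2 myjob.sh

          If the returned job id were 187.rogueleader.umiacs.umd.edu, the
          output and error would end up in myjob.sh.o187 and myjob.sh.e187.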

        • -l option

          The -l option is used to specify resources used by a PBS job.
          Two important ones are nodes, which specifies the number of nodes
          used, and walltime, which specifies the maximum amount of
          wall clock time that the process will use. The following invocation
          of qsub runs a job on 2 nodes for one hour:

          qsub -l nodes=2,walltime=01:00:00

          It is important that you specify walltime. Without it, your
          job may be scheduled unfavorably (the scheduler will assume it
          needs the full thirty minute default, even if it actually takes
          less). Even worse, your job may be terminated prematurely if it
          runs over the thirty minute default.

          See pbs_resources(7) for more information.

          • The nodes resource

            In addition to specifying the number of nodes in a job, you can also
            use the nodes resource to specify features required for your job.

            The most important features, currently, are the "red" and
            "blue" features, which will force your job on to one or the
            other class of nodes (you should only specify one of "red" or
            "blue"). Without it, your job might be scheduled across a
            heterogeneous set of nodes, which is probably not what you want.

            The only other feature currently implemented is the "gigabit"
            feature, which will ensure that your job gets scheduled on a node with
            a working gigabit ethernet interface.

            The following runs a job on two red nodes with working gigabit
            interfaces for one hour:

            qsub -l nodes=2:red:gigabit,walltime=01:00:00

          • Submitting to specific nodes

            To submit to a specific set of nodes, you can specify those nodes,
            separated by a "+" character, in the nodes resources. For
            instance:

            qsub -l nodes=red01+red02,walltime=60

            ... will submit a two node job on red01 and red02,
            with a maximum time of sixty seconds.

            In general, this should be avoided, since it restricts your
            job to exactly the nodes you specify. It is mainly useful when
            you need particular nodes, for instance because you have files
            that reside only in the scratch space of those nodes.

        • -I option

          To submit an interactive job, use the -I option:

          qsub -l <resources> -I

          Then, instead of enqueuing a batch job and exiting, the qsub
          program will wait until your interactive job runs. When it does, PBS
          will present you with a shell on one of the nodes that you have been
          allocated. You can then use all nodes allocated, until your time
          allocation is consumed.
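
          For example, the following requests an interactive session on
          two blue nodes for one hour:

          qsub -l nodes=2:blue,walltime=01:00:00 -I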

        • Extended job descriptions

          The qsub program will let you put information about your job
          in the script itself, using comments that begin with '#PBS'
          and contain a single command line option. For instance, if I
          always want my job to use two nodes, I could put the following
          at the beginning of my script:

          #PBS -l nodes=2

          The "EXTENDED DESCRIPTION" heading in qsub(1) has
          more information about using this feature.
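
          As a fuller (illustrative) example, the following script always
          asks for two red nodes and an hour of walltime, and uses the
          qsub -N option, which gives the job a name:

          #!/bin/bash
          #PBS -l nodes=2:red,walltime=01:00:00
          #PBS -N examplejob
          echo "Allocated nodes:"
          cat $PBS_NODEFILE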

      • qstat

        This program reports the status of your jobs and other people's
        jobs. The basic case of running qstat is very simple: you just
        run qstat, with no options. If it gives no output, it means
        there are no jobs in the queue.
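
        In addition to running it with no options, the standard -u and -f
        options are often useful (the job id here is the one from the qsub
        example above):

        qstat -u $USER
        qstat -f 187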

      • qdel

        The qdel program is used to remove your job from the queue,
        and cancel it if it's running. The syntax for qdel is "qdel <job id>",
        but you can abbreviate the job ID with just the leading number.
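
        For example, either of the following cancels the job from the
        qsub example above:

        qdel 187
        qdel 187.rogueleader.umiacs.umd.edu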

      • pbsnodes

        The pbsnodes command is used to list nodes and their status. You will
        probably only use this one way, with the "-a" argument:

        pbsnodes -a

    • More information

      For more information, see the PBS man pages. If, for some reason,
      these can't be viewed with the default manpath, you can use:

      man -M /usr/local/stow/pbs/man <topic>

      Possible subjects include the four commands listed above, plus the
      following:

      • qalter
        Alter a batch job

      • qhold
        Hold a batch job

      • qmsg
        Send a message to a batch job

      • qmove
        Move a batch job to another queue

      • qrls
        Release held jobs

      • qrerun
        Rerun a batch job

      • qselect
        Select a specific subset of jobs

      • qsig
        Send a signal to a batch job

      • pbsdsh
        Run a shell command on all nodes allocated

  • Policy

    In order to make sure everyone gets fair usage of the cluster, we've
    implemented a set of policies on the cluster to regulate usage.

    • Batch system limits

      You can submit as many jobs as you'd like, but only two of them will
      ever be considered for scheduling. In addition to this, there are
      some important limits that change, depending on the time of day.

      Overall limits on the number of node*hours per month may be imposed if
      users are found to be monopolizing use of the machine.

      At the current time, the policy is a maximum of 96 node*hours for any
      given job, with a maximum of 48 walltime hours and two
      running jobs per user.

      • What is walltime?

        Wall time is the maximum amount of real-world time that your job
        can use. For instance, if you submit a job with a walltime
        specification of "01:00:00", then you've requested a job
        allocation for one hour "by the wall clock". This is in contrast
        to the amount of time spent on the CPU in the user state, the
        amount of time spent on the CPU in the system state, or the
        amount of time spent idle (in an I/O wait, for instance).
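
        For example, both of the following request thirty minutes of wall
        clock time; a bare number is interpreted as seconds (the script
        name is just a placeholder):

        qsub -l nodes=2,walltime=00:30:00 myjob.sh
        qsub -l nodes=2,walltime=1800 myjob.sh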

      • What is a node*hour?

        The node*hour is the number of nodes you've requested, multiplied by
        the walltime you've requested. The default walltime
        is 48 hours, so if you submit a job with more than two nodes and you
        don't specify a walltime, then you will exceed the node*hour
        limit, and your job will never run.
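
        For example, a three-node job left at the default walltime requests
        3 * 48 = 144 node*hours, which exceeds the 96 node*hour limit, so
        it will never run. Explicitly requesting a shorter walltime brings
        it under the limit (3 * 24 = 72 node*hours); the script name below
        is just a placeholder:

        qsub -l nodes=3,walltime=24:00:00 myjob.sh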

    • Reservation limits

      Occasionally, it is necessary to be able to ensure the availability of
      resources. This is done through the reservation mechanism.
      Currently, only staff can make reservations. We have the following
      guidelines as to what is allowed:

      • Advance notification

        You must let us know at least a week in advance so that we can
        schedule a reservation.

      • Day time

        Up to four nodes can be reserved for up to four hours.

      • Night time

        One to twelve nodes can be reserved for up to three hours, and more
        than twelve nodes can be reserved for up to one hour.

      • Exceptional situations

        Reservations that exceed these guidelines can be made, pending the
        approval of the principal investigators.

      • How do reservations work?

        The job scheduler will only schedule your jobs on those nodes during
        the reservation period. In order to run jobs on those nodes,
        specifically, you must use the syntax for submitting jobs to specific
        nodes (see the section on qsub in this manual).

    • Disk

      Currently there are varying amounts of scratch disk space on every
      cluster node. This space is available as /scratch* for anyone to use.
      We cannot guarantee the contents of these disks will remain from
      install to install, so you should not put anything on these disks that
      cannot be replaced.

  • Other software

    The cluster would not be very useful without other software to support
    it. The following tools have been installed to facilitate your work.

    • LAM/MPI

      • Description

        LAM/MPI is an implementation of MPI from the University of Notre Dame.

      • Use

        LAM is the default MPI implementation on the cluster.

        The C compiler for LAM is mpicc. This ensures that your program gets
        linked against the appropriate libraries. To run a LAM program, you
        need to run lamboot, then mpirun.

        For instance, if you wanted to run a given program on two nodes, you
        would do the following:

        mpicc -o program program.c

        In your job script, you'd put the following:

        #!/bin/bash
        #PBS -l nodes=2
        lamboot $PBS_NODEFILE
        mpirun -np 2 program


        The pbslam program, though, handles all of the
        initialization, running, and cleanup for LAM under PBS, so you
        wouldn't have to call these programs yourself.

        NOTE: There are program name conflicts between LAM and MPICH, so you
        probably just want to stick to having one in your path.

      • Information
        • Man pages

          For more information, see the LAM man pages. If, for some reason,
          these can't be viewed with the default manpath, you can use:

          man -M /usr/local/stow/lam/man <topic>

        • WWW

          More information about LAM can be found at:

          http://www.lam-mpi.org

      • LAM and PBS

        There is a script called /usr/local/bin/pbslam that makes
        executing LAM programs safe and easy under PBS. For example, if your
        program is located in ~/src/program/, the following PBS job
        file would work:

        #!/bin/bash
        #PBS -l nodes=2
        cd ~/src/program/
        exec /usr/local/bin/pbslam -gvCD ./program


        In this case, the -g option causes pbslam to use the gigabit
        interfaces, -v causes pbslam to produce verbose output,
        -C causes pbslam to not check processor activity levels
        before executing, and -D causes pbslam to use the location of
        the program as the current directory.

        This should be your preferred mechanism for using LAM. For
        information on using that script, see the built in help, with:

        /usr/local/bin/pbslam -h

    • MPICH

      • Description

        MPICH is an implementation of MPI from the Argonne National
        Laboratory.

      • Use

        To use it, put /usr/local/stow/mpich-version/bin in your path BEFORE
        /usr/local/bin.
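
        For example, assuming a bash shell (and with mpich-version standing
        in for the actual installed version directory), you could do:

        export PATH=/usr/local/stow/mpich-version/bin:$PATH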

        The C compiler for MPICH is mpcc. This ensures that your program gets
        linked against the appropriate libraries. To run a MPICH program, you
        need to use mpirun.

        For instance, if you wanted to run a given program on two nodes, you
        would do the following:

        mpcc -o program program.c

        Then, in your job script, you'd put the following:

        #!/bin/bash
        #PBS -l nodes=2
        mpirun -np 2 -machinefile $PBS_NODEFILE program


        The pbsmpich program, though, handles all of the initialization,
        running, and cleanup for MPICH under PBS.

        NOTE: There are program name conflicts between LAM and MPICH, so you
        probably just want to stick to having one in your path.

      • Information
      • MPICH and PBS

        There is a script called /usr/local/bin/pbsmpich that makes
        executing MPICH programs safe and easy under PBS. For example, if
        your program is located in ~/src/program/, the following PBS job
        file would work:

        #!/bin/bash
        #PBS -l nodes=2
        cd ~/src/program/
        /usr/local/bin/pbsmpich -gvCD ./program


        In this case, the -g option causes pbsmpich to use the
        gigabit interfaces, -v causes pbsmpich to produce verbose
        output, -C causes pbsmpich to not check processor activity
        levels before executing, and -D causes pbsmpich to use the
        location of the program as the current directory.

        This should be your preferred mechanism for using MPICH. For
        information on using that script, see the built in help, with:

        /usr/local/bin/pbsmpich -h

    • PVM
      • Description

        PVM is a system designed to present an abstract parallel machine to
        the programmer.

      • Use

        The default version of PVM that comes with RH 7.3 is installed. I
        can make no guarantees as to whether or not it works, since I
        don't know how to use it, nor do I have any sample programs that
        use it. Do yourself a favor and use MPI.

      • Information

        More information about PVM can be found at:

        http://www.epm.ornl.gov/pvm/

    • Insure++
      • Description

        Insure++ is a set of memory debugging tools, much like Purify.

      • Use

        Insure should be in the default path, in /usr/local/bin.

      • Information

 
