New Cluster Assembly Journal

Jul 30, 2004
Looks like the down nodes are 00, 38, 51, and 55. 00 won't netboot. 38 installed but appears to have hardware issues after installation, possibly a bad disk; it's out of the PBS pool for now. 51 and 55 should have installed, but didn't.

OK, 51 is up and running. 55 just didn't get configured in the BIOS and is installing now.

Jul 22, 2004
NAG Fortran 95 compilers installed in /usr/local/stow/NAGWare-f95

Jul 21, 2004
A 61-node LAM MPI job over Myrinet works. Any node that shows as up in the output of 'pbsnodes -a' can be used for jobs. LAM has been installed with native Myrinet support and the PBS tm module, so jobs no longer need the pbslam script used earlier on other clusters; calling lamboot, mpirun, and friends as you would on stand-alone nodes is good enough. Sample submit script:
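#!/bin/bash
# Request 61 nodes for at most 40 minutes.
#PBS -l nodes=61
#PBS -l walltime=40:00
# Start the LAM runtime on the nodes PBS assigned to this job.
/usr/local/bin/lamboot $PBS_NODEFILE
# Run alltoall on every available CPU (C), with verbose RPI output.
/usr/local/bin/mpirun -ssi rpi_verbose level:1000 C alltoall
# Shut the LAM runtime down again.
/usr/local/bin/lamhalt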
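
To see which nodes are currently usable before picking a node count, the 'pbsnodes -a' listing can be filtered. The following is only a rough sketch: the function name list_up_nodes is made up for illustration, and it assumes the usual listing layout of an unindented node name followed by indented "key = value" lines, one of which is "state = ...".

```shell
#!/bin/bash
# Hypothetical helper: read `pbsnodes -a`-style output on stdin and print
# the names of nodes whose state is not down, offline, or unknown.
# Assumed record layout: an unindented node name, then indented
# "key = value" lines including "state = ...".
list_up_nodes() {
    awk '
        # An unindented single-word line starts a new node record.
        NF == 1 && $0 == $1 { node = $1; next }
        # Print the node unless its state mentions down, offline, or unknown.
        $1 == "state" && $2 == "=" {
            if ($3 !~ /down|offline|unknown/) print node
        }
    '
}

# Typical use on the head node (requires PBS to be installed):
#   pbsnodes -a | list_up_nodes
```

Counting the lines of output (for example with wc -l) gives an upper bound for the nodes= request in a submit script.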

Jul 17, 2004
61-node ssh test passed. Time to get MPI up and working. The newrhosts command has been deprecated in favor of passwordless ssh keys. Please see the updated cluster manual for instructions on setting up your keys.
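
Once your keys are in place, a quick way to confirm that passwordless ssh works everywhere is to run a trivial command on each node. This is a minimal sketch and not necessarily how the test above was run: the function name check_nodes and the SSH_CMD override are made up for illustration, and the node list is assumed to be a file with one hostname per line (such as $PBS_NODEFILE inside a job).

```shell
#!/bin/bash
# Hypothetical helper: run `true` over ssh on every host listed in the file
# given as $1 (one hostname per line) and report which hosts fail.
# SSH_CMD is an illustrative override hook (useful for dry runs); it
# defaults to the real ssh.
check_nodes() {
    local nodefile=$1 node failed=0
    while read -r node; do
        # Skip blank lines in the node list.
        [ -z "$node" ] && continue
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # -n stops ssh from reading (and swallowing) the rest of the node list.
        if ${SSH_CMD:-ssh} -n -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done < "$nodefile"
    # Non-zero exit status if any node could not be reached.
    return "$failed"
}

# Typical use inside a PBS job:
#   check_nodes "$PBS_NODEFILE"
```

Any node reported as FAIL either is down or still wants a password, which usually points at a missing or unreadable key.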

Jul 16, 2004
Finished PBS/ssh true issue testing and tested a 21-node job. The first 21 nodes are open for public testing. The remaining two racks installed successfully except for nodes 00, 51, and 55. Tomorrow, if there's time, we can test a 61-node job.

Jul 14-15, 2004
Finished kickstart testing; rolled Maui/PBS/Condor RPMs. Tim rolled the Myrinet driver RPM and tested it. A few nagging bugs remain with Condor and the post-install. Switched from the campus RHN proxy to Red Hat's due to scalability issues on installs of more than 5 nodes. First bulk install of 25 nodes in the first rack. All went well, except bug00 won't DHCP itself.

Jul 12-13, 2004
Myrinet work and some simple node install tests

Jul 7-9, 2004
Grunt labor time: set the BIOS to serial and netboot, and gather MAC addresses for all the machines.

Jul 6-7, 2004
Power on worked; time to try to get the first nodes to kickstart. Initial switch and managed power configuration. The cluster is on a 100 Mbit uplink until the fiber arrives.

Jul 1, 2004
New cluster arrives and it's time to assemble it. Power on will wait until Monday so that all components can adjust to the temperature and any condensation can evaporate.

74 boxes, 9 pallets
Finished! All 70 nodes racked.


June 19-24, 2004
Clean out the old SP2 and get the space prep'd for the new arrival

Student Staffers remove dozens of old disks :)

 

© Copyright 2003, Institute for Advanced Computer Study, University of Maryland, All rights reserved.