GeoWulf

UCDavis
Current status:


History:

1998

February: My first parallel cluster computer (@Clarkson).
1999

September: My second parallel machine (first @UCDavis). The first 4 nodes have been installed; by the end of the school year we'll have 24 nodes. Moreover, we are working in close collaboration with Professor Kleeman and will be able to connect GeoWulf with his Beowulf machine, for a total of more than 128 computational nodes by fall of 2000.

October: 8 nodes installed and the front-end computer just arrived. MPI 1.1.2 installed.

November: Portland Group compilers installed; Parasoft Insure++ and CodeWizard installed.


2000

January: New MPI 1.2 installed and tested. Minor problems with the front-end computer fixed. Mysterious networking problem fixed (a cron job now pings the outside world every 15 minutes); a sketch of such a keep-alive entry is shown below.
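A minimal sketch of that keep-alive workaround, assuming a standard crontab and an arbitrary campus host as the ping target (the actual host and options were not recorded here):

    # run every 15 minutes; a single ping keeps the uplink state alive
    */15 * * * * ping -c 1 www.ucdavis.edu > /dev/null 2>&1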

February: Single-pile and pile-group jobs running in parallel on GeoWulf.



March: Memory upgrade for the controller machine: 0.5GB added. The controller machine now has 1GB of RAM.

October: BIG1 disk (IBM Ultrastar 34.6GB SCSI) is failing on the main machine. Still holding on, but it is quite noisy and reports problems frequently.

November: Eight more nodes (AMD Duron 750MHz, 128MB RAM, 20GB disk) + service node (AMD Duron 750MHz, 128MB RAM, 2 x 46GB disks) ordered.

December: Nodes arrived.


2001

January: Putting together upgrade nodes. Upgrade team: Mark Olton, Frank McKenna, Zhaohui Yang and Boris Jeremic.



February: System installed on the upgrade computers. GeoWulf now consists of 16 heterogeneous node computers based on Intel Pentium II (8 nodes) and AMD Duron (8 nodes) processors, one controller computer (dual Pentium III) and one service computer (AMD Duron). It features over 3GB of distributed memory and over 420GB of distributed disk space.

February: BIG1 disk almost unusable; need to get a replacement.

May: BIG1 disk replaced.

June: Service machine upgraded to a full workstation. To be used for testing firewalls, backup systems and various other things.

October: Design of new head machine Koyaanisqatsi (dual AMD 1.2GHz, 2GB RAM, 36GB SCSI disk...)

December: Koyaanisqatsi complete; problems with RedHat (unsupported hardware?)




2002

February: SuSE installed instead of RedHat; still a problem with the last 0.5GB RAM chip.

May: Installing Globus on the cluster.

July: Reinstalling nodes with SuSE.

August: Reinstalling MPI with the Globus device; testing of OpenSeesGrid in progress.




2003

April: Geomechanics added to the head section of the cluster.

May: Reinstalling nodes with RedHat.

June: GeoWulf and Geomechanics upgraded to dual AMD 2400 with 2GB of RAM each. The current setup is as follows: 16 heterogeneous node computers based on Intel Pentium II (8 nodes) and AMD Duron (8 nodes) processors, two controller computers (dual AMD 2400) and one service computer (AMD Duron). The system features over 6GB of distributed memory and over 420GB of distributed disk space.

July - September: Koyaanisqatsi machine becomes a RAID server (over 0.5TB).

September: A 35GB SCSI disk drive replaced BIG1 on GeoWulf (disk failure?).


2004

February: Fried CPU and motherboard on Koyaanisqatsi.

March: Disk failure on Koyaanisqatsi.

June: GeoWulf moved next door to an air-conditioned room (temperature kept below 18 degrees Celsius).

November: USB disk (115GB) added for backups.


2005

May: Shunya SMP machine (4x Intel dual-core CPUs) added to the GeoWulf cluster (Prof. Kunnath).

September: Initial upgrade to a Gigabit Ethernet network.


2006

March: Construction of 4 new nodes, dual-core Pentium D with 2GB RAM each... This was in part a class project (Computational Geomechanics, ECI285).



June: System hacked (due to a small, careless omission). Operating system reinstalled from scratch (Fedora Core 5). All numerical and message-passing libraries recompiled to optimize performance.

July: New 4x2-core "TeachingMachine" connected to the system.

October: Deployment of 4 new nodes (dual-core, Intel Core Duo technology), each with 4GB of memory.

December: Current GeoWulf system (note the much smaller "footprint", yet the most powerful machine so far).


2007

July: Addition of another head node (dual CPU, 4GB RAM, 400GB disk).

August: Problem with cooling in the computer room...


2008

April: Cooling problems; compute nodes survived 12 hours at over 60°C...

May: Implementation of alternative (additional) air flow for cooling.


2009

September: Adding/replacing nodes, moving to many multi-core CPUs as well as GPGPUs.

November: The Tesla C1060 is here (together with the new Koyaanisqatsi machine: Xeon E5520 quad-core, 2.26GHz, 12GB RAM, 858GB RAID disk).



Boris Jeremić