
High-Performance Computing at the RBVI

The RBVI maintains a cluster of high-performance servers to meet the compute- and data-intensive needs of our user community. The cluster appears to users as a single computing environment and is described in more detail below. In most cases, RBVI users can also get access to the "QB3 at UCSF" shared computing cluster, which comprises more than 4,000 CPU cores and is used for compute-intensive simulations.

Hardware:

The RBVI "plato" cluster is based on Hewlett Packard's (HP) ProLiant DL500 series servers. These systems, such as the DL580 G7, typically have four 8-core dual-threaded Intel Xeon processors and are configured with as much as 128GB of memory to enable efficient hosting of database applications. Storage is provided by a fibre channel based storage area network (SAN) implemented using HP's StorageWorks EVA6400 and Storageworks EVA5000 arrays, with 35+ TB of disk. Time among the various cluster nodes is synchronized using a highly accurate Praecis time and frequency reference module provided by EndRun Technologies.

Our older "socrates" cluster is in the process of being retired. It is based on HP's AlphaServer family of computers and includes a 32-processor GS1280 and four 4-processor ES45s. These systems are multi-processor servers organized in a symmetrical multiprocessor (SMP) architecture. The GS1280 system has thirty-two Alpha EV7 processors and 64 GB of memory, while each ES45 system has four Alpha EV68 processors and 16 GB of memory. All servers are interconnected using a high-bandwidth, low-latency interconnect technology known as Memory Channel, supporting 90MB/s channel bandwidth between any two server nodes and 2.1 usec end-to-end latency.

Software:

The "plato" cluster is based on Linux and runs the Red Hat Cluster Suite to provide application and services failover and load balancing in order to maximize reliability and uptime. The GFS2 Global File System allows all nodes on the cluster to have direct concurrent access to the same shared file storage.

The "socrates" cluster runs HP's TruCluster Server operating system, and provides for high-performance, scalable, highly available services. All server nodes utilize the same "single system image" of the operating system, and home directories, user files, and system files are accessible from all nodes in the cluster, resulting in location independence for all application software. This technology makes it possible to do application load sharing among cluster nodes, so that large compute-intensive jobs can be run on separate nodes from interactive jobs, for example. This technology also provides a highly-available computing environment, since a hardware or software failure on one member of the cluster results in the migration of those services provided by that node onto the remaining active nodes of the cluster. The entire cluster is accessed through a common cluster address (cluster alias). Depending on which server nodes are available and which specific service is being accessed (e.g. web server), the cluster alias resolves to a specific node that then provides the requested service. Additional technical details on TruCluster Server are available here.

Sun Grid Engine (SGE) schedules and controls the execution of compute-intensive jobs on both clusters.
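As a minimal sketch of the workflow (the script name, job name, and resource request below are hypothetical; actual queue names and limits depend on the cluster's configuration), a batch job is described in a short submission script and handed to SGE with qsub:

    #!/bin/sh
    #$ -N example_job           # job name shown in qstat
    #$ -cwd                     # run in the directory the job was submitted from
    #$ -l h_rt=01:00:00         # request one hour of wall-clock time
    #$ -o example_job.log       # write output to this log file
    #$ -j y                     # merge stderr into the same log
    ./my_analysis input.dat     # the compute-intensive program to run

The script is submitted with "qsub example_job.sh"; "qstat" then reports its progress through the queue, and SGE dispatches it to a suitable cluster node.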

