UABgrid Documentation:Community Portal

Mission
HPC Services is the division within the IT Infrastructure Services organization that focuses on HPC support for research and other HPC activities. HPC Services support includes HPC Cluster Support, Networking & Infrastructure, Middleware, and Academic Research Support. Research support means specifically assisting or collaborating with grant activities that require IT resources. It may also include acquiring and managing high performance computing resources, such as Beowulf clusters and network storage arrays. HPC Services participates in institutional strategic planning and self-study as related to academic IT. HPC Services represents the Office of the Vice President for Information Technology on IT-related academic campus committees and, as requested, in regional and national technology research organizations and committees.

Note: The term HPC is used to mean high performance computing, which has many definitions available on the web. At UAB, HPC generally refers to “computational facilities substantially more powerful than current desktop computers (PCs and workstations) … by an order of magnitude or better.” See http://parallel.hpc.unsw.edu.au/rks/docs/hpc-intro/node3.html for a fuller description of this usage of HPC.

Overview
HPC Services is a unit within IT Infrastructure Services in the Office of the Vice President for Information Technology. Our mission is to support the high performance computing and collaboration application needs of researchers by fostering the growth of an HPC-grid computing and Internet2/NLR 10Gbps networking infrastructure able to harness resources across organizational boundaries.

Our active projects include:

UABgrid is a standards-based software infrastructure built upon components like the Globus Toolkit, Shibboleth, and GridShib. UABgrid includes GridWay, a grid scheduler that manages computational workflows across HPC resources located in the UAB Shared Computing Facility, the Department of Computer and Information Sciences (CIS), or a collaborator's compute facility. UABgrid leverages the identity infrastructure of InCommon to support universal access to shared resources by members of UAB and their collaborators at other institutions. UABgrid hosts web collaboration tools to help collaborations get operational quickly. HPC Services, in collaboration with CIS and Mechanical Engineering, has provided educational programs like the 2007 Grid Computing Boot Camp and is exploring the expansion of our campus model for HPC to other campuses of the UA System and the Alabama Supercomputing Authority.
 * UABgrid Collaboration Environment - UABgrid connects researchers to high performance computing and collaboration resources on campus, within the state, and around the globe. It provides uniform access to local applications and HPC clusters and integrates with the application platforms of SURAgrid, TeraGrid and caBIG.


 * InCommon Federated Identity - InCommon is a federation of higher-education institutions in the US that have standardized the interface to their identity management systems. This enables BlazerID-based access to resources hosted by member institutions, an increasing collection of affiliated vendor applications, like WebAssign, and grant management interfaces for the NSF and NIH.  The ability to cross organization boundaries using a consistent identifier can simplify many collaboration scenarios.


 * Professional Collaborations - HPC Services is an engaged participant in technology groups in Alabama, across the region, the US, and Europe. Activities include the UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURAgrid, EDUCAUSE, Open Grid Forum, and SuperComputing.

HPC Services is available to help you integrate these technologies with your research initiatives.

For more information please contact Bob Cloud ([mailto:recloud@uab.edu recloud@uab.edu], 996-5707), David Shealy ([mailto:dls@uab.edu dls@uab.edu], 934-8068) or John-Paul Robinson ([mailto:jpr@uab.edu jpr@uab.edu], 975-0124).

HPC Project Five Year Plan as of Summer 2006
As a result of discussions between IT, CIS, and ETL to determine the best methods and associated costs to interconnect HPC clusters in campus buildings BEC and CH, a preliminary draft of the scope and five-year plan for HPC at UAB was prepared. To ensure the growth and stability of IT support for research computing, and to win wide support among academic researchers for a workable model, the mission of IT Academic Computing has been revised and the group merged into a more focused unit within IT Network & Infrastructure Services under the name HPC Services. See Office of VP of IT Organization Chart.
 * Scope: Building upon the existing UAB HPC resources in CIS and ETL, IT and campus researchers are setting a goal to establish a UAB HPC data center, whose operations will be managed by IT Infrastructure and which will include additional machine room space designed for HPC and equipped with a new cluster. The UAB HPC Data Center and HPC resource will be used by researchers throughout UAB, the UA System, and other State of Alabama universities and research entities in conjunction with the Alabama Supercomputer Authority. Oversight of the UAB HPC resources will be provided by a committee made up of UAB Deans, Department Heads, Faculty, and the VPIT. Daily administration of this shared resource will be provided by the Department of Network and Infrastructure Services.
 * Integrate the design, construction, and staffing of an HPC Data Center with overall IT plans.
 * Secure funding for a new xxxxTeraFlop HPC Cluster. For example, HPCS will continue working with campus researchers in submitting proposals.
 * Preliminary Timeline
 * FY2007: Rename Academic Computing to HPCS and merge HPCS with Network and Infrastructure to leverage the HPC-related talents and resources of both organizations.
 * FY2007: Connect existing HPC Clusters to each other and 10Gig backbone.
 * FY2007: Bring up pilot grid identity management system – GridShib (HPCS, Network/Services)
 * FY2007: Enable Grid Meta Scheduling (HPCS, CIS, ETL)
 * FY2007: Establish Grid connectivity with SURA, UAS, and ASA.
 * FY2007: Develop shared HPC resource policies.
 * FY2008: Increase support staff as needed by reassigning legacy Mainframe technical resources
 * FY2008: Develop requirements for expansion or replacement of older HPC clusters. xxxxTeraFlops.
 * FY2008: Using HPC requirements (xxxx TeraFlops) for Data Center Design, begin design of HPC Data Center.
 * FY2009: Secure Funding for new HPC Cluster xxxxTera Flops
 * FY2010: Complete HPC Data Center Infrastructure.
 * FY2010: Secure final funding for expansion or replacement of older HPC clusters.
 * FY2011: Procure and deploy new HPC cluster. xxxxTeraFlops.

Goals for FY2007

 * GOAL 1: UAB Grid Computing Project
 * Bring up a pilot of grid identity management based on GridShib software, which incorporates Shibboleth into the core grid software, Globus;
 * Enable a grid meta-scheduling capability in collaboration with CIS and ETL so that UAB users will see a single interface for submission of HPC jobs running on primary clusters in ETL and CIS;
 * Explore expanding the campus model for HPC to other campuses of UA System and to the Alabama Supercomputing Center.
 * GOAL 2: InCommon / Shibboleth Project
 * Work with Infrastructure and Network Services to coordinate new and expanding campus applications using Shibboleth;
 * Evaluate establishing a second pilot Shibboleth application with other members of InCommon;
 * Establish UABgrid as a UAB application offered to InCommon members; and
 * Evaluate establishing pilot Shibboleth applications as an advanced technology demonstration of capabilities for inter-institutional user authentication and authorization for access to common workspace supporting calendar, document sharing, data sharing, and communication technologies for desktop.
 * GOAL 3: Participation in External IT Groups within Alabama, the Region, and the US, such as UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURAgrid, EDUCAUSE, Global Grid Forum, and Supercomputing

Accomplishments for FY2007

 * GOAL 1: UAB Grid Computing Project
 * Bring up a pilot of grid identity management based on GridShib software, which incorporates Shibboleth into the core grid software, Globus;
 * IdM equipment ordered and operational - May 9, 2007
 * GridShib installed - May 25, 2007
 * UABgrid Login service operational - June 19, http://uabgrid.uab.edu/login
 * UABgrid VO management service operational - target July 1
 * UABgrid GridShib CA migration operational - target July 17
 * Enable a grid meta-scheduling capability in collaboration with CIS and ETL so that UAB users will see a single interface for submission of HPC jobs running on primary clusters in ETL and CIS;
 * SURA talk and demonstration - The GridWay meta-scheduler and an example research application, DynamicBLAST, were demonstrated at the SURAgrid all-hands meeting in collaboration with CIS
 * UABgrid meta-scheduler operational - target July 17
 * UABgrid Boot Camp being scheduled for mid-August
 * Explore expanding the campus model for HPC to other campuses of UA System and to the Alabama Supercomputing Center.


 * GOAL 2: InCommon / Shibboleth Project
 * Work with Infrastructure and Network Services to coordinate new and expanding campus applications using Shibboleth;
 * Evaluate establishing a second pilot Shibboleth application with other members of InCommon;
 * Establish UABgrid as a UAB application offered to InCommon members; and
 * UABgrid InCommon Application draft has been circulated for review and comment.
 * Evaluate establishing pilot Shibboleth applications as an advanced technology demonstration of capabilities for inter-institutional user authentication and authorization for access to common workspace supporting calendar, document sharing, data sharing, and communication technologies for desktop.
 * This is the research collaboration focus of UABgrid
 * GOAL 3: Participation in External IT Groups within Alabama, the Region, and the US, such as UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURAgrid, EDUCAUSE, Global Grid Forum, and Supercomputing
 * Meetings attended since Oct 1, 06: SC06, Internet2 Fall 06, SURAgrid All Hands (March), Internet2 Spring 07
 * SURAgrid Governance: John-Paul Robinson has been elected to serve a one-year term on the inaugural SURAgrid GC
 * SURAgrid working group: John-Paul Robinson is serving on the accounting systems working group
 * CI-Team proposals: David L Shealy was a senior scientist on the large collaborative proposal submitted to NSF by Texas Tech University to present three two-day workshops on grid computing
 * UAB Research Computing plans
 * Developed IT CyberInfrastructure presentation for ASA campus visit on April 3, 2007
 * Circulated IT research computing planning draft to the Office of VP of Research and Economic Development

Campus Network

 * On Campus High Speed Network Connectivity: The core of the campus network is a centrally managed backbone composed of ring-protected, enterprise-class 10-Gigabit Ethernet routers supporting the IP, IP Multicast, IPX, and AppleTalk protocols. All buildings on campus are connected to one of three communication hubs using optical fiber. Within buildings, Category 5 or higher unshielded twisted pair wiring connects desktops to the network. A Gigabit Ethernet building backbone over multimode optical fiber is used for multifloor buildings. Computer server clusters are connected to the building entrance using Gigabit Ethernet. Each floor contains one or more switches connected to the building backbone using Gigabit Ethernet. Desktops are connected at 10 or 100 megabits/second (gigabit available when needed).
 * UAB is a charter member of Internet2 and hosts one of the nodes of the Gulf Central GigaPoP and the Alabama Research and Education Network (AREN). The UA System campuses (UA, UAB, UAH) share two OC-3s of an OC-12 link from Birmingham to Atlanta to connect to Southern Cross Roads for I2 connectivity. The Alabama Regional Optical Network (ARON) is a dedicated (dark-fiber) dense wavelength division multiplexed (DWDM) network currently under construction and scheduled for completion in 2007. Owned and operated by the University of Alabama System under a contract agreement with Georgia Tech and the Southern Light Rail (SLR), ARON connects the University of Alabama’s three research institutions to the National LambdaRail (NLR) and will replace the current Internet2 connections. UAB NLR connectivity is expected by late 2007.
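
The share of the Birmingham to Atlanta link described above can be put in numbers. The sketch below assumes the standard SONET line rates, where OC-n runs at n times the 51.84 Mbit/s OC-1 base rate; these are raw line rates, and usable payload bandwidth is somewhat lower.

```python
# SONET optical carrier line rates are integer multiples of the OC-1 base rate.
OC1_MBPS = 51.84  # standard OC-1 line rate in Mbit/s


def oc_rate_mbps(n: int) -> float:
    """Return the line rate of an OC-n circuit in Mbit/s."""
    return n * OC1_MBPS


shared = 2 * oc_rate_mbps(3)  # the two OC-3s shared by UA, UAB, and UAH
link = oc_rate_mbps(12)       # the full OC-12 between Birmingham and Atlanta

print(f"Two OC-3s: {shared:.2f} Mbit/s")
print(f"OC-12:     {link:.2f} Mbit/s")
print(f"Share of the OC-12: {shared / link:.0%}")
```

Under these rates the three campuses share about 311 Mbit/s, half of the OC-12.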

Research Network
The UAB Research Network is currently a dedicated 10GE optical connection between the Shared HPC Facility and the Computer Science HPC Lab. It will be used for staging grid-based compute jobs and will allow direct connection to high-bandwidth regional networks. This network provides very high speed, secure connectivity between the existing HPC clusters in Engineering and Computer Science, as well as high speed transfer of very large data sets between clusters without interfering with other traffic on the campus backbone. The dedicated connection also guarantees a predictable latency between the clusters. See the figure at the right for a logical view of this network and the associated services.
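
To give a rough sense of what the dedicated 10GE link means for moving very large data sets between clusters, the sketch below estimates an ideal transfer time. The 500 GB data set size is a hypothetical example, and the calculation ignores protocol overhead, disk speed, and contention, so real transfers will take longer.

```python
def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Ideal time in seconds to move size_gb (decimal gigabytes) over a link.

    efficiency is the assumed fraction of the line rate actually achieved.
    """
    bits = size_gb * 8e9
    return bits / (link_gbps * 1e9 * efficiency)


dataset_gb = 500.0  # hypothetical data set size, not a figure from this page

for name, gbps in [("Gigabit Ethernet", 1.0), ("10GE research link", 10.0)]:
    minutes = transfer_seconds(dataset_gb, gbps) / 60
    print(f"{name}: about {minutes:.1f} minutes at full line rate")
```

The `efficiency` parameter can be lowered (for example to 0.7) to model a link that sustains only part of its nominal rate.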

Grid Computing
There are two groups on campus working in the area of grid computing. The first group, led by Professor Bangalore (NS&M/CIS), focuses on basic research in grid computing, distributed computing, and web-based computing within the Collaborative Computing Laboratory (CCL). The second group is within UAB IT and has developed an applied grid computing project known as UABgrid over the past 4 years as a result of UAB participation in the NSF/SURA NMI Testbed project (D.L. Shealy, PI). UABgrid is the campus infrastructure for computation and collaboration in the grid environment. During FY07, new functionality will be added: a Shibboleth-based identity management capability, which facilitates external research collaborations, and a meta-scheduler, which allows scheduling of HPC jobs over multiple clusters. Both new capabilities of UABgrid are being demonstrated in collaboration with UAB CCL and SURAgrid and presented at Internet2 and grid conferences during 2007.


 * Middleware Tools: Tools available through UABgrid include: Globus Toolkit, GridShib for Globus, Ganglia, GridWay metascheduler, MyProxy, UABgrid CA, GridSphere portal, Shibboleth, myVoc box management node, and GridFTP.
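
The meta-scheduling idea described in this section, one submission interface in front of several clusters, can be illustrated with a toy sketch. This is not GridWay's algorithm or API: the "most free slots" policy, the class and function names, and the busy-slot figures are all hypothetical, and the slot totals simply assume one slot per processor on dual-processor nodes.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    total_slots: int
    busy_slots: int

    @property
    def free_slots(self) -> int:
        return self.total_slots - self.busy_slots


def pick_cluster(clusters, slots_needed):
    """Return the cluster with the most free slots that can fit the job, or None."""
    candidates = [c for c in clusters if c.free_slots >= slots_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_slots)


def submit(clusters, job_name, slots_needed):
    """Place a job on the chosen cluster and return that cluster's name (None if it does not fit)."""
    target = pick_cluster(clusters, slots_needed)
    if target is None:
        return None
    target.busy_slots += slots_needed
    return target.name


# Hypothetical load figures, for illustration only.
clusters = [
    Cluster("cheaha", total_slots=256, busy_slots=200),
    Cluster("coosa", total_slots=128, busy_slots=20),
    Cluster("cis", total_slots=256, busy_slots=250),
]

for job, slots in [("blast-1", 64), ("blast-2", 50), ("blast-3", 100)]:
    print(job, "->", submit(clusters, job, slots))
```

A production meta-scheduler also has to handle authentication, data staging, queue wait times, and job failures, none of which this sketch attempts.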

High Performance Computing
The UAB Shared High Performance Computing Facility provides UAB researchers with UAB-wide shared software and hardware infrastructure and support for high performance parallel and distributed computing, numerical tools, information technology-based computing environments, and computational simulation. The facility is now a joint IT and multi-school initiative in its use, support, and funding; it was initially jump-started by the School of Engineering in collaboration with the Schools of Medicine and Public Health. The current combined HPC performance of the facility is about 2.2 Teraflops. The facility is equipped with the following:
 * IBM BlueGene L cluster with 2048 700 MHz processors, each with 512 MB of memory. The system has 13 terabytes of storage. This cluster should benchmark at 4.5 to 5 Teraflops.
 * DELL Xeon 64-bit Linux Cluster (CHEAHA), which consists of 128 DELL PE1425 nodes with dual Xeon 3.6GHz processors and either 2GB or 6GB of memory per node. It uses a Gigabit Ethernet inter-node network connection. There are 4 Terabytes of disk storage available to this cluster. This cluster is rated at more than 1.1 Teraflops computing capacity.
 * Verari Opteron 64-bit Linux Cluster (COOSA), which is a 64-node computing cluster with dual AMD Opteron 242 processors and 2GB of memory per node. The nodes are interconnected with a Gigabit Ethernet network.
 * IBM Linux Cluster (CAHABA), a highly scalable Linux cluster solution for high performance and commercial computing workloads. It is built from IBM x335 Series servers with a total of 128 processors (64 nodes, dual Xeon 2.4GHz, 2 to 4GB of memory per node) and a 1 Terabyte storage unit. The nodes are interconnected with a Gigabit network.
 * Supermicro Xeon 32-bit Linux Cluster, which is a 10-node visualization cluster of Supermicro computers with dual Xeon 2.4GHz processors, 2GB of memory per node, and 3 Terabytes of cumulative disk space.
 * DNP Holo Screen Display (60”), a transparent display which allows viewers to look at and see through the screen and makes the image appear suspended in mid-air. It gives an impression of almost-3D depth.
 * Passive Stereoscopic Display System (VisBox), which is a one-wall, fully integrated, projection-based VR system with head-tracking and stereo display. The screen is 10 feet diagonal, which makes it significantly more immersive than other much more expensive systems.  The VisBox uses high-end LINUX PCs and bright projectors.  The footprint of a VisBox is 8’x8’, and it is a few inches shy of being 8 feet tall, making it close to an 8’x8’x8’ cube.  With this system, researchers can visualize their data in a stereoscopic virtual environment. This display system is a passive stereo display system in an all-in-one unit with 2 polarizing LCD projectors and 2 mirrors, precision-mounted in a custom frame.  A Linux PC drives this system with a high-end dual-headed graphics card.  Users wear lightweight, inexpensive polarized eyeglasses and see a stereoscopic image.
 * Tiled Display Wall System (VisWall) (8'x8' and 3x3 configurations) is capable of a combined screen resolution of 3000x2300 pixels. It provides researchers with a display solution to visualize data or images at an ultra-high resolution. A high-end dual-processor LINUX cluster and nVidia graphics cards are used to drive the graphics applications. This is a scalable solution, which means that we can expand the number of tiles to m x n to increase the combined resolution as the budget permits. A 10-node dual processor Linux cluster drives this nine-tile visualization wall. The software synchronizes images at the tile interface. This provides an ultra-resolution visualization capability for very large-scale images/data. A Linux PC console communicating through a high-speed Myrinet network drives the VisWall. Each computer is connected to a projector that contributes 1024x768 screen resolution to the overall projection area.
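
The VisWall figures above can be checked with simple arithmetic: an m x n grid of projectors multiplies the per-tile resolution in each direction. The sketch below is only an illustration and assumes the tiles butt together with no overlap or edge blending, which is one possible reason the quoted combined resolution (3000x2300) is slightly below the raw product.

```python
def wall_resolution(cols: int, rows: int, tile_w: int, tile_h: int) -> tuple:
    """Combined pixel dimensions of a cols x rows wall of identical tiles (no overlap)."""
    return cols * tile_w, rows * tile_h


# The nine-tile VisWall: a 3x3 grid of 1024x768 projectors.
width, height = wall_resolution(3, 3, 1024, 768)
print(f"{width}x{height} pixels, {width * height / 1e6:.2f} megapixels")

# Expanding to a hypothetical 4x3 grid with the same projectors.
print(wall_resolution(4, 3, 1024, 768))
```

For the 3x3 wall this gives 3072x2304 pixels, consistent with the roughly 3000x2300 quoted above.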

The Department of Computer and Information Sciences (CIS) provides access to high performance computing resources including:
 * 128 node compute cluster which has been benchmarked at 1.41 Teraflops. Each node consists of two 3.2GHz Intel Xeon processors with 4GB of memory. The cluster includes a low-latency Infiniband network fabric in addition to a secondary gigabit ethernet network fabric. It also provides 4 terabytes of storage managed by the Ibrix parallel filesystem.
 * 32 node dual-processor 1.6 GHz Opteron cluster with each node having 2 GB of RAM, 80GB of hard drive, and gigabit Ethernet connected with gigabit switch. Out of the 32 nodes, 4 nodes have 4 GB RAM and 160GB hard drive.
 * A 13 megapixel, nine tile visualization wall that measures approximately 10' wide by 8' high. The unified image is created by 9 DLP projectors connected via optical DVI cabling to 5 rendering nodes in a visualization cluster.  The nodes consist of dual-core, dual-processor AMD Opteron 270 servers with dual-head NVIDIA GeForce 7800 graphics adapters and 4GB of memory. A sixth similarly-configured node serves as the head node which contains 8GB of memory and serves as the master node for the visualization cluster.
 * All the clusters are part of the CIS Grid Node, which is part of the distributed campus-wide computational infrastructure - UABgrid.
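
Benchmark figures such as the 1.41 Teraflops quoted for the 128-node CIS cluster are usually read against a theoretical peak. The sketch below shows that arithmetic. The value of 2 floating-point operations per clock cycle per processor is an assumption about this processor generation, not a figure from this page, so the resulting peak and efficiency are illustrative only.

```python
def peak_gflops(nodes: int, procs_per_node: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlops: processors x clock rate x operations per cycle."""
    return nodes * procs_per_node * clock_ghz * flops_per_cycle


# CIS cluster: 128 nodes, two 3.2GHz Xeon processors per node.
FLOPS_PER_CYCLE = 2  # assumed value, not stated in the source
peak = peak_gflops(128, 2, 3.2, FLOPS_PER_CYCLE)
benchmark = 1410.0   # 1.41 Teraflops benchmark quoted above, in GFlops

print(f"Theoretical peak: {peak:.1f} GFlops")
print(f"Benchmark / peak: {benchmark / peak:.0%}")
print(f"Aggregate memory: {128 * 4} GB")  # 4GB per node
```

The same function can be applied to the other clusters listed on this page once an operations-per-cycle figure for each processor type is known.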

Off-campus Resources

 * ASA/AREN: Alabama Supercomputer Center (http://www.asc.edu) provides UAB investigators with access to a variety of high performance computing resources. These resources include:
 * An SGI Altix 350 supercomputer with 144 CPU cores, 592 GB of shared memory, and 10.8 terabytes in the CXFS file system. Each CPU is a 64 bit Intel Itanium 2 processor. The system consists of SGI Altix 350 nodes with 1.4 GHz processors, Altix 350 nodes with 1.5 GHz processors, and Altix 450 nodes with dual core 1.6 GHz processors. This gives the entire system a floating point performance of 776 GigaFLOPS. Sets of 6 to 72 CPUs are grouped together into shared memory nodes. There are multiple networks connecting the processors. These include NUMAlink for sharing memory, a fiber channel switch for CXFS file system data, gigabit ethernet for internet connectivity, and a secondary ethernet connection as a redundant failover and management network.
 * A Cray XD1 supercomputer with 144 CPUs and 240 gigabytes of distributed memory. Each compute node has a local disk (60 Gigabytes of which are accessible as /tmp). Also attached to the XD1 is a high performance Fiber Channel RAID Array (600MB/s) running the Lustre Filesystem, which has 3 Terabytes of high performance storage accessible as /scratch from each node. Home directories as well as third party applications use the NFS Filesystem and share 4 Terabytes of Fiber Channel RAID storage.
 * A large number of software packages are installed supporting a variety of analyses including programs for Design Analysis, Quantum Chemistry, Molecular Mechanics/Dynamics, Crystallography, Fluid Dynamics, Statistics, Visualization, and Bioinformatics.

Tools and Support
 * Internet2/NLR
 * Alabama RON
 * SURA