Cheaha:Community Portal


HPC Services Plans

Mission

HPC Services is the division within the IT Infrastructure Services organization focused on HPC support for research and other HPC activities. HPC Services support includes HPC Cluster Support, Networking & Infrastructure, Middleware, and Academic Research Support. Here, research support means specifically assisting or collaborating with grant activities that require IT resources. It may also include acquiring and managing high performance computing resources, such as Beowulf clusters and network storage arrays. HPC Services participates in institutional strategic planning and self-study as related to academic IT, and represents the Office of the Vice President of Information Technology to IT-related academic campus committees and to regional/national technology research organizations and committees as requested.

Note: The term HPC is used to mean high performance computing, which has many definitions available on the web. At UAB, HPC generally refers to “computational facilities substantially more powerful than current desktop computers (PCs and workstations) … by an order of magnitude or better.” See http://parallel.hpc.unsw.edu.au/rks/docs/hpc-intro/node3.html for more description of this usage of HPC.

HPC Project Five Year Plan as of Summer 2006

As a result of discussions between IT, CIS, and ETL to determine the best methods and associated costs to interconnect HPC clusters in the campus buildings BEC and CH, a preliminary draft of the scope and five year plan for HPC at UAB was prepared. In order to ensure the growth and stability of IT support for research computing, and to obtain wide support from academic researchers for a workable model, the mission of IT Academic Computing has been revised and merged into a more focused unit within IT Network & Infrastructure Services under the name HPC Services. See the Office of VP of IT Organization Chart.

  • Scope: Building upon the existing UAB HPC resources in CIS and ETL, IT and campus researchers are setting a goal to establish a UAB HPC data center, whose operations will be managed by IT Infrastructure and which will include additional machine room space designed for HPC and equipped with a new cluster. The UAB HPC Data Center and HPC resources will be used by researchers throughout UAB, the UA System, and other State of Alabama universities and research entities in conjunction with the Alabama Supercomputer Authority. Oversight of the UAB HPC resources will be provided by a committee made up of UAB Deans, Department Heads, Faculty, and the VPIT. Daily administration of this shared resource will be provided by the Department of Network and Infrastructure Services.
  • Integrate the design, construction, and staffing of an HPC Data Center with overall IT plans.
  • Secure funding for a new xxxx-TeraFlop HPC cluster; for example, HPCS will continue working with campus researchers on submitting proposals.
  • Preliminary Timeline
    • FY2007: Rename Academic Computing to HPC Services (HPCS) and merge HPCS with Network and Infrastructure to leverage the HPC-related talents and resources of both organizations.
    • FY2007: Connect the existing HPC clusters to each other and to the 10Gig backbone.
    • FY2007: Bring up a pilot grid identity management system – GridShib (HPCS, Network/Services)
    • FY2007: Enable Grid Meta Scheduling (HPCS, CIS, ETL)
    • FY2007: Establish Grid connectivity with SURA, UAS, and ASA.
    • FY2007: Develop shared HPC resource policies.
    • FY2008: Increase support staff as needed by reassigning legacy Mainframe technical resources.
    • FY2008: Develop requirements for the expansion or replacement of older HPCs (xxxx TeraFlops).
    • FY2008: Using the HPC requirements (xxxx TeraFlops) for Data Center design, begin design of the HPC Data Center.
    • FY2009: Secure funding for the new HPC cluster (xxxx TeraFlops).
    • FY2010: Complete the HPC Data Center infrastructure.
    • FY2010: Secure final funding for the expansion or replacement of older HPCs.
    • FY2011: Procure and deploy the new HPC cluster (xxxx TeraFlops).

HPC Services Goals and Accomplishments for FY2007

Goals for FY2007

  • GOAL 1: UAB Grid Computing Project
    • Bring up a pilot of grid identity management based on GridShib software, which incorporates Shibboleth into the core grid software, Globus;
    • Enable a grid meta-scheduling capability in collaboration with CIS and ETL so that UAB users will see a single interface for submission of HPC jobs running on primary clusters in ETL and CIS;
    • Explore expanding the campus model for HPC to other campuses of UA System and to the Alabama Supercomputing Center.
  • GOAL 2: InCommon / Shibboleth Project
    • Work with Infrastructure and Network Services to coordinate new and expanding campus applications using Shibboleth;
    • Evaluate establishing a second pilot Shibboleth application with other members of InCommon;
    • Establish UAB grid as a UAB application offered to InCommon members; and
    • Evaluate establishing pilot Shibboleth applications as an advanced technology demonstration of capabilities for inter-institutional user authentication and authorization for access to common workspace supporting calendar, document sharing, data sharing, and communication technologies for desktop.
  • GOAL 3: Participation in External IT Groups within Alabama, Region and US, such as, UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURA grid, EDUCAUSE, Global Grid Forum, and Super-Computing
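The meta-scheduling goal above can be illustrated with a sketch of a GridWay job template, the format the GridWay meta-scheduler (listed later among UABgrid's middleware tools) uses to describe a job independently of the cluster that ultimately runs it. The file name, paths, and requirement values below are illustrative assumptions, not actual UABgrid settings:

```
# blast.jt – hypothetical GridWay job template (all values assumed for illustration)
EXECUTABLE   = /usr/local/bin/blastall
ARGUMENTS    = -p blastn -i query.fa -o query.out
INPUT_FILES  = query.fa
OUTPUT_FILES = query.out
STDOUT_FILE  = blast.out
STDERR_FILE  = blast.err
REQUIREMENTS = HOSTNAME = "*.uab.edu"
RANK         = CPU_MHZ
```

Submitting such a template (e.g. with `gwsubmit -t blast.jt`) lets GridWay match the job to any cluster satisfying the REQUIREMENTS expression, which is the "single interface for submission of HPC jobs" behavior described above.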

Accomplishments for FY2007

  • GOAL 1: UAB Grid Computing Project
    • Bring up a pilot of grid identity management based on GridShib software, which incorporates Shibboleth into the core grid software, Globus;
      • IdM equipment order and operational, May 9, 2007
      • GridShib installed - May 25, 2007
      • UABgrid Login service operational – June 19, 2007, http://uabgrid.uab.edu/login
      • UABgrid VO management service operational - target July 1
      • UABgrid GridShib CA migration operational - target July 17
    • Enable a grid meta-scheduling capability in collaboration with CIS and ETL so that UAB users will see a single interface for submission of HPC jobs running on primary clusters in ETL and CIS;
      • SURA talk and demonstration – The GridWay meta-scheduler and an example research application, DynamicBLAST, were demonstrated at the SURAgrid all-hands meeting in collaboration with CIS
      • UABgrid meta-scheduler operation - target July 17
      • UABgrid Boot Camp being scheduled for mid-August
    • Explore expanding the campus model for HPC to other campuses of UA System and to the Alabama Supercomputing Center.
  • GOAL 2: InCommon / Shibboleth Project
    • Work with Infrastructure and Network Services to coordinate new and expanding campus applications using Shibboleth;
    • Evaluate establishing a second pilot Shibboleth application with other members of InCommon;
    • Establish UAB grid as a UAB application offered to InCommon members; and
    • A draft of the UABgrid InCommon application has been circulated for review and comment.
    • Evaluate establishing pilot Shibboleth applications as an advanced technology demonstration of capabilities for inter-institutional user authentication and authorization for access to common workspace supporting calendar, document sharing, data sharing, and communication technologies for desktop.
    • This is the research collaboration focus of UABgrid.
  • GOAL 3: Participation in External IT Groups within Alabama, Region and US, such as, UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURA grid, EDUCAUSE, Global Grid Forum, and Super-Computing
    • Meetings attended since Oct 1, 2006: SC06, Internet2 Fall 06, SURAgrid All Hands (March 07), Internet2 Spring 07
    • SURAgrid Governance: John-Paul Robinson has been elected to serve a one-year term on the inaugural SURAgrid GC
    • SURAgrid working group: John-Paul Robinson is serving on the accounting systems working group
    • CI-Team proposals: David L. Shealy was a senior scientist on the large collaborative proposal submitted to NSF by Texas Tech University to present three two-day workshops on grid computing
    • UAB Research Computing plans
    • Developed IT CyberInfrastructure presentation for ASA campus visit on April 3, 2007
    • Circulated IT research computing planning draft to the Office of VP of Research and Economic Development

HPC Services Goals and Accomplishments for FY2008

HPC Services Goals for FY2008

  • GOAL 1: UAB Grid Computing Project

    • ASA-UABgrid Pilot Project, with the goal of UAB and ASA users being able to submit jobs via UABgrid to HPC resources located at either ASC or UAB:
      • Ensure both JPR and ASA staff understand each other's overall job submission processes.
      • Install Globus software on the ASA test system and join UABgrid before SC07.
      • Install Globus software on the new ASA HPC cluster and join UABgrid by Dec 31, 2007.
      • Develop shared HPC resource policies.
    • Collaborative Tools:
      • Core UABgrid software: move the UABgrid GridShib CA into operational status.
      • Increase the number of collaborative tools available to UABgrid users: implement GridSphere on UABgrid – target dd MM YY.
    • Applications: R-group – establish a workflow for the SSG at UAB to use UABgrid for submission of R jobs to HPC clusters at UAB, ASA, and other HPC centers.
    • Explore expanding the campus model for HPC to other campuses of the UA System.
    • Use the 10GE research network for meta-scheduling support services.

  • GOAL 2: InCommon / Shibboleth Project
    • Work with Infrastructure and Network Services to coordinate new and expanding campus applications using Shibboleth;
    • Finalize parameter releases as part of the UABgrid application offered to members of InCommon.

  • GOAL 3: Participation in External IT Groups within Alabama, the Region, and the US, such as UA System Collaborative Technology activities, Alabama Regional Optical Network, Internet2, SURAgrid, EDUCAUSE, Global Grid Forum, and Super-Computing
    • Meetings attended after Oct 1, 2007: SC|07, Internet2 Fall 07, SURAgrid All Hands (March), Internet2 Spring 08
    • SURAgrid Governance: JPR has been elected to serve a one-year term on the inaugural SURAgrid GC
    • SURAgrid working group: JPR is serving on the accounting systems working group
  • GOAL 4: Support the IT Strategic Plan for Research
    • Determine how to meet gaps in applications, such as bioinformatics
  • GOAL 5: Staffing request
    • 2 graduate and 1 undergraduate IT interns
    • Fill the open Programmer/Analyst II position formed during Q1.
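The job-submission step of the ASA-UABgrid pilot described under GOAL 1 can be sketched with a classic Globus pre-WS GRAM job description in RSL, the format consumed by the `globusrun` client. The values below are hypothetical placeholders, not actual ASA or UAB job settings:

```
(* job.rsl – hypothetical pre-WS GRAM job description *)
& (executable = /bin/hostname)   (* program to run on the remote cluster *)
  (count = 4)                    (* number of processes requested *)
  (jobType = single)
  (stdout = hostname.out)
```

A submission would then look like `globusrun -r gatekeeper.example.edu/jobmanager-pbs -f job.rsl`, where the gatekeeper contact string is a hypothetical placeholder; installing Globus on the ASA systems, as planned above, is what makes them reachable through such contact strings.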

Research Computing Web Pages

Campus Network

  • On Campus High Speed Network Connectivity: The core of the campus network is a centrally managed backbone composed of ring-protected, enterprise-class 10-Gigabit Ethernet routers supporting the IP, IP Multicast, IPX, and AppleTalk protocols. All buildings on campus are connected to one of three communication hubs using optical fiber. Within buildings, Category 5 or higher unshielded twisted pair wiring connects desktops to the network. A Gigabit Ethernet building backbone over multimode optical fiber is used for multifloor buildings. Computer server clusters are connected to the building entrance using Gigabit Ethernet. Each floor contains one or more switches connected to the building backbone using Gigabit Ethernet. Desktops are connected at 10 or 100 megabits/second (gigabit available when needed).
  • UAB is a charter member of Internet2 and hosts one of the nodes of the Gulf Central GigaPoP and the Alabama Research and Education Network (AREN). The UA System campuses (UA, UAB, UAH) share two OC-3s of an OC-12 link of bandwidth from Birmingham to Atlanta to connect to Southern Crossroads for I2 connectivity. The Alabama Regional Optical Network (ARON) is a dedicated (dark-fiber) dense wavelength division multiplexed (DWDM) network currently under construction and scheduled for completion in 2007. Owned and operated by the University of Alabama System, under a contract agreement with Georgia Tech and the Southern Light Rail (SLR), ARON connects the University of Alabama System's three research institutions to the National LambdaRail (NLR) and will replace the current Internet2 connections. UAB NLR connectivity is expected by late 2007.

Research Network

The UAB Research Network is currently a dedicated 10GE optical connection between the Shared HPC Facility and the Computer Science HPC Lab; it will be leveraged for staging grid-based compute jobs and will allow direct connection to high-bandwidth regional networks. This network allows very high speed, secure connectivity between the existing HPC clusters in Engineering and Computer Science, as well as high-speed transfer of very large data sets between clusters, without the concern of interfering with other traffic on the campus backbone. This dedicated connection also guarantees a predictable latency between the clusters.
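The value of a dedicated 10GE link for moving very large data sets can be seen with back-of-the-envelope arithmetic: transfer time is simply payload bits divided by the effective link rate. A minimal sketch in Python, assuming idealized throughput (the `efficiency` factor is a stand-in for real-world protocol overhead, which tools such as GridFTP's parallel streams try to minimize):

```python
def transfer_time_seconds(size_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Idealized transfer time: payload bits divided by the effective link rate."""
    return size_bytes * 8 / (link_bps * efficiency)

one_tb = 1e12  # 1 terabyte in bytes

# Dedicated 10GE research link vs. a shared 1GE path (both assumed ideal).
t_10ge = transfer_time_seconds(one_tb, 10e9)
t_1ge = transfer_time_seconds(one_tb, 1e9)
print(f"1 TB over 10GE: {t_10ge / 60:.1f} minutes")  # 800 s, about 13.3 minutes
print(f"1 TB over 1GE:  {t_1ge / 3600:.1f} hours")   # 8000 s, about 2.2 hours
```

Even under ideal conditions the order-of-magnitude difference is what makes staging multi-terabyte grid jobs practical on the dedicated link.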

Grid Computing

There are two groups on campus working in the area of grid computing. The first group, led by Professor Bangalore (NS&M/CIS), focuses on basic research in grid computing, distributed computing, and web-based computing within the Collaborative Computing Laboratory (CCL). The second group is within UAB IT, which has developed an applied grid computing project known as UABgrid over the past 4 years as a result of UAB's participation in the NSF/SURA NMI Testbed project (D.L. Shealy, PI). UABgrid is the campus infrastructure for computation and collaboration in the grid environment. During FY07, new functionality will be added, including a Shibboleth-based identity management capability, which facilitates external research collaborations, and a meta-scheduler, which allows scheduling of HPC jobs over multiple clusters. Both of these new capabilities of UABgrid are being demonstrated in collaboration with UAB CCL and SURAgrid and presented at Internet2 and grid conferences during 2007.

  • Middleware Tools: Tools available through UABGrid include: Globus Toolkit, GridShib for Globus, Ganglia, GridWay metascheduler, MyProxy, UABGrid CA, GridSphere portal, Shibboleth, myVoc box management node, and GridFTP.

High Performance Computing

The UAB Shared High Performance Computing Facility provides UAB-wide shared software and hardware infrastructure and support for high performance parallel and distributed computing, numerical tools and information technology-based computing environments, and computational simulation to UAB researchers. The facility is now a jointly used, supported, and funded IT and multi-school initiative, initially jump-started by the School of Engineering in collaboration with the Schools of Medicine and Public Health. The current combined HPC performance of the facility is about 2.2 Teraflops. The facility is equipped with the following:

  • IBM BlueGene L cluster with 2048 700 MHz processors with 512 MB of memory in each. The system has 13 terabytes of storage. This cluster should benchmark at 4.5 to 5 Teraflops.
  • DELL Xeon 64-bit Linux Cluster (CHEAHA), which consists of 128 DELL PE1425 nodes, each with dual Xeon 3.6GHz processors and either 2GB or 6GB of memory. It uses a Gigabit Ethernet inter-node network connection. There are 4 Terabytes of disk storage available to this cluster. This cluster is rated at more than 1.1 Teraflops computing capacity.
  • Verari Opteron 64-bit Linux Cluster (COOSA), a 64-node computing cluster consisting of dual AMD Opteron 242 processors with 2GB of memory per node. The nodes are interconnected with a Gigabit Ethernet network.
  • IBM Linux Cluster (CAHABA), a highly scalable Linux cluster solution for high performance and commercial computing workloads. It is constructed from IBM x335 series servers with a total of 128 processors (64 nodes, dual Xeon 2.4GHz, 2 to 4GB of memory per node) and a 1 Terabyte storage unit. The nodes are interconnected with a Gigabit network.
  • Supermicro Xeon 32-bit Linux Cluster, a 10-node visualization cluster consisting of Supermicro computers with dual Xeon 2.4GHz processors, 2GB of memory per node, and 3 Terabytes of cumulative disk space.
  • DNP Holo Screen Display (60”), a transparent display which allows viewers to look at and see through the screen and makes the image appear suspended in mid-air. It gives an impression of almost-3D depth.
  • Passive Stereoscopic Display System (VisBox), which is a one-wall, fully integrated, projection-based VR system with head-tracking and stereo display. The screen is 10 feet diagonal, which makes it significantly more immersive than other much more expensive systems. The VisBox uses high-end LINUX PCs and bright projectors. The footprint of a VisBox is 8’x8’, and it is a few inches shy of being 8 feet tall, making it close to an 8’x8’x8’ cube. With this system, researchers can visualize their data in a stereoscopic virtual environment. This display system is a passive stereo display system in an all-in-one unit with 2 polarizing LCD projectors and 2 mirrors, precision-mounted in a custom frame. A Linux PC drives this system with a high-end dual-headed graphics card. Users wear lightweight, inexpensive polarized eyeglasses and see a stereoscopic image.
  • Tiled Display Wall System (VisWall) (8'x8' and 3x3 configurations) is capable of a combined screen resolution of 3000x2300 pixels. It provides researchers with a display solution to visualize data or images at an ultra-high resolution. A high-end dual-processor LINUX cluster and nVidia graphics cards are used to drive the graphics applications. This is a scalable solution, which means that we can expand the number of tiles to m x n to increase the combined resolution as the budget permits. A 10-node dual processor Linux cluster drives this nine-tile visualization wall. The software synchronizes images at the tile interface. This provides an ultra-resolution visualization capability for very large-scale images/data. A Linux PC console communicating through a high-speed Myrinet network drives the VisWall. Each computer is connected to a projector that contributes 1024x768 screen resolution in the overall projection area.
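The Teraflops figures quoted for these clusters can be sanity-checked with the standard theoretical-peak formula: nodes × CPUs per node × clock rate × floating-point operations per cycle. A minimal sketch in Python; the 2 FLOPs/cycle figure is an assumption typical of processors of this era, and sustained benchmark performance (such as CHEAHA's 1.1+ Teraflops rating) is always lower than theoretical peak:

```python
def peak_flops(nodes: int, cpus_per_node: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every CPU retiring flops_per_cycle operations each clock tick."""
    return nodes * cpus_per_node * clock_hz * flops_per_cycle

# CHEAHA: 128 nodes, dual 3.6 GHz Xeons (2 FLOPs/cycle assumed).
print(f"CHEAHA peak: {peak_flops(128, 2, 3.6e9, 2) / 1e12:.2f} TFLOPS")  # ~1.84 TFLOPS

# COOSA: 64 nodes, dual 1.6 GHz Opteron 242s (2 FLOPs/cycle assumed).
print(f"COOSA peak:  {peak_flops(64, 2, 1.6e9, 2) / 1e12:.2f} TFLOPS")   # ~0.41 TFLOPS
```

The gap between the ~1.84 TFLOPS theoretical peak and CHEAHA's rated 1.1+ TFLOPS illustrates why facility capacity is usually quoted from benchmarks rather than from the peak formula.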

The Department of Computer and Information Sciences (CIS) provides access to high performance computing resources including:

  • 128 node compute cluster which has been benchmarked at 1.41 Teraflops. Each node consists of two 3.2GHz Intel Xeon processors with 4GB of memory. The cluster includes a low-latency Infiniband network fabric in addition to a secondary gigabit ethernet network fabric. It also provides 4 terabytes of storage managed by the Ibrix parallel filesystem.
  • 32 node dual-processor 1.6 GHz Opteron cluster with each node having 2 GB of RAM, 80GB of hard drive, and gigabit Ethernet connected with gigabit switch. Out of the 32 nodes, 4 nodes have 4 GB RAM and 160GB hard drive.
  • A 13 megapixel, nine tile visualization wall that measures approximately 10' wide by 8' high. The unified image is created by 9 DLP projectors connected via optical DVI cabling to 5 rendering nodes in a visualization cluster. The nodes consist of dual-core, dual-processor AMD Opteron 270 servers with dual-head NVIDIA GeForce 7800 graphics adapters and 4GB of memory. A sixth, similarly configured node with 8GB of memory serves as the master (head) node for the visualization cluster.
  • All the clusters are part of the CIS Grid Node, which is part of the distributed campus-wide computational infrastructure - UABgrid.

Off-campus Resources

  • ASA/AREN: Alabama Supercomputer Center (http://www.asc.edu) provides UAB investigators with access to a variety of high performance computing resources. These resources include:
    • An SGI Altix 350 supercomputer with 144 CPU cores, 592 GB of shared memory, and 10.8 terabytes in the CXFS file system. Each CPU is a 64-bit Intel Itanium 2 processor. The system consists of SGI Altix 350 nodes with 1.4 GHz processors, Altix 350 nodes with 1.5 GHz processors, and Altix 450 nodes with dual-core 1.6 GHz processors. This gives the entire system a floating point performance of 776 GigaFLOPS. Sets of 6 to 72 CPUs are grouped together into shared memory nodes. There are multiple networks connecting the processors: NUMAlink for sharing memory, a fiber channel switch for CXFS file system data, gigabit ethernet for internet connectivity, and a secondary ethernet connection as a redundant failover and management network.
    • A Cray XD1 supercomputer with 144 CPUs and 240 gigabytes of distributed memory. Each compute node has a local disk (60 Gigabytes of which are accessible as /tmp). Also attached to the XD1 is a high performance Fiber Channel RAID array (600MB/s) running the Lustre filesystem, which provides 3 Terabytes of high-performance storage accessible as /scratch from each node. Home directories as well as third party applications use the NFS filesystem and share 4 Terabytes of Fiber Channel RAID storage.
    • A large number of software packages are installed supporting a variety of analyses including programs for Design Analysis, Quantum Chemistry, Molecular Mechanics/Dynamics, Crystallography, Fluid Dynamics, Statistics, Visualization, and Bioinformatics.
  • Internet2/NLR
  • Alabama RON
  • SURA

Tools and Support