Hello UAB Research Computing Community!
Welcome to the Research Computing System
The Research Computing System (RCS) provides a framework for sharing data, accessing compute power, and collaborating with peers on campus and around the globe. Our goal is to construct a dynamic "network of services" that you can use to organize your data, study it, and share outcomes.
'docs' (the service you are looking at while reading this text) is one of a set of core services, or libraries, available for you to organize information you gather. Docs is a wiki, an online editor to collaboratively write and share documentation. (Wiki is a Hawaiian term meaning fast.) You can learn more about docs on the page UnderstandingDocs. The docs wiki is filled with pages that document the many different services and applications available on the Research Computing System. If you see information that looks out of date please don't hesitate to ask about it or fix it.
The Research Computing System is designed to provide services to researchers in three core areas:
- Data Analysis - using the High Performance Computing (HPC) fabric we call Cheaha for analyzing data and running simulations. Many applications are already available or you can install your own
- Data Sharing - supporting the trusted exchange of information using virtual data containers to spark new ideas
- Application Development - providing virtual machines and web-hosted development tools empowering you to serve others with your research
Support and Development
The Research Computing System is developed and supported by UAB IT's Research Computing Group. We are also developing a core set of applications to help you easily incorporate our services into your research processes, along with this documentation collection to help you leverage the resources already available. We follow the best practices of the Open Source community and develop the RCS openly. You can follow our progress via our development wiki.
The Research Computing System is an outgrowth of the UABgrid pilot, launched in September 2007, which has focused on demonstrating the utility of unlimited analysis, storage, and application for research. RCS is being built on the same technology foundations used by major cloud vendors and decades of distributed systems computing research, technology that powered the last ten years of large-scale systems serving prominent national and international initiatives like the Open Science Grid, XSEDE, TeraGrid, the LHC Computing Grid, and caBIG.
The UAB IT Research Computing Group has collaborated with a number of prominent research projects at UAB to identify use cases and develop the requirements for the RCS. Our collaborators include the Center for Clinical and Translational Science (CCTS), Heflin Genomics Center, the Comprehensive Cancer Center (CCC), the Department of Computer and Information Sciences (CIS), the Department of Mechanical Engineering (ME), Lister Hill Library, the School of Optometry's Center for the Development of Functional Imaging, and Health System Information Services (HSIS).
As part of the process of building this research computing platform, the UAB IT Research Computing Group has hosted an annual campus symposium on research computing and cyber-infrastructure (CI) developments and accomplishments. Starting as CyberInfrastructure (CI) Days in 2007, the name was changed to UAB Research Computing Day in 2011 to reflect the broader mission to support research. IT Research Computing also participates in other campus-wide symposiums, including UAB Research Core Day.
Featured Research Applications
The Research Computing Group also helps support the campus MATLAB license with self-service installation documentation and supports using MATLAB on the HPC platform, providing a pathway to expand your computational power and freeing your laptop from serving as a compute platform.
The UAB IT Research Computing group, the CCTS BMI, and Heflin Center for Genomic Science have teamed up to help improve genomic research at UAB. Researchers can work with the scientists and research experts to produce a research pipeline from sequencing, to analysis, to publication.
Users of Cheaha are solely responsible for backing up their files. This includes files under /data/user, /data/project, and /home.
There is no automatic backup of any user data on the cluster in home, data, or scratch. At this time, all user data backup processes are defined and managed by each user and/or lab. Given that data backup demands vary widely between different users, groups, and research domains, this approach enables those who are most familiar with the data to make appropriate decisions based on their specific needs.
For example, if a group is working with a large shared data set that is a local copy of a data set maintained authoritatively at a national data bank, maintaining a local backup is unlikely to be a productive use of limited storage resources, since this data could potentially be restored from the authoritative source. If, however, you are maintaining a unique source of data of which yours is the only copy, then maintaining a backup is critical if you value that data set. It's worth noting that while this "uniqueness" criterion may not apply to the data you analyze, it may readily apply to the codes that define your analysis pipelines.
An often recommended backup policy is the 3-2-1 rule: maintain three copies of data, on two different media, with one copy off-site. You can read more about the 3-2-1 rule here. In the case of your application codes, using revision control tools during development provides an easy way to maintain a second copy, makes for a good software development process, and can help achieve reproducible research goals.
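As a rough illustration of the 3-2-1 rule described above, the following Python sketch checks an inventory of copies against the three conditions. The `DataCopy` class, the `satisfies_3_2_1` function, and the example inventory are all hypothetical names and values made up for this illustration; they do not describe any tool or storage location provided by Research Computing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a data set (all example values below are hypothetical)."""
    location: str   # where the copy lives
    medium: str     # kind of storage medium holding it
    offsite: bool   # True if the copy is stored away from the primary site


def satisfies_3_2_1(copies):
    """Return True when the copies meet the 3-2-1 rule:
    at least three copies, on at least two media, with at least one off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory for the source code of an analysis pipeline.
pipeline_code = [
    DataCopy("cluster project directory", "cluster disk", offsite=False),
    DataCopy("lab workstation clone", "workstation disk", offsite=False),
    DataCopy("hosted revision control repository", "hosted service", offsite=True),
]

print(satisfies_3_2_1(pipeline_code))      # True
print(satisfies_3_2_1(pipeline_code[:2]))  # False: two copies, none off-site
```

A check like this only describes where copies are supposed to exist. It does not verify that any of them can actually be restored, which is worth testing separately.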
Please review the data storage options provided by UAB IT for maintaining copies of your data. In choosing among these options, you should also be aware of UAB's data classification rules and the security requirements for sensitive and restricted data storage. Given the importance of backup, Research Computing continues to explore options to facilitate data backup workflows from the cluster. Please contact us if you have questions or would like to discuss specific data backup scenarios.
A good guide for thinking about your backup strategy might be: "If you aren't managing a data backup process, then you have no backup data."
Grant and Publication Resources
THE FOLLOWING INFORMATION IS OUT OF DATE. We are currently in the process of transitioning the information to a new format and consolidating. Please see Resources for the current hardware information.
The following description may prove useful in summarizing the services available via Cheaha. Any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below. We also request that you send us a list of publications based on your use of Cheaha resources.
Description of Cheaha for Grants (short)
UAB IT Research Computing maintains high performance compute (HPC) and storage resources for investigators. The Cheaha compute cluster provides over 3744 conventional Intel CPU cores and 80 accelerators (including 72 NVIDIA P100 GPUs) interconnected via an EDR InfiniBand network and provides 528 TFLOP/s of aggregate theoretical peak performance. High performance 6.6PB raw GPFS storage on a DDN SFA14KX cluster, with site replication to a DDN SFA12KX cluster, is also connected to the compute nodes via an InfiniBand fabric. An additional 20TB of traditional SAN storage is also available for home directories. This general access compute fabric is available to all UAB investigators.
Description of Cheaha for Grants (Detailed)
The Cyberinfrastructure supporting University of Alabama at Birmingham (UAB) investigators includes high performance computing clusters, storage, campus, statewide and regionally connected high-bandwidth networks, and conditioned space for hosting and operating HPC systems, research applications and network equipment.
Cheaha HPC system
Cheaha is a campus HPC resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (RC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high performance computing (HPC) and high throughput computing (HTC) paradigms. Cheaha is composed of resources that span two UAB campus IT data centers, in the 936 Building and the RUST Computer Center, and a commercial data center at DC BLOX in Birmingham. Research Computing, in open collaboration with the campus research community, is leading the design and development of these resources.
Cheaha provides users with both a web-based interface, via Open OnDemand, and a traditional command-line interactive environment, via SSH. These interfaces provide access to many scientific tools that can leverage a dedicated pool of local compute resources via the SLURM batch scheduler. The local compute pool provides access to five generations of compute hardware based on the x86 64-bit architecture.
- Gen6 (2015-2016) includes 96 nodes: 2x12 core (2304 cores total) 2.5GHz Intel Xeon E5-2680 v3 compute nodes with an FDR InfiniBand interconnect. Of the 96 compute nodes, 36 nodes have 128GB RAM, 38 nodes have 256GB RAM, and 14 nodes have 384GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs
- Gen7 (2017) is composed of 18 nodes: 2x14 core (504 cores total) 2.4GHz Intel Xeon E5-2680 v4 compute nodes with 256GB RAM, four NVIDIA Tesla P100 16GB GPUs per node, and an EDR InfiniBand interconnect
- Gen8 (2019) is composed of 35 nodes with an EDR InfiniBand interconnect: 2x12 core (840 cores total) 2.60GHz Intel Xeon Gold 6126 compute nodes, with 21 nodes at 192GB RAM, 10 nodes at 768GB RAM, and 4 nodes at 1.5TB RAM
- Gen9 (available Q2 2021) is composed of 52 nodes with an EDR InfiniBand interconnect: 2x24 core (2496 cores total) 3.0GHz Intel Xeon Gold 6248R compute nodes, each with 192GB RAM
The compute nodes combine to provide over 600 TFLOP/s of dedicated computing power.
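The per-generation core totals quoted above follow directly from the node counts. The short Python sketch below reproduces that arithmetic, assuming "2xN core" means two CPU sockets per node with N cores each. The variable and function names are illustrative only.

```python
# (generation, node count, sockets per node, cores per socket), as quoted above.
# Reading "2x12 core" as 2 sockets x 12 cores is an assumption of this sketch.
generations = [
    ("Gen6", 96, 2, 12),
    ("Gen7", 18, 2, 14),
    ("Gen8", 35, 2, 12),
    ("Gen9", 52, 2, 24),
]


def total_cores(nodes, sockets, cores_per_socket):
    """CPU cores contributed by one hardware generation."""
    return nodes * sockets * cores_per_socket


totals = {name: total_cores(n, s, c) for name, n, s, c in generations}

for name, cores in totals.items():
    print(f"{name}: {cores} cores")
# Gen6: 2304 cores
# Gen7: 504 cores
# Gen8: 840 cores
# Gen9: 2496 cores

print("Gen6-Gen9 combined:", sum(totals.values()))  # Gen6-Gen9 combined: 6144

# Gen7 carries four P100 GPUs on each of its 18 nodes.
p100_gpus = 18 * 4
print("P100 GPUs:", p100_gpus)  # P100 GPUs: 72
```

The combined figure covers only the four generations itemized here, so it should be read as a consistency check on the numbers above, not as an official core count for the cluster.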
In addition, UAB researchers have access to regional and national HPC resources such as the Alabama Supercomputer Authority (ASA), XSEDE, and Open Science Grid (OSG).
Research Computing has operated a development OpenStack cloud resource since 2019. This platform has been used to support application development and DevOps processes for research labs across campus. In 2021 a production implementation of this cloud platform will be made available to researchers on campus. This fabric is composed of five Dell R640 compute nodes, each with 48 cores and 192GB RAM, for a total of 240 cores and 960GB RAM of standard cloud compute resources. In addition, the fabric will feature four NVIDIA DGX A100 nodes that include 8 A100 GPUs and 1TB of RAM each. All of these resources will be available to the research community for provisioning on demand via the OpenStack services (Ussuri release). The production implementation will further support researchers making their hosted services available beyond campus while adhering to standard campus network security practices. This off-campus access feature has not been available via the development cloud.
The compute nodes on Cheaha are backed by high performance 6.6PB raw GPFS storage on DDN SFA14KX hardware connected via an EDR/FDR InfiniBand fabric. The non-scratch files on the GPFS cluster are replicated to 6.0PB raw storage on a DDN SFA12KX located in the RUST data center to provide site redundancy. An additional 10TB of traditional SAN storage is also available for home directories.
Three new storage fabrics will come online in 2021. All three storage fabrics are based on Ceph with different hardware configurations to address different usage scenarios. The fabrics are a 6.9PB archive storage fabric built using 12 Dell DSS7500 nodes, an expanded 1.3PB nearline storage fabric built with 14 Dell 740xd nodes, and a 248TB SSD cache storage fabric built with 8 Dell 840 nodes.
The UAB Research Network is currently a dedicated 40GE optical connection between the UAB Shared HPC Facility in 936 and the RUST Campus Data Center to create a multi-site facility housing the Research Computing System (RCS). This network is being upgraded in 2021 to replace aging equipment and extend service to the DC BLOX data center. The new network provides a 200Gbps Ethernet backbone for East-West traffic connecting storage and compute hosting resources. The network supports direct connection to campus and high-bandwidth regional networks via 40Gbps Globus Data Transfer Nodes (DTNs), providing the capability to connect data-intensive research facilities directly with the high performance computing and storage services of the Research Computing System. This network can support very high speed, secure connectivity between the nodes connected to it, allowing very large data sets to be transferred without interfering with other traffic on the campus backbone and ensuring predictable latencies. The Science DMZ interface with the DTNs includes perfSONAR measurement nodes and a Bro security node connected directly to the border router, providing a "friction-free" pathway to access external data repositories as well as computational resources.
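To give a feel for what these link speeds mean for large data sets, the Python sketch below computes an idealized lower bound on transfer time: payload size divided by line rate. It ignores protocol overhead, storage throughput, and contention, so real transfers will take longer. The 10 TB data set size and the function name are hypothetical choices for illustration.

```python
def ideal_transfer_seconds(size_bytes, link_gbps):
    """Lower bound on transfer time in seconds: payload bits / line rate.

    Ignores protocol overhead, disk speed, and competing traffic.
    """
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


TB = 10**12  # decimal terabyte, in bytes

# Link speeds taken from the description above; the 10 TB size is hypothetical.
for label, gbps in [("40Gbps DTN", 40), ("200Gbps backbone", 200)]:
    seconds = ideal_transfer_seconds(10 * TB, gbps)
    print(f"10 TB over {label}: {seconds / 60:.1f} minutes")
# 10 TB over 40Gbps DTN: 33.3 minutes
# 10 TB over 200Gbps backbone: 6.7 minutes
```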
The campus network backbone is based on a 40 gigabit redundant Ethernet network with 480 gigabit/second back-planes on the core L2/L3 Switch/Routers. For efficient management, a collapsed backbone design is used. Each campus building is connected using 10 Gigabit Ethernet links over single mode optical fiber. Desktops are connected at 1 gigabit/second. The campus wireless network blankets classrooms, common areas, and most academic office buildings.
UAB connects to the Internet2 high-speed research network via the University of Alabama System Regional Optical Network (UASRON), a University of Alabama System owned and operated DWDM network offering 100Gbps Ethernet to the Southern Light Rail (SLR)/Southern Crossroads (SoX) in Atlanta, Ga. The UASRON also connects UAB to UA and UAH, the other two University of Alabama System institutions, and to the Alabama Supercomputer Center. UAB is also connected to other universities and schools through the Alabama Research and Education Network (AREN).
UAB IT Research Computing currently maintains a support staff of 10 led by the Assistant Vice President for Research Computing. The staff includes an HPC Architect-Manager, four software developers, two scientists, two system administrators, and a project coordinator.
Acknowledgment in Publications
This work was supported in part by the National Science Foundation under Grant No. OAC-1541310, the University of Alabama at Birmingham, and the Alabama Innovation Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the University of Alabama at Birmingham.