Resources and Obsolete: Welcome: Difference between pages

The [[wikipedia:Cyberinfrastructure|Cyberinfrastructure]] supporting UAB investigators includes high performance computing clusters, high-speed storage systems, campus, statewide, and regionally connected high-bandwidth networks, and conditioned space for hosting and operating HPC systems, research applications, and network equipment.


[[Cheaha]] is a campus HPC resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (RC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high-performance computing (HPC) and high-throughput computing (HTC) paradigms. Cheaha is composed of resources that span two UAB IT data centers, in the 936 Building and the RUST Computer Center, as well as an expansion into a commercial facility at DC BLOX in Birmingham. Research Computing, in open collaboration with community members, is leading the design and development of these resources.


A description of the facilities available to UAB researchers is included below. If you would like an account on the HPC system, please {{CheahaAccountRequest}} and provide a short statement on your intended use of the resources and your affiliation with the university.


== UAB High Performance Computing (HPC) Clusters ==


=== Compute Resources ===


The current compute fabric for this system is anchored by the [[Cheaha]] HPC cluster, a commodity cluster with 6144 cores connected by low-latency Fourteen Data Rate (FDR) and Enhanced Data Rate (EDR) InfiniBand networks.


The hardware generations are summarized in the following list:
* Gen9 (available Q2 2021): 52 2x24 core (2496 cores total) 3.0GHz Intel Xeon Gold 6248R compute nodes, each with 192GB RAM and an EDR InfiniBand interconnect.
* Gen8: 35 2x12 core (840 cores total) 2.60GHz Intel Xeon Gold 6126 compute nodes with EDR InfiniBand interconnect; 21 nodes have 192GB RAM, 10 nodes have 768GB RAM, and 4 nodes have 1.5TB of RAM (2019).
* Gen7: 18 2x14 core (504 cores total) 2.4GHz Intel Xeon E5-2680 v4 compute nodes with 256GB RAM, four NVIDIA Tesla P100 16GB GPUs per node, and EDR InfiniBand interconnect (supported by UAB, 2017).
* Gen6: 96 2x12 core (2304 cores total) 2.5 GHz Intel Xeon E5-2680 v3 compute nodes with FDR InfiniBand interconnect. Of the 96 compute nodes, 36 have 128 GB RAM, 38 have 256 GB RAM, and 14 have 384 GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs (supported by UAB, 2015/2016).


==== Retired Generations ====


* Gen5: 12 2x8 core (192 cores total) 2.0 GHz Intel Xeon E5-2650 compute nodes with 96GB RAM per node and a 10 Gbps interconnect, dedicated to OpenStack and Ceph (supported by UAB IT, 2012).
* Gen4: 3 2x8 core (48 cores total) 2.70 GHz Intel Xeon compute nodes with 384GB RAM per node (24GB per core), QDR InfiniBand interconnect (supported by Section on Statistical Genetics, School of Public Health, 2012).
* Gen3: 48 2x6 core (576 cores total) 2.66 GHz Intel Xeon compute nodes with 48GB RAM per node (4GB per core), QDR InfiniBand interconnect (supported by NIH grant S10RR026723-01, 2010).
* Gen2: 24 2x4 core (192 cores total) 3.0 GHz Intel Xeon compute nodes with 16GB RAM per node (2GB per core), DDR InfiniBand interconnect (supported by UAB IT, 2008).
* Gen1: 60 2-core (120 cores total) AMD 1.6GHz Opteron 64-bit compute nodes with 2GB RAM per node (1GB per core), and Gigabit Ethernet connectivity between the nodes (supported by Alabama EPSCoR Research Infrastructure Initiative, NSF EPS-0091853, 2005). 


{{CheahaTflops}}


=== Storage Resources ===


In 2016, as part of an Alabama Innovation Fund grant awarded in partnership with numerous departments, 6.6PB of raw GPFS storage on DDN SFA12KX hardware was added to meet the growing data needs of UAB researchers. In Fall 2018, UAB IT Research Computing upgraded the 6PB GPFS storage backend with the next-generation DDN SFA14KX. This hardware improved HPC performance by increasing the speed at which research applications can access their data sets. In 2019, the SFA12KX was moved to the RUST data center, where it acts as a replication pair for the /data file system on the SFA14KX in the 936 Building.


==== Retired Storage Resources ====


In 2009, annual investment funds were directed toward establishing a fully connected dual data rate InfiniBand network between the compute nodes added in 2008 and laying the foundation for a research storage system with a 60TB DDN storage system accessed via the Lustre distributed file system. In 2010, UAB was awarded an NIH Small Instrumentation Grant (SIG) to further increase analytical and storage capacity with an additional 120TB of high performance Lustre storage on DDN hardware (retired in 2016). In Fall 2013, UAB IT Research Computing acquired an OpenStack cloud and Ceph storage software fabric through a partnership between Dell and Inktank in order to extend cloud-computing solutions to researchers at UAB and enhance the interfacing capabilities for HPC. This storage system provided an aggregate of half a petabyte of raw storage distributed across 12 compute nodes, each with 16 cores, 96GB RAM, and 36TB of storage, connected by 10 Gigabit Ethernet networking (pilot implementation retired in Spring 2017).


=== Network Resources ===


==== Research Network ====


'''UAB Research Network''' The UAB Research Network is currently a dedicated 40GE optical connection between the UAB Shared HPC Facility in the 936 Building and the RUST Campus Data Center, creating a multi-site facility housing the Research Computing System (RCS). This network is being upgraded in 2021 to replace aging equipment and extend service to the DC BLOX data center. The new network provides a 200 Gbps Ethernet backbone for east-west traffic connecting storage and compute hosting resources. The network supports direct connection to campus and high-bandwidth regional networks via 40Gbps Globus Data Transfer Nodes (DTNs), providing the capability to connect data-intensive research facilities directly with the high performance computing and storage services of the Research Computing System. The network supports very high speed, secure connectivity between the nodes attached to it for transferring very large data sets without interfering with other traffic on the campus backbone, ensuring predictable latencies. The Science DMZ interface with the DTNs includes perfSONAR measurement nodes and a Bro security node connected directly to the border router, providing a "friction-free" pathway to external data repositories as well as computational resources.
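To put these link speeds in perspective, the sketch below estimates line-rate transfer times for a 1TB data set over the 10Gbps DTN, 40Gbps research network, and 200Gbps backbone links described above. This is a rough illustration only; protocol overhead, storage throughput, and parallel transfer streams are ignored.

<syntaxhighlight lang="python">
# Rough line-rate transfer-time estimates for large data sets.
# Illustrative only: ignores protocol overhead, disk throughput, and tuning.

def transfer_time_seconds(size_tb: float, link_gbps: float) -> float:
    bits = size_tb * 1e12 * 8        # decimal terabytes -> bits
    return bits / (link_gbps * 1e9)  # bits / (bits per second)

for link_gbps in (10, 40, 200):      # DTN, research network, and backbone rates above
    t = transfer_time_seconds(1, link_gbps)
    print(f"1 TB over {link_gbps:>3} Gbps: ~{t:,.0f} s (~{t / 60:.1f} min)")
# ~800 s at 10 Gbps, ~200 s at 40 Gbps, ~40 s at 200 Gbps (line rate)
</syntaxhighlight>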




==== Campus Network ====
'''Campus High Speed Network Connectivity''' The campus network backbone is based on a 40 gigabit redundant Ethernet network with 480 gigabit/second backplanes on the core L2/L3 switch/routers. For efficient management, a collapsed backbone design is used. Each campus building is connected using gigabit Ethernet links over single-mode optical fiber. Within multi-floor buildings, a gigabit Ethernet building backbone over multimode optical fiber is used, and Category 5 or better unshielded twisted pair wiring connects desktops to the network. Computer server clusters are connected to the building entrance using Gigabit Ethernet. Desktops are connected at 1 gigabit/second. The campus wireless network blankets classrooms, common areas, and most academic office buildings.


==== Regional Networks ====
'''Off-campus Network Connections''' UAB connects to the Internet2 high-speed research network via the University of Alabama System Regional Optical Network (UASRON), a University of Alabama System owned and operated DWDM network offering 100Gbps Ethernet to the Southern Light Rail (SLR)/Southern Crossroads (SoX) in Atlanta, Ga. The UASRON also connects UAB to UA and UAH, the other two University of Alabama System institutions, and to the Alabama Supercomputer Center. UAB is also connected to other universities and schools through the Alabama Research and Education Network (AREN).


==== Historical Network Developments ====


UAB was awarded an NSF CC*DNI Networking Infrastructure grant ([http://www.nsf.gov/awardsearch/showAward?AWD_ID=1541310 CC-NIE-1541310]) in Fall 2016 to establish a dedicated high-speed research network (the UAB Science DMZ) with a 40Gbps networking core, providing researchers at UAB with 10Gbps connections from selected computers to the shared computational facility.

Latest revision as of 17:22, 30 August 2022

Welcome to the Research Computing System

The Research Computing System (RCS) provides a framework for sharing data, accessing compute power, and collaborating with peers on campus and around the globe. Our goal is to construct a dynamic "network of services" that you can use to organize your data, study it, and share outcomes.

'docs' (the service you are looking at while reading this text) is one of a set of core services, or libraries, available for you to organize the information you gather. Docs is a wiki, an online editor for collaboratively writing and sharing documentation. (Wiki is a Hawaiian term meaning fast.) You can learn more about docs on the page UnderstandingDocs. The docs wiki is filled with pages that document the many different services and applications available on the Research Computing System. If you see information that looks out of date, please don't hesitate to ask about it or fix it.

The Research Computing System is designed to provide services to researchers in three core areas:

  • Data Analysis - using the High Performance Computing (HPC) fabric we call Cheaha for analyzing data and running simulations. Many applications are already available or you can install your own
  • Data Sharing - supporting the trusted exchange of information using virtual data containers to spark new ideas
  • Application Development - providing virtual machines and web-hosted development tools empowering you to serve others with your research

Support and Development

The Research Computing System is developed and supported by UAB IT's Research Computing Group. We are also developing a core set of applications to help you easily incorporate our services into your research processes, along with this documentation collection to help you leverage the resources already available. We follow the best practices of the Open Source community and develop the RCS openly. You can follow our progress via our development wiki (http://dev.uabgrid.uab.edu).

The Research Computing System is an outgrowth of the UABgrid pilot, launched in September 2007, which has focused on demonstrating the utility of unlimited analysis, storage, and application for research. The RCS is being built on the same technology foundations used by major cloud vendors and decades of distributed systems computing research, technology that powered the last ten years of large-scale systems serving prominent national and international initiatives like the Open Science Grid, XSEDE, TeraGrid, the LHC Computing Grid, and caBIG.

Outreach

The UAB IT Research Computing Group has collaborated with a number of prominent research projects at UAB to identify use cases and develop the requirements for the RCS. Our collaborators include the Center for Clinical and Translational Science (CCTS), Heflin Genomics Center, the Comprehensive Cancer Center (CCC), the Department of Computer and Information Sciences (CIS), the Department of Mechanical Engineering (ME), Lister Hill Library, the School of Optometry's Center for the Development of Functional Imaging, and Health System Information Services (HSIS).

As part of the process of building this research computing platform, the UAB IT Research Computing Group has hosted an annual campus symposium on research computing and cyber-infrastructure (CI) developments and accomplishments. Starting as CyberInfrastructure (CI) Days in 2007, the name was changed to UAB Research Computing Day in 2011 to reflect the broader mission to support research. IT Research Computing also participates in other campus wide symposiums including UAB Research Core Day.

Featured Research Applications

The Research Computing Group also helps support the campus MATLAB license with self-service installation documentation and supports using MATLAB on the HPC platform, providing a pathway to expand your computational power and freeing your laptop from serving as a compute platform.


UAB MATLAB Information

In January 2011, UAB acquired a site license from MathWorks for MATLAB, Simulink, and 42 toolboxes.

  • Learn more about MATLAB and how you can use it at UAB
  • Learn more about the UAB MathWorks site license and review the frequently asked questions about the license

The UAB IT Research Computing group, the CCTS BMI, and Heflin Center for Genomic Science have teamed up to help improve genomic research at UAB. Researchers can work with the scientists and research experts to produce a research pipeline from sequencing, to analysis, to publication.


Galaxy

A web front end for running analyses on the cluster fabric, currently focused on NGS (Next Generation Sequencing; biology) analysis support.

  • Galaxy Project Home
  • Galaxy Development Wiki (http://projects.uabgrid.uab.edu/galaxy)

Data Backups

Users of Cheaha are solely responsible for backing up their files. This includes files under /data/user, /data/project, and /home.

There is no automatic backup of any user data on the cluster in home, data, or scratch. At this time, all user data backup processes are defined and managed by each user and/or lab. Given that data backup demands vary widely between different users, groups, and research domains, this approach enables those who are most familiar with the data to make appropriate decisions based on their specific needs.

For example, if a group is working with a large shared data set that is a local copy of a data set maintained authoritatively at a national data bank, maintaining a local backup is unlikely to be a productive use of limited storage resources, since this data could be restored from the authoritative source. If, however, you are maintaining a unique data set of which yours is the only copy, then maintaining a backup is critical if you value that data. It's worth noting that while this "uniqueness" criterion may not apply to the data you analyze, it may readily apply to the codes that define your analysis pipelines.

An often recommended backup policy is the 3-2-1 rule: maintain three copies of data, on two different media, with one copy off-site. You can read more about the 3-2-1 rule here. In the case of your application codes, using revision control tools during development provides an easy way to maintain a second copy, makes for a good software development process, and can help achieve reproducible research goals.
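As a minimal illustration of the second-copy idea, the sketch below copies a directory tree to another location and verifies the copy by checksum. The paths are hypothetical placeholders; for large data sets, tools such as rsync or Globus transfers via the DTNs are generally more appropriate than a script like this.

<syntaxhighlight lang="python">
# Sketch: copy a directory tree to a second location and verify it by checksum.
# SOURCE and DEST are hypothetical placeholders; substitute your own locations.
import hashlib
import shutil
from pathlib import Path

SOURCE = Path("/data/user/jdoe/pipeline-code")   # hypothetical source directory
DEST = Path("/backup/jdoe/pipeline-code")        # hypothetical backup target

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

shutil.copytree(SOURCE, DEST, dirs_exist_ok=True)    # copy (requires Python 3.8+)

for src_file in SOURCE.rglob("*"):                   # verify every file arrived intact
    if src_file.is_file():
        dst_file = DEST / src_file.relative_to(SOURCE)
        if sha256(src_file) != sha256(dst_file):
            raise RuntimeError(f"checksum mismatch: {src_file}")

print("backup copy verified")
</syntaxhighlight>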

Please review the data storage options provided by UAB IT for maintaining copies of your data. In choosing among these options, you should also be aware of UAB's data classification rules and the security requirements for storing sensitive and restricted data. Given the importance of backup, Research Computing continues to explore options to facilitate data backup workflows from the cluster. Please contact us if you have questions or would like to discuss specific data backup scenarios.

A good guide for thinking about your backup strategy might be: "If you aren't managing a data backup process, then you have no backup data."

Grant and Publication Resources

The following description may prove useful in summarizing the services available via Cheaha. Any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below. We also request that you send us a list of publications based on your use of Cheaha resources.

Description of Cheaha for Grants (short)

UAB IT Research Computing maintains high performance computing (HPC) and storage resources for investigators. The Cheaha compute cluster provides over 3744 conventional Intel CPU cores and 80 accelerators (including 72 NVIDIA P100 GPUs) interconnected via an EDR InfiniBand network, and provides 528 TFLOP/s of aggregate theoretical peak performance. A high performance 6.6PB (raw) GPFS file system on a DDN SFA14KX cluster, with site replication to a DDN SFA12KX cluster, is also connected to the compute nodes via an InfiniBand fabric. An additional 20TB of traditional SAN storage is available for home directories. This general access compute fabric is available to all UAB investigators.

Description of Cheaha for Grants (Detailed)

The Cyberinfrastructure supporting University of Alabama at Birmingham (UAB) investigators includes high performance computing clusters, storage, campus, statewide and regionally connected high-bandwidth networks, and conditioned space for hosting and operating HPC systems, research applications and network equipment.

Cheaha HPC system

Cheaha is a campus HPC resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (RC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high performance computing (HPC) and high throughput computing (HTC) paradigms. Cheaha is composed of resources that span data centers located in two UAB campus IT data centers, in the 936 Building and the RUST Computer Center, and a commercial data center at DC BLOX in Birmingham. Research Computing, in open collaboration with the campus research community, is leading the design and development of these resources.

Compute Resources

Cheaha provides users with both a web-based interface, via Open OnDemand, and a traditional command-line interactive environment, via SSH. These interfaces provide access to many scientific tools that can leverage a dedicated pool of local compute resources via the SLURM batch scheduler. The local compute pool provides access to five generations of compute hardware based on the x86 64-bit architecture. Gen6 (2015-2016) includes 96 nodes: 2x12 core (2304 cores total) 2.5 GHz Intel Xeon E5-2680 v3 compute nodes with an FDR InfiniBand interconnect. Of the 96 compute nodes, 36 nodes have 128 GB RAM, 38 nodes have 256 GB RAM, and 14 nodes have 384 GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs. Gen7 (2017) is composed of 18 nodes: 2x14 core (504 cores total) 2.4GHz Intel Xeon E5-2680 v4 compute nodes with 256GB RAM, four NVIDIA Tesla P100 16GB GPUs per node, and an EDR InfiniBand interconnect. Gen8 (2019) is composed of 35 nodes with EDR InfiniBand interconnect: 2x12 core (840 cores total) 2.60GHz Intel Xeon Gold 6126 compute nodes with 21 nodes at 192GB RAM, 10 nodes at 768GB RAM, and 4 nodes at 1.5TB of RAM. Gen9 (available Q2 2021) is composed of 52 nodes with EDR InfiniBand interconnect: 2x24 core (2496 cores total) 3.0GHz Intel Xeon Gold 6248R compute nodes, each with 192GB RAM. The compute nodes combine to provide over 600 TFLOP/s of dedicated computing power.
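For context, theoretical peak performance is usually estimated as cores × clock rate × floating-point operations per cycle. The sketch below applies that arithmetic to the CPU portion of each generation listed above, assuming 16 double-precision FLOPs per cycle for the AVX2-era Xeon E5 v3/v4 nodes and 32 for the AVX-512-capable Xeon Gold nodes; these per-cycle figures are assumptions rather than values stated here, and GPU accelerators (which contribute a large share of the quoted totals) are excluded.

<syntaxhighlight lang="python">
# Estimate theoretical peak CPU performance per hardware generation.
# FLOPs-per-cycle values are assumptions (AVX2 = 16 DP FLOP/cycle,
# AVX-512 with two FMA units = 32 DP FLOP/cycle); GPUs are excluded.
generations = {
    #          (cores, GHz, double-precision FLOPs per core per cycle)
    "Gen6": (2304, 2.5, 16),   # Xeon E5-2680 v3
    "Gen7": (504, 2.4, 16),    # Xeon E5-2680 v4
    "Gen8": (840, 2.6, 32),    # Xeon Gold 6126
    "Gen9": (2496, 3.0, 32),   # Xeon Gold 6248R
}

total_tflops = 0.0
for name, (cores, ghz, flops_per_cycle) in generations.items():
    tflops = cores * ghz * 1e9 * flops_per_cycle / 1e12
    total_tflops += tflops
    print(f"{name}: {tflops:6.1f} TFLOP/s peak (CPU only)")

print(f"CPU total: {total_tflops:.0f} TFLOP/s peak")  # roughly 420 TFLOP/s under these assumptions
</syntaxhighlight>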

In addition, UAB researchers have access to regional and national HPC resources such as the Alabama Supercomputer Authority (ASA), XSEDE, and the Open Science Grid (OSG).

Cloud Resources

Research Computing has operated a development OpenStack cloud resource since 2019. This platform has been used to support application development and DevOps processes for research labs across campus. In 2021, a production implementation of this cloud platform will be made available to researchers on campus. This fabric is composed of five Dell R640 compute nodes, each with 48 cores and 192GB RAM, for a total of 240 cores and 960GB of RAM of standard cloud compute resources. In addition, the fabric will feature four NVIDIA DGX A100 nodes, each with 8 A100 GPUs and 1TB of RAM. All of these resources will be available to the research community for provisioning on demand via the OpenStack services (Ussuri release). The production implementation will further support researchers in making their hosted services available beyond campus while adhering to standard campus network security practices; this off-campus access has not been available via the development cloud.
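As an illustration of what on-demand provisioning against the OpenStack services might look like from a researcher's workstation, here is a sketch using the openstacksdk Python client. The cloud entry, image, flavor, network, and keypair names are hypothetical placeholders, not names published for this platform.

<syntaxhighlight lang="python">
# Sketch: provision a VM through OpenStack's Python SDK (openstacksdk).
# All resource names below are hypothetical placeholders.
import openstack

conn = openstack.connect(cloud="uab-rc-cloud")       # entry from your clouds.yaml

image = conn.compute.find_image("Ubuntu-20.04")      # hypothetical image name
flavor = conn.compute.find_flavor("m1.medium")       # hypothetical flavor name
network = conn.network.find_network("lab-net")       # hypothetical project network

server = conn.compute.create_server(
    name="analysis-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="my-keypair",                           # hypothetical keypair
)
server = conn.compute.wait_for_server(server)        # block until the server is ACTIVE
print(server.status, server.access_ipv4)
</syntaxhighlight>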

Storage Resources

The compute nodes on Cheaha are backed by 6.6PB of raw, high performance GPFS storage on DDN SFA14KX hardware connected via an EDR/FDR InfiniBand fabric. The non-scratch files on the GPFS cluster are replicated to 6.0PB of raw storage on a DDN SFA12KX located in the RUST data center to provide site redundancy. An additional 10TB of traditional SAN storage is available for home directories.

Three new storage fabrics will come online in 2021. All three are based on Ceph, with different hardware configurations to address different usage scenarios: a 6.9PB archive storage fabric built using 12 Dell DSS7500 nodes, an expanded 1.3PB nearline storage fabric built with 14 Dell 740xd nodes, and a 248TB SSD cache storage fabric built with 8 Dell 840 nodes.

Network Resources

The UAB Research Network is currently a dedicated 40GE optical connection between the UAB Shared HPC Facility in the 936 Building and the RUST Campus Data Center, creating a multi-site facility housing the Research Computing System (RCS). This network is being upgraded in 2021 to replace aging equipment and extend service to the DC BLOX data center. The new network provides a 200 Gbps Ethernet backbone for east-west traffic connecting storage and compute hosting resources. The network supports direct connection to campus and high-bandwidth regional networks via 40Gbps Globus Data Transfer Nodes (DTNs), providing the capability to connect data-intensive research facilities directly with the high performance computing and storage services of the Research Computing System. The network supports very high speed, secure connectivity between the nodes attached to it for transferring very large data sets without interfering with other traffic on the campus backbone, ensuring predictable latencies. The Science DMZ interface with the DTNs includes perfSONAR measurement nodes and a Bro security node connected directly to the border router, providing a "friction-free" pathway to external data repositories as well as computational resources.

The campus network backbone is based on a 40 gigabit redundant Ethernet network with 480 gigabit/second backplanes on the core L2/L3 switch/routers. For efficient management, a collapsed backbone design is used. Each campus building is connected using 10 Gigabit Ethernet links over single-mode optical fiber. Desktops are connected at 1 gigabit/second. The campus wireless network blankets classrooms, common areas, and most academic office buildings.

UAB connects to the Internet2 high-speed research network via the University of Alabama System Regional Optical Network (UASRON), a University of Alabama System owned and operated DWDM network offering 100Gbps Ethernet to the Southern Light Rail (SLR)/Southern Crossroads (SoX) in Atlanta, Ga. The UASRON also connects UAB to UA and UAH, the other two University of Alabama System institutions, and to the Alabama Supercomputer Center. UAB is also connected to other universities and schools through the Alabama Research and Education Network (AREN).

Personnel

UAB IT Research Computing currently maintains a support staff of 10, led by the Assistant Vice President for Research Computing, and includes an HPC architect/manager, four software developers, two scientists, two system administrators, and a project coordinator.

Acknowledgment in Publications

This work was supported in part by the National Science Foundation under Grant No. OAC-1541310, the University of Alabama at Birmingham, and the Alabama Innovation Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the University of Alabama at Birmingham.