{{Main_Banner}}
The Cheaha overview page has moved to Research Computing's new documentation site. Please visit https://docs.rc.uab.edu/ for information on Cheaha. The obsolete content of the original page, preserved below, can also be found at [[Obsolete: Cheaha]] for historical reference.
'''Cheaha''' is a campus resource dedicated to enhancing research computing productivity at UAB. [http://cheaha.uabgrid.uab.edu Cheaha] is managed by [http://www.uab.edu/it UAB Information Technology's Research Computing group (UAB ITRC)] and is available to members of the UAB community in need of increased computational capacity.  Cheaha supports [http://en.wikipedia.org/wiki/High-performance_computing high-performance computing (HPC)] and [http://en.wikipedia.org/wiki/High-throughput_computing high throughput computing (HTC)] paradigms.


Cheaha provides users with a traditional command-line interactive environment with access to many scientific tools that can leverage its dedicated pool of local compute resources.  Alternatively, users of graphical applications can start a [[Setting_Up_VNC_Session|cluster desktop]]. The local compute pool provides access to three generations of compute hardware based on the [http://en.wikipedia.org/wiki/X86_64 x86-64 64-bit architecture].  The compute resources are organized into a unified Research Computing System.  The compute fabric for this system is anchored by the Cheaha cluster, [[ Resources |a commodity cluster with several generations of hardware and approximately 3000 cores]] connected by low-latency Fourteen Data Rate (FDR) and Quad Data Rate (QDR) InfiniBand networks.  The compute nodes are backed by 6.6PB of raw GPFS storage on DDN SFA12KX hardware, 180TB of high-performance Lustre storage on DDN hardware, and an additional 20TB for home directories on a traditional Hitachi SAN, along with other ancillary services. Together the compute nodes provide over 120TFlops of dedicated computing power.
 
Cheaha is composed of resources that span data centers located in the UAB Shared Computing Facility (UAB 936 Building) and the RUST Computer Center. Resource design and development is led by UAB IT Research Computing in open collaboration with community members. Operational [mailto:support@vo.uabgrid.uab.edu support] is provided by UAB IT's Research Computing group.
 
Cheaha is named in honor of [http://en.wikipedia.org/wiki/Cheaha_Mountain Cheaha Mountain], the highest peak in the state of Alabama.  Cheaha is a popular destination whose summit offers clear vistas of the surrounding landscape. (Cheaha Mountain photo-streams on [http://www.flickr.com/search/?q=cheaha Flickr] and [http://picasaweb.google.com/lh/view?q=cheaha&psc=G&filter=1# Picasa]).
 
== Using ==
 
=== Getting Started ===
 
For information on getting an account, logging in, and running a job, please see [[Cheaha2_GettingStarted|Getting Started]].
 
== History ==
 
[[Image:Research-computing-platform.png|right|thumb|450px|Logical Diagram of Cheaha Configuration]]
 
=== 2005 ===
 
In 2002 UAB was awarded an infrastructure development grant through the NSF EPSCoR program.  This led to the 2005 acquisition of a 64-node compute cluster with two AMD Opteron 242 1.6GHz CPUs per node (128 total cores).  This cluster was named Cheaha.  Cheaha expanded the compute capacity available at UAB and was the first general-access resource for the community. It led to expanded roles for UAB IT in research computing support through the development of the UAB Shared HPC Facility in BEC and provided further engagement in Globus-based grid computing resource development on campus via UABgrid and regionally via [http://www.suragrid.org SURAgrid].
 
=== 2008 ===
 
In 2008, UAB IT allocated funds for hardware upgrades, which led to the acquisition in August 2008 of an additional 192 cores based on a Dell clustering solution with Intel quad-core E5450 3.0GHz CPUs. This upgrade migrated Cheaha's core infrastructure to the Dell blade clustering solution. It provided a three-fold increase in processor density over the original hardware and enabled more computing power to be located in the same physical space with room for expansion, an important consideration in light of the continued growth in processing demand.  This hardware represented a major technology upgrade that included space for additional expansion to address overall capacity demand and enable resource reservation.
 
The 2008 upgrade began a continuous resource improvement plan that includes a phased development approach for Cheaha with on-going increases in capacity and feature enhancements being brought into production via an [http://projects.uabgrid.uab.edu/cheaha open community process].
 
Software improvements rolled into the 2008 upgrade included grid computing services to access distributed compute resources and orchestrate jobs using the [http://www.gridway.org GridWay] meta-scheduler. An initial 10 Gigabit Ethernet link establishing the UABgrid Research Network was designed to support high-speed data transfers between clusters connected to this network.
 
=== 2009 ===
 
In 2009, annual investment funds were directed toward establishing a fully connected dual data rate Infiniband network between the compute nodes added in 2008 and laying the foundation for a research storage system with a 60TB DDN storage system accessed via the Lustre distributed file system.  The Infiniband and storage fabrics were designed to support significant increases in research data sets and their associated analytical demand.
 
=== 2010 ===
 
In 2010, UAB was awarded an NIH Small Instrumentation Grant (SIG) to further increase analytical and storage capacity.  The grant funds were combined with the annual investment funds, adding 576 cores (48 nodes) based on the Intel Westmere 2.66 GHz CPU, a quad data rate Infiniband fabric with 32 uplinks, an additional 120 TB of storage for the DDN fabric, and additional hardware to improve reliability. Additional improvements to the research compute platform involved extending the UAB Research Network to link the BEC and RUST data centers and adding 20TB of user and ancillary services storage.
 
=== 2012 ===
 
In 2012, UAB IT Research Computing invested in the foundation hardware to expand long-term storage and virtual machine capabilities with the acquisition of 12 Dell 720xd systems, each containing 16 cores, 96GB RAM, and 36TB of storage, creating a 192-core, 432TB virtual compute and storage fabric.
 
Additional hardware investment by the School of Public Health's Section on Statistical Genetics added three 384GB large-memory nodes and an additional 48 cores to the QDR Infiniband fabric.
 
=== 2013 ===
 
In 2013, UAB IT Research Computing acquired an [http://blogs.uabgrid.uab.edu/jpr/2013/03/were-going-with-openstack/ OpenStack cloud and Ceph storage software fabric] through a partnership between Dell and Inktank in order to extend cloud computing solutions to the researchers at UAB and enhance the interfacing capabilities for HPC.  This fabric is under [http://dev.uabgrid.uab.edu active development] and will see feature releases in the coming months.
 
=== 2015/2016 ===
 
In 2015/2016, UAB IT Research Computing acquired 96 2x12-core (2304 cores total) 2.5 GHz Intel Xeon E5-2680 v3 compute nodes with FDR InfiniBand interconnect. Of the 96 compute nodes, 36 have 128 GB RAM, 38 have 256 GB RAM, and 14 have 384 GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs. More information can be found at [[Resources]].
 
== Grant and Publication Resources ==
 
The following description may prove useful in summarizing the services available via Cheaha.  If you are using Cheaha for grant-funded research, please send information about your grant (funding source and grant number), a statement of intent for the research project, and a list of the applications you are using to UAB IT Research Computing.  If you are using Cheaha for exploratory research, please send a similar note on your research interest.  Finally, any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below.  Please note that your acknowledgment may also need to include an additional statement acknowledging grant-funded hardware.  We also ask that you send us references to any publications based on your use of Cheaha compute resources.
 
=== Description of Cheaha for Grants ===
 
UAB IT Research Computing maintains high-performance compute and storage resources for investigators. The Cheaha compute cluster provides 3120 conventional CPU cores across five generations of hardware, delivering over 120 TFLOP/s of combined computational performance and 20 TB of system memory, interconnected via an Infiniband network. High-performance storage is also connected to these cores via the Infiniband fabric: 6.6PB of raw GPFS storage on DDN SFA12KX hardware and a 180TB Lustre parallel file system built on a DataDirect Networks (DDN) hardware platform. An additional 20TB of traditional SAN storage and 432TB of OpenStack+Ceph storage are available via a 10+ GigE network fabric. This general-access compute fabric is available to all UAB investigators.
 
Additionally, NIH funded investigators are granted priority access to the NIH SIG award acquired compute and storage pool that includes an additional 576 2.66GHz Intel-based compute cores, 2.3TB RAM and 180TB in the high-performance Lustre parallel file system all interconnected via a QDR Infiniband network fabric.
 
=== Acknowledgment in Publications ===
 
This work was supported in part by the research computing resources acquired and managed by UAB IT Research Computing. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Alabama at Birmingham.
 
Additionally, NIH funded researchers using the cluster for genomics or proteomics analyses should include the following statement regarding the NIH funded gen3 (SIG) compute resources.
 
"Computational portions of this research were supported by NIH S10RR026723.  Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Institutes of Health"
 
== System Profile ==
 
=== Performance ===
{{CheahaTflops}}
 
=== Hardware ===
 
The Cheaha Compute Platform includes three generations of commodity compute hardware, totaling 868 compute cores, 2.8TB of RAM, and over 200TB of storage.
 
The hardware is grouped into generations designated gen1 through gen6 (oldest to newest). The following descriptions highlight the hardware profile for each generation.
 
* <strike> Generation 1 (gen1) -- 64 2-CPU AMD 1.6 GHz compute nodes with Gigabit interconnect. This is the original hardware collection purchased with NSF EPSCoR funds in 2005, approx $150K. These nodes are sometimes called the "Verari" nodes. These nodes are tagged as "verari-compute-#-#" in the ROCKS naming convention.</strike> (gen1 was decommissioned in June 2013)
* Generation 2 (gen2) -- 24 2x4 core (192 cores total) 3.0 GHz Intel compute nodes with dual data rate Infiniband interconnect and the initial high-perf storage implementation using 60TB DDN. This is the hardware collection purchased exclusively with the annual VPIT funds allocation, approx $150K/yr for the 2008 and 2009 fiscal years.  These nodes are sometimes confusingly called "cheaha2" or "cheaha" nodes. These nodes are tagged as "cheaha-compute-#-#" in the ROCKS naming convention.
* Generation 3 (gen3) -- 48 2x6 core (576 cores total) 2.66 GHz Intel compute nodes with quad data rate Infiniband, ScaleMP, and the high-perf storage build-out for capacity and redundancy with 120TB DDN. This is the hardware collection purchased with a combination of the NIH SIG funds and some of the 2010 annual VPIT investment. These nodes were given the code name "sipsey" and tagged as such in the node naming for the queue system. These nodes are tagged as "sipsey-compute-#-#" in the ROCKS naming convention. 16 of the gen3 nodes (sipsey-compute-0-1 thru sipsey-compute-0-16) were upgraded in 2014 from 48GB to 96GB of memory per node.
* Generation 4 (gen4) -- 3 16-core (48 cores total) compute nodes. This hardware collection was purchased by [http://www.soph.uab.edu/ssg/people/tiwari Hemant Tiwari of SSG]. These nodes were given the code name "ssg" and tagged as such in the node naming for the queue system. These nodes are tagged as "ssg-compute-0-#" in the ROCKS naming convention.
* Generation 6 (gen6) --
** 26 Compute Nodes with two 12 core processors (Intel Xeon E5-2680 v3 2.5GHz) with 256GB DDR4 RAM, FDR InfiniBand and 10GigE network cards
** 14 Compute Nodes with two 12 core processors (Intel Xeon E5-2680 v3 2.5GHz) with 384GB DDR4 RAM, FDR InfiniBand and 10GigE network cards
** FDR InfiniBand Switch
** 10Gigabit Ethernet Switch
** Management node and gigabit switch for cluster management
** Bright Advanced Cluster Management software licenses
 
Summarized, Cheaha's compute pool includes:
* gen4 is 48 cores of [http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI 2.70GHz eight-core Intel Xeon E5-2680 processors] with 24GB of RAM per core, or 384GB per node
* gen3.1 is 192 cores of [http://ark.intel.com/products/47922/Intel-Xeon-Processor-X5650-12M-Cache-2_66-GHz-6_40-GTs-Intel-QPI?q=x5650 2.67GHz six-core Intel Xeon X5650 processors] with 8GB RAM per core, or 96GB per node
* gen3 is 384 cores of [http://ark.intel.com/products/47922/Intel-Xeon-Processor-X5650-12M-Cache-2_66-GHz-6_40-GTs-Intel-QPI?q=x5650 2.67GHz six-core Intel Xeon X5650 processors] with 4GB RAM per core, or 48GB per node
* gen2 is 192 cores of [http://ark.intel.com/products/33083/Intel-Xeon-Processor-E5450-12M-Cache-3_00-GHz-1333-MHz-FSB 3.0GHz quad-core Intel Xeon E5450 processors] with 2GB RAM per core
* <strike> gen1 is 100 cores of 1.6GHz AMD Opteron 242 processors with 1GB RAM per core </strike> (decommissioned June 2013)
 
{|border="1" cellpadding="2" cellspacing="0"
|+ Physical Nodes
|- bgcolor=grey
!gen!!queue!!#nodes!!cores/node!!RAM/node
|-
|gen6.1|| ?? || 26 || 24 || 256G
|-
|gen6.2|| ?? || 14 || 24 || 384G
|-
|gen5||openstack(?)|| ? || ? || ?G
|-
|gen4||ssg||3||16||384G
|-
|gen3.1||sipsey||16||12||96G
|-
|gen3||sipsey||32||12||48G
|-
|gen2||cheaha||24||8||16G
|}
 
=== Software ===
 
Details of the software available on Cheaha can be found on the [http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Installed_software Cheaha cluster configuration page]; an overview follows.
 
Cheaha uses [http://modules.sourceforge.net/ Environment Modules] to support account configuration. Please follow these [http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Environment_Modules specific steps for using environment modules].
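As a brief illustration of the typical module workflow (the package name shown is just an example; run ''module avail'' on Cheaha to see exactly which modules and versions are installed):

<pre>
# List the modules available on the cluster
module avail

# Load a package into your environment (R is shown as an example;
# the exact module name/version may differ)
module load R

# Show what is currently loaded
module list

# Unload a single module, or clear everything when finished
module unload R
module purge
</pre>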
 
Cheaha's software stack is built with [http://www.rocksclusters.org/ ROCKS], a Linux-based cluster distribution. Cheaha's operating system is CentOS with the following major cluster components:
* BrightCM 7.2
* CentOS 7.2 x86_64
* [[Slurm]] 15.08
 
A brief summary of some of the available computational software and tools includes:
* Amber
* FFTW
* Gromacs
* GSL
* NAMD
* VMD
* Intel Compilers
* GNU Compilers
* Java
* R
* OpenMPI
* MATLAB
 
=== Network ===
 
Cheaha is connected to the UAB Research Network, which provides a dedicated 10Gbps networking backplane between clusters located in the UAB Shared Computing Facility and the Department of Computer and Information Science HPC Center.  At present only Cheaha and Ferrum are connected via these 10Gbps interfaces. Data transfer rates of almost 8Gbps between these hosts have been demonstrated using GridFTP, a multi-channel file transfer service that is used by GridWay to move data between clusters as part of the job management operations.  This performance promises very efficient job management and the seamless integration of other clusters as connectivity to the research network is expanded.
 
=== Benchmarks ===
 
The continuous resource improvement process involves collecting benchmarks of the system.  One of the measures of greatest interest to users of the system is benchmarking of specific application codes.  The following benchmarks have been performed on the system and will be expanded as additional benchmarks are completed.
 
* [[Cheaha-BGL_Comparison|Cheaha-BGL Comparison]]
 
* [[Gromacs_Benchmark|Gromacs]]
 
* [[NAMD_Benchmarks|NAMD]]
 
=== Performance Statistics ===
 
Cheaha uses Ganglia to report cluster performance data. This information provides a helpful overview of the current and historical operating stats for Cheaha.  You can access the Ganglia monitoring page [https://cheaha.uabgrid.uab.edu/ganglia/ here].
 
== Availability ==
 
Cheaha is a general-purpose computer resource made available to the UAB community by UAB IT.  As such, it is available for legitimate research and educational needs and is governed by [http://www.uabgrid.uab.edu/aup UAB's Acceptable Use Policy (AUP)] for computer resources. 
 
Many software packages commonly used across UAB are available via Cheaha.
 
To request access to Cheaha, please [mailto:support@vo.uabgrid.uab.edu send a request] to the cluster support group.
 
Cheaha's intended use implies broad access to the community; however, no guarantees are made that specific computational resources will be available to all users.  Availability guarantees can only be made for reserved resources.
 
=== Secure Shell Access ===
 
Please configure your secure shell client software to use the official host name to access Cheaha:
 
<pre>
cheaha.rc.uab.edu
</pre>
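For example, a typical login and file transfer from a terminal look like the following; ''blazerid'' is a placeholder for your own account name:

<pre>
# Log in to the cluster (replace blazerid with your own user name)
ssh blazerid@cheaha.rc.uab.edu

# Copy a local file to your home directory on the cluster
scp mydata.csv blazerid@cheaha.rc.uab.edu:~/
</pre>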
 
== Scheduling Framework ==
 
[http://slurm.schedmd.com/ Slurm] is a queue management system; the name stands for Simple Linux Utility for Resource Management. Slurm was developed at Lawrence Livermore National Laboratory and currently runs some of the largest compute clusters in the world. '''[[Slurm]]''' is now the primary job manager on Cheaha; it replaces Sun Grid Engine (SGE), the job manager used previously.
 
Slurm is similar in many ways to Grid Engine and most other queue systems: you write a batch script and then submit it to the queue manager (scheduler). The queue manager then schedules your job to run on the queue (or '''partition''' in Slurm parlance) that you designate. Below is a brief sketch of how to submit a job to Slurm and monitor its progress; see the '''[[Slurm]]''' page for details.
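As an illustrative sketch only (the partition, module, and file names below are placeholders; run ''sinfo'' to see which partitions are actually configured on Cheaha), a minimal batch script looks like this:

<pre>
#!/bin/bash
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --partition=express       # partition (queue); run sinfo to list valid choices
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=1         # CPU cores per task
#SBATCH --mem-per-cpu=1G          # memory per core
#SBATCH --time=00:10:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=example-%j.out   # output file (%j expands to the job id)

module load R                     # load whatever software the job needs
Rscript analysis.R                # replace with your own command
</pre>

Submit the script with ''sbatch jobscript.sh''; Slurm prints the assigned job id. Monitor your jobs with ''squeue -u $USER'' and cancel a job with ''scancel'' followed by the job id.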
 
== Support ==
 
Operational support for Cheaha is provided by the Research Computing group in UAB IT.  For questions regarding the operational status of Cheaha, please send your request to [mailto:support@vo.uabgrid.uab.edu support@vo.uabgrid.uab.edu].  As a user of Cheaha you will automatically be subscribed to the hpc-announce email list.  This subscription is mandatory for all users of Cheaha.  It is our way of communicating important information regarding Cheaha to you.  The traffic on this list is restricted to official communication and has a very low volume.
 
We have limited capacity, however, to support non-operational issues like "How do I write a job script?" or "How do I compile a program?".  For such requests, you may find it more fruitful to send your questions to the hpc-users email list and request help from your peers in the HPC community at UAB.  As with all mailing lists, please observe [http://lifehacker.com/5473859/basic-etiquette-for-email-lists-and-forums common mailing list etiquette].
 
Finally, please remember that as you learned about HPC from others, it becomes part of your responsibility to help others on their quest.  You should update this documentation or respond to the mailing list requests of others.
 
You can subscribe to hpc-users by sending an email to:
 
[mailto:sympa@vo.uabgrid.uab.edu?subject=subscribe%20hpc-users  sympa@vo.uabgrid.uab.edu with the subject ''subscribe hpc-users''].
 
You can unsubscribe from hpc-users by sending an email to:
 
[mailto:sympa@vo.uabgrid.uab.edu?subject=unsubscribe%20hpc-users  sympa@vo.uabgrid.uab.edu with the subject ''unsubscribe hpc-users''].
 
You can review the list archives in the [http://vo.uabgrid.uab.edu/sympa/arc/hpc-users hpc-users web archives].
 
If you need help using the list service please send an email to:
 
[mailto:sympa@vo.uabgrid.uab.edu?subject=help sympa@vo.uabgrid.uab.edu with the subject ''help'']
 
If you have questions about the operation of the list itself, please send an email to the owners of the list:
 
[mailto:hpc-users-request@vo.uabgrid.uab.edu hpc-users-request@vo.uabgrid.uab.edu with a subject relevant to your issue with the list]
 
If you are interested in contributing to the enhancement of HPC features at UAB or would like to talk to other cluster administrators, [mailto:sympa@vo.uabgrid.uab.edu?subject=subscribe%20hpc-dev please join the hpc developers community at UAB].
