Cheaha

'''Cheaha''' is a campus resource dedicated to enhancing research computing productivity at UAB. [http://cheaha.uabgrid.uab.edu Cheaha] is managed by [http://www.uab.edu/it UAB Information Technology's Research Computing group (UAB ITRC)] and is available to members of the UAB community in need of increased computational capacity. Cheaha supports [http://en.wikipedia.org/wiki/High-performance_computing high-performance computing (HPC)] and [http://en.wikipedia.org/wiki/High-throughput_computing high-throughput computing (HTC)] paradigms.
  
Cheaha provides users with a traditional command-line interactive environment with access to many scientific tools that can leverage its dedicated pool of local compute resources.  Alternately, users of graphical applications can start a [[Setting_Up_VNC_Session|cluster desktop]]. The local compute pool provides access to compute hardware based on the [http://en.wikipedia.org/wiki/X86_64 x86-64 64-bit architecture].  The compute resources are organized into a unified Research Computing System. The compute fabric for this system is anchored by the Cheaha cluster, [[ Resources |a commodity cluster with approximately 2400 cores]] connected by low-latency Fourteen Data Rate (FDR) InfiniBand networks.  The compute nodes are backed by 6.6PB raw GPFS storage on DDN SFA12KX hardware, an additional 20TB available for home directories on a traditional Hitachi SAN, and other ancillary services. The compute nodes combine to provide over 110TFlops of dedicated computing power.
  
Cheaha is composed of resources that span data centers located in the UAB Shared Computing facility in the UAB 936 Building and the RUST Computer Center. Resource design and development is led by UAB IT Research Computing in open collaboration with community members. Operational [mailto:support@listserv.uab.edu support] is provided by UAB IT's Research Computing group.
  
Cheaha is named in honor of [http://en.wikipedia.org/wiki/Cheaha_Mountain Cheaha Mountain], the highest peak in the state of Alabama.  Cheaha is a popular destination whose summit offers clear vistas of the surrounding landscape. (Cheaha Mountain photo-streams on [http://www.flickr.com/search/?q=cheaha Flickr] and [http://picasaweb.google.com/lh/view?q=cheaha&psc=G&filter=1# Picasa].)
== Using ==
=== Getting Started ===
For information on getting an account, logging in, and running a job, please see [[Cheaha2_GettingStarted|Getting Started]].
  
 
== History ==
 
  
[[Image:Research-computing-platform.png|right|thumb|450px|Logical Diagram of Cheaha Configuration]]
=== 2005 ===
  
 
In 2002 UAB was awarded an infrastructure development grant through the NSF EPSCoR program.  This led to the 2005 acquisition of a 64-node compute cluster with two AMD Opteron 242 1.6GHz CPUs per node (128 total cores).  This cluster was named Cheaha.  Cheaha expanded the compute capacity available at UAB and was the first general-access resource for the community. It led to expanded roles for UAB IT in research computing support through the development of the UAB Shared HPC Facility in BEC and provided further engagement in Globus-based grid computing resource development on campus via UABgrid and regionally via [http://www.suragrid.org SURAgrid].
 
  
=== 2008 ===
  
In 2008, money was allocated by UAB IT for hardware upgrades, which led to the acquisition of an additional 192 cores based on a Dell clustering solution with Intel Quad-Core E5450 3.0GHz CPUs in August of 2008. This upgrade migrated Cheaha's core infrastructure to the Dell blade clustering solution. It provided a threefold increase in processor density over the original hardware and enabled more computing power to be located in the same physical space with room for expansion, an important consideration in light of the continued growth in processing demand.  This hardware represented a major technology upgrade that included space for additional expansion to address overall capacity demand and enable resource reservation.
  
The 2008 upgrade began a continuous resource improvement plan that includes a phased development approach for Cheaha with on-going increases in capacity and feature enhancements being brought into production via an [http://projects.uabgrid.uab.edu/cheaha open community process].
  
Software improvements rolled into the 2008 upgrade included grid computing services to access distributed compute resources and orchestrate jobs using the [http://www.gridway.org GridWay] meta-scheduler. An initial 10 Gigabit Ethernet link establishing the UABgrid Research Network was designed to support high-speed data transfers between clusters connected to this network.
  
=== 2009 ===

In 2009, annual investment funds were directed toward establishing a fully connected dual data rate Infiniband network between the compute nodes added in 2008 and laying the foundation for a research storage system with a 60TB DDN storage system accessed via the Lustre distributed file system.  The Infiniband and storage fabrics were designed to support significant increases in research data sets and their associated analytical demand.

=== 2010 ===

In 2010, UAB was awarded an NIH Small Instrumentation Grant (SIG) to further increase analytical and storage capacity.  The grant funds were combined with the annual investment funds, adding 576 cores (48 nodes) based on the Intel Westmere 2.66 GHz CPU, a quad data rate Infiniband fabric with 32 uplinks, an additional 120 TB of storage for the DDN fabric, and additional hardware to improve reliability. Additional improvements to the research compute platform involved extending the UAB Research Network to link the BEC and RUST data centers and adding 20TB of user and ancillary services storage.

=== 2012 ===

In 2012, UAB IT Research Computing invested in the foundation hardware to expand long-term storage and virtual machine capabilities with the acquisition of 12 Dell 720xd systems, each containing 16 cores, 96GB RAM, and 36TB of storage, creating a 192-core, 432TB virtual compute and storage fabric.

Additional hardware investment by the School of Public Health's Section on Statistical Genetics added three 384GB large memory nodes and an additional 48 cores to the QDR Infiniband fabric.

=== 2013 ===

In 2013, UAB IT Research Computing acquired an [http://blogs.uabgrid.uab.edu/jpr/2013/03/were-going-with-openstack/ OpenStack cloud and Ceph storage software fabric] through a partnership between Dell and Inktank in order to [http://dev.uabgrid.uab.edu extend cloud computing solutions] to the researchers at UAB and enhance the interfacing capabilities for HPC.

=== 2015 ===

UAB IT received $500,000 from the university's Mission Support Fund for a compute cluster seed expansion of 48 teraflops. This added 936 cores across 40 nodes with two 12-core 2.5 GHz Intel Xeon E5-2680 v3 processors per node and an FDR InfiniBand interconnect.

UAB received a $500,000 grant from the Alabama Innovation Fund for a three-petabyte research storage array. This funding, with additional matching from UAB, provided a multi-petabyte [https://en.wikipedia.org/wiki/IBM_General_Parallel_File_System GPFS] parallel file system to the cluster, which went live in 2016.

=== 2016 ===

In 2016 UAB IT Research Computing received additional funding from the Deans of CAS, Engineering, and Public Health to grow the compute capacity provided by the prior year's seed funding.  This added additional compute nodes, providing researchers at UAB with 96 2x12-core (2304 cores total) 2.5 GHz Intel Xeon E5-2680 v3 compute nodes with an FDR InfiniBand interconnect. Of the 96 compute nodes, 36 nodes have 128 GB RAM, 38 nodes have 256 GB RAM, and 14 nodes have 384 GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs. More information can be found at [[Resources]].

In addition to the compute, the six-petabyte GPFS file system came online. This file system provided each user five terabytes of personal space, additional space for shared projects, and greatly expanded scratch storage, all in a single file system.

The 2015 and 2016 investments combined to provide a completely new core for the Cheaha cluster, allowing the retirement of earlier compute generations.

== Grant and Publication Resources ==

The following description may prove useful in summarizing the services available via Cheaha.  If you are using Cheaha for grant-funded research, please send information about your grant (funding source and grant number), a statement of intent for the research project, and a list of the applications you are using to UAB IT Research Computing.  If you are using Cheaha for exploratory research, please send a similar note on your research interest.  Finally, any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below.  Please note, your acknowledgment may also need to include an additional statement acknowledging grant-funded hardware.  We also ask that you send any references to publications based on your use of Cheaha compute resources.

=== Description of Cheaha for Grants (short) ===

UAB IT Research Computing maintains high performance compute and storage resources for investigators. The Cheaha compute cluster provides approximately 3744 CPU cores and 80 accelerators (including 72 NVIDIA P100 GPUs) interconnected via an InfiniBand network and provides over 572 TFLOP/s of aggregate theoretical peak performance. A high-performance 12PB raw GPFS storage system on DDN SFA12KX hardware is also connected to these compute nodes via the InfiniBand fabric. An additional 20TB of traditional SAN storage is also available for home directories. This general access compute fabric is available to all UAB investigators.
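Aggregate theoretical peak figures like the one above come from multiplying core counts by clock rate and by floating-point operations per cycle. The sketch below is illustrative only; the core count, clock, and FLOPs-per-cycle values are assumptions about one CPU partition, not an authoritative inventory of the cluster:

```shell
# Theoretical double-precision peak for a hypothetical CPU partition:
# 2304 cores at 2.5 GHz, 16 DP FLOPs/cycle/core (AVX2 with FMA on Haswell-class Xeons).
cores=2304
ghz=2.5
flops_per_cycle=16
awk -v c="$cores" -v g="$ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "peak: %.2f TFLOP/s\n", c * g * f / 1000 }'
# -> peak: 92.16 TFLOP/s
```

GPU accelerators contribute the balance of the quoted aggregate figure; their per-device peaks are summed the same way.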
=== Description of Cheaha for Grants (Detailed) ===

The Cyberinfrastructure supporting University of Alabama at Birmingham (UAB) investigators includes high performance computing clusters, storage, campus, statewide and regionally connected high-bandwidth networks, and conditioned space for hosting and operating HPC systems, research applications and network equipment.

==== Cheaha HPC system ====

Cheaha is a campus HPC resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (RC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high-performance computing (HPC) and high throughput computing (HTC) paradigms. Cheaha is composed of resources that span data centers located in the UAB IT Data Centers in the 936 Building and the RUST Computer Center. Research Computing in open collaboration with the campus research community is leading the design and development of these resources.

==== Compute Resources ====

The UAB Cheaha High Performance Computing environment includes a high performance cluster with approximately 3744 CPU cores, 18 GPU nodes, and large memory nodes. The compute nodes combine to provide over 572 TFlops of dedicated computing power. The Ruffner OpenStack private cloud is available to develop and host scientific applications.

==== Storage Resources ====

The high performance compute nodes are backed by a replicated 6PB (12PB raw) high speed storage system with an Infiniband fabric. Additional storage tiers for project space and archive are also available.
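The usable/raw distinction above follows from replication: if each block is stored twice, usable capacity is raw capacity divided by the replication factor. A quick sketch (the factor of 2 is inferred from the quoted 6PB-usable/12PB-raw figures, not from system documentation):

```shell
raw_pb=12          # raw capacity in petabytes
replication=2      # assumed replication factor (inferred from 6PB usable vs 12PB raw)
awk -v r="$raw_pb" -v n="$replication" 'BEGIN { printf "usable: %g PB\n", r / n }'
# -> usable: 6 PB
```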

==== Network Resources ====

The UAB Research Network is currently a dedicated 40Gbps optical link. The UAB LAN provides 1Gbps to the desktop and 10Gbps for instruments.
The research network also includes a secure Science DMZ with data transfer nodes (DTNs) connected directly to the border router that provide a "friction-free" pathway to access external data repositories and other computational resources.
UAB connects to the Internet2 high-speed research network at 100Gbps via the University of Alabama System Regional Optical Network (UASRON).
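To put these link speeds in perspective, here is a rough transfer-time estimate for a 1TB dataset. The 80% efficiency figure is an assumption; real throughput depends on protocol, tuning, and competing traffic:

```shell
tb=1               # dataset size in terabytes (decimal: 1 TB = 8000 Gb)
gbps=10            # link speed in gigabits per second
eff=0.8            # assumed protocol efficiency (not a measured value)
awk -v t="$tb" -v g="$gbps" -v e="$eff" \
    'BEGIN { printf "%.0f seconds\n", (t * 8000) / (g * e) }'
# -> 1000 seconds
```

At 100Gbps the same transfer takes roughly a tenth of the time, which is why the high-speed Internet2 uplink matters for large research datasets.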

Globus technologies provide secure, reliable and fast data transfers.

==== Personnel ====
UAB IT Research Computing currently maintains a support staff of 10, led by the Assistant Vice President for Research Computing, that includes an HPC Architect-Manager, four software developers, two scientists, a system administrator, and a project coordinator.

=== Acknowledgment in Publications ===

To acknowledge the use of Cheaha for compute time in published work, please consider adding the following to the acknowledgements section of your publication:
<blockquote>
The authors gratefully acknowledge the resources provided by the University of Alabama at Birmingham IT-Research Computing group for high performance computing (HPC) support and CPU time on the Cheaha compute cluster.
</blockquote>

If Globus was used to transfer data to/from Cheaha, please consider adding the following to the acknowledgements section of your publication:
<blockquote>
This work was supported in part by the National Science Foundation under Grant No. OAC-1541310, the University of Alabama at Birmingham, and the Alabama Innovation Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the University of Alabama at Birmingham.
</blockquote>

== System Profile ==

=== Hardware ===

See [[Resources]] for more information.

=== Software ===

Details of the software available on Cheaha can be found on the [https://docs.uabgrid.uab.edu/wiki/Cheaha_Software Installed software page], an overview follows.
Cheaha uses [http://modules.sourceforge.net/ Environment Modules] to support account configuration. Please follow these [http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Environment_Modules specific steps for using environment modules].
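A typical module session looks like the following (the package name `R` is an example; run `module avail` on Cheaha to see what is actually installed). The guard makes the sketch safe to run on machines without Environment Modules:

```shell
# Illustrative Environment Modules workflow; the `module` command/function
# only exists on systems with Environment Modules installed, so guard for it.
if command -v module >/dev/null 2>&1; then
    module avail        # list software made available through modules
    module load R       # configure the current shell for a package
    module list         # confirm which modules are loaded
    module unload R     # remove the package from the environment
else
    echo "Environment Modules not installed on this machine"
fi
```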
Cheaha's software stack is built with the [http://www.brightcomputing.com Bright Cluster Manager]. Cheaha's operating system is CentOS with the following major cluster components:
* BrightCM 7.2
* CentOS 7.2 x86_64
* [[Slurm]] 15.08

A brief summary of some of the available computational software and tools includes:
* Amber
* FFTW
* Gromacs
* GSL
* NAMD
* VMD
* Intel Compilers
* GNU Compilers
* Java
* R
* OpenMPI
* MATLAB

=== Network ===
Cheaha is connected to the UAB Research Network, which provides a dedicated 10Gbps networking backplane between clusters located in the 936 data center and the campus network core.  Data transfer rates of almost 8Gbps between these hosts have been demonstrated using GridFTP, a multi-channel file transfer service that is used to move data between clusters as part of the job management operations. This performance promises very efficient job management and the seamless integration of other clusters as connectivity to the research network is expanded.

=== Benchmarks ===

The continuous resource improvement process involves collecting benchmarks of the system.  One of the measures of greatest interest to users of the system is benchmarks of specific application codes.  The following benchmarks have been performed on the system and will be further expanded as additional benchmarks are performed.
* [[Cheaha-BGL_Comparison|Cheaha-BGL Comparison]]
* [[Gromacs_Benchmark|Gromacs]]
* [[NAMD_Benchmarks|NAMD]]

=== Cluster Usage Statistics ===
Cheaha uses Bright Cluster Manager to report cluster performance data. This information provides a helpful overview of the current and historical operating stats for Cheaha.  You can access the status monitoring page [https://cheaha-master01.rc.uab.edu/userportal/ here] (accessible only on the UAB network or through VPN).
  
 
== Availability ==
 
 
Cheaha is a general-purpose computer resource made available to the UAB community by UAB IT.  As such, it is available for legitimate research and educational needs and is governed by [http://www.uabgrid.uab.edu/aup UAB's Acceptable Use Policy (AUP)] for computer resources.   
 
  
Many software packages commonly used across UAB are available via Cheaha.
  
To request access to Cheaha, please [mailto:support@listserv.uab.edu send a request] to the cluster support group.
  
 
Cheaha's intended use implies broad access to the community, however, no guarantees are made that specific computational resources will be available to all users.  Availability guarantees can only be made for reserved resources.
 

=== Secure Shell Access ===

Please configure your client secure shell software to use the official host name to access Cheaha:

<pre>
cheaha.rc.uab.edu
</pre>
  
 
== Scheduling Framework ==
 
  
[http://slurm.schedmd.com/ Slurm] is a queue management system; the name stands for Simple Linux Utility for Resource Management. Slurm was developed at the Lawrence Livermore National Lab and currently runs some of the largest compute clusters in the world. '''[[Slurm]]''' is now the primary job manager on Cheaha; it replaces Sun Grid Engine (SGE), the job manager used earlier.
  
Slurm is similar in many ways to GridEngine or most other queue systems. You write a batch script, then submit it to the queue manager (scheduler). The queue manager then schedules your job to run on the queue (or '''partition''' in Slurm parlance) that you designate. Below we provide an outline of how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.
  
== Support ==
  
Operational support for Cheaha is provided by the Research Computing group in UAB IT. For questions regarding the operational status of Cheaha, please send your request to [mailto:support@listserv.uab.edu support@listserv.uab.edu]. For more details on optimizing your support experience, please see [[Support email]]. As a user of Cheaha you will automatically be subscribed to the hpc-announce email list. This subscription is mandatory for all users of Cheaha; it is our way of communicating important information regarding Cheaha to you.  The traffic on this list is restricted to official communication and has a very low volume.
  
We have limited capacity, however, to support non-operational issues like "How do I write a job script?" or "How do I compile a program?"  For such requests, you may find it more fruitful to send your questions to the hpc-users email list and request help from our peers in the HPC community at UAB.  As with all mailing lists, please observe [http://lifehacker.com/5473859/basic-etiquette-for-email-lists-and-forums common mailing etiquette].
  
Finally, please remember that as you learned about HPC from others, it becomes part of your responsibility to help others on their quest. You should update this documentation or respond to mailing list requests of others.
  
You can subscribe to hpc-users by sending an email to:
  
[mailto:sympa@vo.uabgrid.uab.edu?subject=subscribe%20hpc-users sympa@vo.uabgrid.uab.edu with the subject ''subscribe hpc-users''].
  
You can unsubscribe from hpc-users by sending an email to:
  
[mailto:sympa@vo.uabgrid.uab.edu?subject=unsubscribe%20hpc-users  sympa@vo.uabgrid.uab.edu with the subject ''unsubscribe hpc-users''].
  
You can review archives of the list in the [http://vo.uabgrid.uab.edu/sympa/arc/hpc-users web hpc-archives].
  
If you need help using the list service please send an email to:
  
[mailto:sympa@vo.uabgrid.uab.edu?subject=help sympa@vo.uabgrid.uab.edu with the subject ''help'']
If you have questions about the operation of the list itself, please send an email to the owners of the list:
[mailto:hpc-users-request@vo.uabgrid.uab.edu hpc-users-request@vo.uabgrid.uab.edu with a subject relevant to your issue with the list]
  
If you are interested in contributing to the enhancement of HPC features at UAB or would like to talk to other cluster administrators, [mailto:sympa@vo.uabgrid.uab.edu?subject=subscribe%20hpc-dev please join the hpc developers community at UAB].

Latest revision as of 10:16, 17 August 2021

Cheaha is a campus resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (UAB ITRC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high-performance computing (HPC) and high throughput computing (HTC) paradigms.

Cheaha provides users with a traditional command-line interactive environment with access to many scientific tools that can leverage its dedicated pool of local compute resources. Alternately, users of graphical applications can start a cluster desktop. The local compute pool provides access to compute hardware based on the x86-64 64-bit architecture. The compute resources are organized into a unified Research Computing System. The compute fabric for this system is anchored by the Cheaha cluster, a commodity cluster with approximately 2400 cores connected by low-latency Fourteen Data Rate (FDR) InfiniBand networks. The compute nodes are backed by 6.6PB raw GPFS storage on DDN SFA12KX hardware, an additional 20TB available for home directories on a traditional Hitachi SAN, and other ancillary services. The compute nodes combine to provide over 110TFlops of dedicated computing power.

Cheaha is composed of resources that span data centers located in the UAB Shared Computing facility UAB 936 Building and the RUST Computer Center. Resource design and development is lead by UAB IT Research Computing in open collaboration with community members. Operational support is provided by UAB IT's Research Computing group.

Cheaha is named in honor of Cheaha Mountain, the highest peak in the state of Alabama. Cheaha is a popular destination whose summit offers clear vistas of the surrounding landscape. (Cheaha Mountain photo-streams on Flikr and Picasa).

Contents

[edit] Using

[edit] Getting Started

For information on getting an account, logging in, and running a job, please see Getting Started.

[edit] History

Logical Diagram of Cheaha Configuration

[edit] 2005

In 2002 UAB was awarded an infrastructure development grant through the NSF EPsCoR program. This led to the 2005 acquisition of a 64 node compute cluster with two AMD Opteron 242 1.6Ghz CPUs per node (128 total cores). This cluster was named Cheaha. Cheaha expanded the compute capacity available at UAB and was the first general-access resource for the community. It lead to expanded roles for UAB IT in research computing support through the development of the UAB Shared HPC Facility in BEC and provided further engagement in Globus-based grid computing resource development on campus via UABgrid and regionally via SURAgrid.

[edit] 2008

In 2008, UAB IT allocated money for hardware upgrades, which led to the acquisition of an additional 192 cores based on a Dell clustering solution with Intel Quad-Core E5450 3.0GHz CPUs in August of 2008. This upgrade migrated Cheaha's core infrastructure to the Dell blade clustering solution. It provided a threefold increase in processor density over the original hardware and enabled more computing power to be located in the same physical space, with room for expansion, an important consideration in light of the continued growth in processing demand. This hardware represented a major technology upgrade that included space for additional expansion to address overall capacity demand and enable resource reservation.

The 2008 upgrade began a continuous resource improvement plan that includes a phased development approach for Cheaha with on-going increases in capacity and feature enhancements being brought into production via an open community process.

Software improvements rolled into the 2008 upgrade included grid computing services to access distributed compute resources and orchestrate jobs using the GridWay meta-scheduler. An initial 10-Gigabit Ethernet link establishing the UABgrid Research Network was designed to support high-speed data transfers between clusters connected to this network.

[edit] 2009

In 2009, annual investment funds were directed toward establishing a fully connected dual data rate InfiniBand network between the compute nodes added in 2008 and laying the foundation for a research storage system with a 60TB DDN storage system accessed via the Lustre distributed file system. The InfiniBand and storage fabrics were designed to support significant increases in research data sets and the associated analytical demand.

[edit] 2010

In 2010, UAB was awarded an NIH Small Instrumentation Grant (SIG) to further increase analytical and storage capacity. The grant funds were combined with the annual investment funds, adding 576 cores (48 nodes) based on the Intel Westmere 2.66GHz CPU, a quad data rate InfiniBand fabric with 32 uplinks, an additional 120TB of storage for the DDN fabric, and additional hardware to improve reliability. Additional improvements to the research compute platform involved extending the UAB Research Network to link the BEC and RUST data centers and adding 20TB of user and ancillary services storage.

[edit] 2012

In 2012, UAB IT Research Computing invested in the foundation hardware to expand long-term storage and virtual machine capabilities with the acquisition of 12 Dell 720xd systems, each containing 16 cores, 96GB RAM, and 36TB of storage, creating a 192-core, 432TB virtual compute and storage fabric.

Additional hardware investment by the School of Public Health's Section on Statistical Genetics added three 384GB large-memory nodes and an additional 48 cores to the QDR InfiniBand fabric.

[edit] 2013

In 2013, UAB IT Research Computing acquired an OpenStack cloud and Ceph storage software fabric through a partnership between Dell and Inktank in order to extend cloud computing solutions to the researchers at UAB and enhance the interfacing capabilities for HPC.

[edit] 2015

UAB IT received $500,000 from the university's Mission Support Fund for a compute cluster seed expansion of 48 teraflops. This added 936 cores across 40 nodes, each with two 12-core 2.5GHz Intel Xeon E5-2680 v3 processors, connected by an FDR InfiniBand interconnect.

UAB received a $500,000 grant from the Alabama Innovation Fund for a three petabyte research storage array. This funding with additional matching from UAB provided a multi-petabyte GPFS parallel file system to the cluster which went live in 2016.

[edit] 2016

In 2016, UAB IT Research Computing received additional funding from the Deans of CAS, Engineering, and Public Health to grow the compute capacity provided by the prior year's seed funding. This added compute nodes, providing researchers at UAB with 96 nodes, each with two 12-core 2.5GHz Intel Xeon E5-2680 v3 processors (2304 cores total), with FDR InfiniBand interconnect. Of the 96 compute nodes, 36 have 128GB RAM, 38 have 256GB RAM, and 14 have 384GB RAM. There are also four compute nodes with Intel Xeon Phi 7210 accelerator cards and four compute nodes with NVIDIA K80 GPUs. More information can be found at Resources.

In addition to the compute expansion, the six-petabyte GPFS file system came online. This file system provided each user five terabytes of personal space, additional space for shared projects, and a greatly expanded scratch storage, all in a single file system.

The 2015 and 2016 investments combined to provide a completely new core for the Cheaha cluster, allowing the retirement of earlier compute generations.

[edit] Grant and Publication Resources

The following description may prove useful in summarizing the services available via Cheaha. If you are using Cheaha for grant-funded research, please send information about your grant (funding source and grant number), a statement of intent for the research project, and a list of the applications you are using to UAB IT Research Computing. If you are using Cheaha for exploratory research, please send a similar note on your research interest. Finally, any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below. Please note, your acknowledgment may also need to include an additional statement acknowledging grant-funded hardware. We also ask that you send any references to publications based on your use of Cheaha compute resources.

[edit] Description of Cheaha for Grants (short)

UAB IT Research Computing maintains high performance compute and storage resources for investigators. The Cheaha compute cluster provides approximately 3744 CPU cores and 80 accelerators (including 72 NVIDIA P100 GPUs) interconnected via an InfiniBand network and provides over 572 TFLOP/s of aggregate theoretical peak performance. A high-performance, 12PB raw GPFS storage system on DDN SFA12KX hardware is also connected to these compute nodes via the InfiniBand fabric. An additional 20TB of traditional SAN storage is also available for home directories. This general access compute fabric is available to all UAB investigators.

[edit] Description of Cheaha for Grants (Detailed)

The Cyberinfrastructure supporting University of Alabama at Birmingham (UAB) investigators includes high performance computing clusters, storage, campus, statewide and regionally connected high-bandwidth networks, and conditioned space for hosting and operating HPC systems, research applications and network equipment.

[edit] Cheaha HPC system

Cheaha is a campus HPC resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB Information Technology's Research Computing group (RC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports high-performance computing (HPC) and high throughput computing (HTC) paradigms. Cheaha is composed of resources that span data centers located in the UAB IT Data Centers in the 936 Building and the RUST Computer Center. Research Computing in open collaboration with the campus research community is leading the design and development of these resources.

[edit] Compute Resources

The UAB Cheaha High Performance Computing environment includes a high performance cluster with approximately 3744 CPU cores, 18 GPU nodes, and large memory nodes. The compute nodes combine to provide over 572 TFLOP/s of dedicated computing power. The Ruffner OpenStack private cloud is available to develop and host scientific applications.

[edit] Storage Resources

The high performance compute nodes are backed by a replicated 6PB (12PB raw) high speed storage system with an Infiniband fabric. Additional storage tiers for project space and archive are also available.

[edit] Network Resources

The UAB Research Network is currently a dedicated 40Gbps optical link. The UAB LAN provides 1Gbps to the desktop and 10Gbps for instruments.

The research network also includes a secure Science DMZ with data transfer nodes (DTNs) connected directly to the border router that provide a "friction-free" pathway to access external data repositories and other computational resources.

UAB connects to the Internet2 high-speed research network at 100Gbps via the University of Alabama System Regional Optical Network (UASRON).

Globus technologies provide secure, reliable and fast data transfers.

[edit] Personnel

UAB IT Research Computing currently maintains a support staff of 10, led by the Assistant Vice President for Research Computing, and includes an HPC architect-manager, four software developers, two scientists, a system administrator, and a project coordinator.

[edit] Acknowledgment in Publications

To acknowledge the use of Cheaha for compute time in published work, please consider adding the following to the acknowledgements section of your publication:

The authors gratefully acknowledge the resources provided by the University of Alabama at Birmingham IT-Research Computing group for high performance computing (HPC) support and CPU time on the Cheaha compute cluster.

If Globus was used to transfer data to/from Cheaha, please consider adding the following to the acknowledgements section of your publication:

This work was supported in part by the National Science Foundation under Grant No. OAC-1541310, the University of Alabama at Birmingham, and the Alabama Innovation Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the University of Alabama at Birmingham.

[edit] System Profile

[edit] Hardware

See Resources for more information.

[edit] Software

Details of the software available on Cheaha can be found on the Installed software page, an overview follows.

Cheaha uses Environment Modules to support account configuration. Please follow these specific steps for using environment modules.
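As a sketch, a typical interactive session with environment modules might look like this; the specific module name is an assumed example, so check `module avail` for what is actually installed:

```shell
# List all modules available on the cluster
module avail

# Load a module into the current shell environment
# ("R" is an assumed example; use a name shown by `module avail`)
module load R

# Show the modules currently loaded in this session
module list

# Unload everything and return to a clean environment
module purge
```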

Cheaha's software stack is built with the Bright Cluster Manager. Cheaha's operating system is CentOS with the following major cluster components:

  • BrightCM 7.2
  • CentOS 7.2 x86_64
  • Slurm 15.08

A brief summary of some of the available computational software and tools includes:

  • Amber
  • FFTW
  • Gromacs
  • GSL
  • NAMD
  • VMD
  • Intel Compilers
  • GNU Compilers
  • Java
  • R
  • OpenMPI
  • MATLAB

[edit] Network

Cheaha is connected to the UAB Research Network, which provides a dedicated 10Gbps networking backplane between clusters located in the 936 data center and the campus network core. Data transfer rates of almost 8Gbps between these hosts have been demonstrated using GridFTP, a multi-channel file transfer service that is used to move data between clusters as part of job management operations. This performance promises very efficient job management and the seamless integration of other clusters as connectivity to the research network is expanded.

[edit] Benchmarks

The continuous resource improvement process involves collecting benchmarks of the system. One of the measures of greatest interest to users of the system is benchmarks of specific application codes. The following benchmarks have been performed on the system and will be expanded as additional benchmarks are performed.

[edit] Cluster Usage Statistics

Cheaha uses Bright Cluster Manager to report cluster performance data. This information provides a helpful overview of the current and historical operating stats for Cheaha. You can access the status monitoring page here (accessible only on the UAB network or through VPN).

[edit] Availability

Cheaha is a general-purpose computer resource made available to the UAB community by UAB IT. As such, it is available for legitimate research and educational needs and is governed by UAB's Acceptable Use Policy (AUP) for computer resources.

Many software packages commonly used across UAB are available via Cheaha.

To request access to Cheaha, please send a request to the cluster support group.

Cheaha's intended use implies broad access to the community; however, no guarantees are made that specific computational resources will be available to all users. Availability guarantees can only be made for reserved resources.

[edit] Secure Shell Access

Please configure your client secure shell software to use the official host name to access Cheaha:

cheaha.rc.uab.edu
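From a terminal, a connection might look like the following sketch; `username` is a placeholder, not an actual account name:

```shell
# Open an interactive shell on Cheaha via SSH using the official host name
# (replace "username" with your own cluster account name)
ssh username@cheaha.rc.uab.edu
```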

[edit] Scheduling Framework

Slurm is a queue management system; its name stands for Simple Linux Utility for Resource Management. Slurm was developed at Lawrence Livermore National Laboratory and currently runs on some of the largest compute clusters in the world. Slurm is now the primary job manager on Cheaha; it replaces Sun Grid Engine (SGE), the job manager used earlier.

Slurm is similar in many ways to GridEngine or most other queue systems. You write a batch script then submit it to the queue manager (scheduler). The queue manager then schedules your job to run on the queue (or partition in Slurm parlance) that you designate. Below we will provide an outline of how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.
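As a sketch of this workflow, a minimal batch script might look like the following; the partition name and resource limits shown here are assumptions and should be adjusted to the partitions actually configured on Cheaha:

```shell
#!/bin/bash
#SBATCH --job-name=example        # name displayed in the queue
#SBATCH --partition=express       # target partition (queue); an assumed name
#SBATCH --ntasks=1                # number of tasks (processes) to run
#SBATCH --cpus-per-task=1         # CPU cores per task
#SBATCH --mem-per-cpu=1G          # memory per core
#SBATCH --time=00:10:00           # wall-clock limit, HH:MM:SS
#SBATCH --output=job-%j.out       # output file; %j expands to the job ID

# Commands below run on the allocated compute node
echo "Job $SLURM_JOB_ID running on $(hostname)"
```

The script would be submitted with `sbatch example.sh`, its progress monitored with `squeue -u $USER`, and cancelled if needed with `scancel <jobid>`.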

[edit] Support

Operational support for Cheaha is provided by the Research Computing group in UAB IT. For questions regarding the operational status of Cheaha, please send your request to support@listserv.uab.edu. For more details on optimizing your support experience, please see Support email. As a user of Cheaha, you will automatically be subscribed to the hpc-announce email list. This subscription is mandatory for all users of Cheaha; it is our way of communicating important information regarding Cheaha to you. Traffic on this list is restricted to official communication and has a very low volume.

We have limited capacity, however, to support non-operational issues such as "How do I write a job script?" or "How do I compile a program?". For such requests, you may find it more fruitful to send your questions to the hpc-users email list and request help from your peers in the HPC community at UAB. As with all mailing lists, please observe common mailing etiquette.

Finally, please remember that as you learned about HPC from others, it becomes part of your responsibility to help others on their quest. You can update this documentation or respond to the mailing list requests of others.

You can subscribe to hpc-users by sending an email to:

sympa@vo.uabgrid.uab.edu with the subject subscribe hpc-users.

You can unsubscribe from hpc-users by sending an email to:

sympa@vo.uabgrid.uab.edu with the subject unsubscribe hpc-users.

You can review archives of the list on the web in the hpc-archives.

If you need help using the list service please send an email to:

sympa@vo.uabgrid.uab.edu with the subject help

If you have questions about the operation of the list itself, please send an email to the owners of the list:

sympa@vo.uabgrid.uab.edu with a subject relevant to your issue with the list

If you are interested in contributing to the enhancement of HPC features at UAB or would like to talk to other cluster administrators, please join the hpc developers community at UAB.
