Cheaha
'''Cheaha''' is a campus resource dedicated to enhancing research computing productivity at UAB. [http://cheaha.uabgrid.uab.edu Cheaha] is sponsored by [http://www.uab.edu/it UAB Information Technology (UAB IT)] and is available to members of the UAB community in need of increased computational capacity.  Cheaha supports [http://en.wikipedia.org/wiki/High-performance_computing high-performance computing (HPC)] and [http://en.wikipedia.org/wiki/High-throughput_computing high-throughput computing (HTC)] paradigms and is the primary interface for leveraging computational resources on UABgrid,  the campus distributed research support infrastructure.
The Cheaha overview page has moved to Research Computing's new documentation site. Please visit https://docs.rc.uab.edu/ for information on Cheaha.


Cheaha includes a dedicated pool of local compute resources and provides seamless access to remote compute resources through the use of inter-cluster scheduling technologies.  The local compute pool contains two processor banks based on the [http://en.wikipedia.org/wiki/X86_64 x86-64 64-bit architecture]. 192 3.0GHz cores and 120 1.6GHz cores combine to provide nearly 3 TFLOPS of dedicated computing power.
The obsolete content of the original page can be found at [[Obsolete: Cheaha]] for historical reference.
 
Use of the local compute pool is governed by scheduling policies designed to maximize availability of total capacity and ensure guaranteed access to reserved resources.  Use of the remote compute pool is contingent upon allocations for individual users on specific resources. Incorporation of remote resources enables simplified management of scientific workflows and can significantly increase available compute capacity.
 
Cheaha is located in the UAB Shared Computing facility in BEC. Resource design and development is led by UAB IT Infrastructure Services in open collaboration with community members.  Development effort is coordinated through [http://projects.uabgrid.uab.edu/cheaha Cheaha's project web site].  Operational support is provided by UAB IT's High Performance Computing Services group.
 
Cheaha is named in honor of [http://en.wikipedia.org/wiki/Cheaha_Mountain Cheaha Mountain], the highest peak in the state of Alabama.  Cheaha is a popular destination whose summit offers clear vistas of the surrounding landscape. (Cheaha Mountain photo-streams on [http://www.flickr.com/search/?q=cheaha Flickr] and [http://picasaweb.google.com/lh/view?q=cheaha&psc=G&filter=1# Picasa]).
 
== History ==
 
[[Image:cheaha-2phase-flat.png|right|thumb|450px|Logical Diagram of Cheaha Configuration with Development Phase 1 and Phase 2 Highlighted]]
 
In 2002 UAB was awarded an infrastructure development grant through the NSF EPSCoR program.  This led to the 2005 acquisition of a 64 node compute cluster with two AMD Opteron 242 1.6GHz CPUs per node (128 total cores).  This cluster was named Cheaha.  Cheaha expanded the compute capacity available at UAB and was the first general-access resource for the community. It led to expanded roles for UAB IT in research computing support through the development of the UAB Shared HPC Facility in BEC and provided further engagement in Globus-based grid computing resource development on campus via UABgrid and regionally via [http://www.suragrid.org SURAgrid].
 
=== 2008 Upgrade ===
 
In 2008, UAB IT allocated funds for hardware upgrades, which led to the acquisition of an additional 192 cores based on a Dell clustering solution with Intel Quad-Core E5450 3.0GHz CPUs in August 2008. This upgrade migrated Cheaha's core infrastructure to the Dell blade clustering solution. It provided a three-fold increase in processor density over the original hardware and enables more computing power to be located in the same physical space with room for expansion, an important consideration in light of the continued growth in processing demand.  This hardware represented a major technology upgrade that included space for additional expansion to address overall capacity demand and enable resource reservation.
 
This upgrade also included enhancements to enable access to the aggregate compute power available to the UAB community and improve management of compute jobs across clusters that are part of the UABgrid computing infrastructure. 10Gigabit Ethernet connectivity to the UABgrid Research Network supports high speed data transfers between clusters connected to this network, enabling efficient job staging on multiple resources. [http://www.gridway.org GridWay-based] meta-scheduling enables management of compute jobs across cluster boundaries and brings grid-computing into production.
 
=== Continuous Resource Improvement ===
 
The 2008 upgrade began a phased development approach for Cheaha with on-going increases in capacity and feature enhancements being brought into production via an [http://projects.uabgrid.uab.edu/cheaha open community process].  The first two phases are represented in the diagram on the right, which highlights the logical connectivity between resources.  Phase 1 is scheduled for production in January 2009.
 
== Grant and Publication Resources ==
 
The following description may prove useful in summarizing the services available via Cheaha.  If you are using Cheaha for grant-funded research, please send information about your grant (funding source and grant number), a statement of intent for the research project, and a list of the applications you are using to UAB IT Research Computing.  If you are using Cheaha for exploratory research, please send a similar note on your research interest.  Finally, any publications that rely on computations performed on Cheaha should include a statement acknowledging the use of UAB Research Computing facilities in your research; see the suggested example below.  Please note, your acknowledgment may also need to include an additional statement acknowledging grant-funded hardware.
 
=== Description of Cheaha for Grants ===
 
UAB IT Research Computing maintains high performance compute and storage resources for investigators.  The Cheaha compute cluster includes 192 3.0GHz Intel-based compute cores with 386GB of RAM interconnected via a DDR Infiniband network.  A high-performance, 60TB Lustre parallel file system built on a DataDirect Networks (DDN) hardware platform is also connected to these cores via the Infiniband fabric.  An additional 40TB of traditional shared storage and an auxiliary 120 1.6GHz AMD-based compute cores are available via a 1GigE network fabric.  This core general-access compute fabric of 312 cores and 100TB of storage is available to all UAB investigators.
 
NIH funded investigators are granted priority access to an expanded compute and storage pool that includes an additional 576 3.0GHz Intel-based compute cores, 2.3TB RAM, and 120TB in the high-performance Lustre parallel file system, all interconnected via a QDR Infiniband network fabric.  ScaleMP software is available on this enhanced compute fabric to combine these resources into aggregate system images that can leverage the entire collection within a single system image.
 
=== Acknowledgment in Publications ===
 
This work was supported in part by the research computing resources acquired and managed by UAB IT Research Computing. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Alabama at Birmingham.
 
NIH funded researchers who leverage the additional priority resources available to them should also include a statement of support regarding this NIH funded hardware.
 
== System Profile ==
 
=== Hardware ===
 
Cheaha is composed of a number of commodity hardware components.  The current configuration is based on the [http://www.dell.com/content/products/productdetails.aspx/pedge_m1000e?c=us&l=en&s=bsd&cs=04 Dell M1000e] cluster technology which supports modular configurations of blade-based servers.  The 2008 Hardware Upgrade included a 2950 head node with 24 server blades across 2 blade chassis, each with 2 [http://ark.intel.com/cpu.aspx?groupID=33083 3.0GHz quad-core Intel Xeon E5450 processors] and 16GB RAM (2GB/core). This processing power is supplemented with the 1st generation system components from the original cluster acquisition, which include 60 compute blades with 2 1.6GHz AMD Opteron 242 processors and 2GB RAM (1GB/core).
 
Summarized, Cheaha's dedicated compute pool includes:
* 192 cores at 3.0GHz with 2GB RAM per core
* 120 cores at 1.6GHz with 1GB RAM per core
 
=== Software ===
 
Details of the software available on Cheaha can be found on the [http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Installed_software Cheaha cluster configuration page], an overview follows.
 
Cheaha uses [http://modules.sourceforge.net/ Environment Modules] to support account configuration. Please follow these [http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Environment_Modules specific steps for using environment modules].
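A typical modules session looks like the following sketch. The module name shown is illustrative; run <nowiki>module avail</nowiki> on Cheaha to see what is actually installed.

```shell
module avail            # list the modules available on the system
module load openmpi     # add OpenMPI to the current shell environment
module list             # confirm which modules are currently loaded
module unload openmpi   # remove the module from the environment again
```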
 
Cheaha's software stack is built with [http://www.rocksclusters.org/ ROCKS], a Linux-based cluster distribution. Cheaha's operating system is CentOS with the following major cluster components:
* [http://www.rocksclusters.org/roll-documentation/base/5.1/ Rocks 5.1 (V.I)]
* [http://www.centos.org/docs/5/ CentOS 5.2]
* [http://gridengine.sunsource.net/documentation.html SGE 6.1u5]
* [http://globus.org/toolkit/docs/4.0/ Globus 4.0.8]
* [http://www.gridway.org/doku.php?id=documentation:howto GridWay 5.4.0]
 
A summary of the computational software and tools available includes:
* Amber
* FFTW
* Gromacs
* GSL
* NAMD
* VMD
* Intel Compilers
* GNU Compilers
* Java
* R
* OpenMPI
 
=== Network ===
 
Cheaha is connected to the UAB Research Network, which provides a dedicated 10Gbps networking backplane between clusters located in the UAB Shared Computing Facility and the Department of Computer and Information Science HPC Center.  At present only Cheaha and Ferrum are connected via these 10Gbps interfaces. Data transfer rates of almost 8Gbps between these hosts have been demonstrated using GridFTP, a multi-channel file transfer service that is used by GridWay to move data between clusters as part of job management operations.  This performance promises very efficient job management and the seamless integration of other clusters as connectivity to the research network is expanded.
 
=== Benchmarks ===
 
The continuous resource improvement process involves collecting benchmarks of the system.  One of the measures of greatest interest to users of the system is benchmarks of specific application codes.  The following benchmarks have been performed on the system and will be further expanded as additional benchmarks are performed.
 
* [[Gromacs_Benchmark|Gromacs]]
 
=== Performance Statistics ===
 
Cheaha uses Ganglia to report cluster performance data. This information provides a helpful overview of the current and historical operating stats for Cheaha.  You can access the Ganglia monitoring page [http://cheaha.uabgrid.uab.edu/ganglia/ here].
 
== Availability ==
 
Cheaha is a general-purpose computer resource made available to the UAB community by UAB IT.  As such, it is available for legitimate research and educational needs and is governed by [http://www.uabgrid.uab.edu/aup UAB's Acceptable Use Policy (AUP)] for computer resources. 
 
Many software packages commonly used across UAB are available via Cheaha. For more information and introductory help on using this resource please visit the [http://me.eng.uab.edu/wiki/index.php?title=Cheaha resource details page].
 
To request access to Cheaha, please [mailto:support@vo.uabgrid.uab.edu send a request] to the cluster support group.
 
Cheaha's intended use implies broad access to the community; however, no guarantees are made that specific computational resources will be available to all users.  Availability guarantees can only be made for reserved resources.
 
=== Secure Shell Access ===
 
Please configure your client secure shell software to use the official host name to access Cheaha:
 
<pre>
cheaha.uabgrid.uab.edu
</pre>
 
To ensure that you are connecting to the legitimate host, you can verify that the fingerprint presented by your secure shell client for cheaha.uabgrid.uab.edu matches:
 
<pre>
d4:2e:cc:12:95:a2:39:cc:b7:2c:d8:97:37:75:e9:6f
</pre>
 
'''Upgrade Note:''' The previous host name (cheaha.ac.uab.edu) is mapped to the new host name to support the transition for existing users. If you connect using the old host name, this name mapping may trigger a warning in your secure shell software about the host fingerprint change.  The preceding fingerprint enables you to confirm that you are indeed connecting to the legitimate Cheaha interface.
 
=== Account Migration ===
 
The storage systems from the original Cheaha (cheaha.ac.uab.edu) system are scheduled for decommission on '''March 1 2009'''. After this date they will no longer be available.
 
This change affects the home directories of all users who had accounts on the original Cheaha system (cheaha.ac.uab.edu).  Affected users should copy any data they want to preserve from their old home directories to their current home directory on Cheaha by March 1 2009.
 
As a convenience, the old home directories are accessible directly via the file system of the new system. The old files are available as '''read-only''' data to ensure that new data is not saved to these locations.  Users should copy any files they wish to preserve from their old home directories located at /oldcheaha/$LOGNAME to their current home directory (located at $HOME). 
 
The following command will copy all files to a folder called "oldcheaha" under your current home directory. Users are encouraged, however, to use this opportunity to selectively copy only the data they wish to preserve.
 
<pre>
cp -a /oldcheaha/$LOGNAME $HOME/oldcheaha
</pre>
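For selective copying, the same pattern can be applied to individual directories. The sketch below uses a throwaway directory under a temporary path to stand in for the real locations; on Cheaha the source would be /oldcheaha/$LOGNAME and the destination $HOME.

```shell
# Simulate a selective migration with illustrative paths.
work=$(mktemp -d)
mkdir -p "$work/oldhome/results" "$work/oldhome/scratch" "$work/newhome"
echo "keep me" > "$work/oldhome/results/run1.txt"
echo "junk"    > "$work/oldhome/scratch/tmp.txt"
# Copy only the directory worth preserving; -a keeps permissions and times.
cp -a "$work/oldhome/results" "$work/newhome/"
ls "$work/newhome"
```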
 
== Scheduling Framework ==
 
Cheaha provides performance and management improvements for scientific workflows by enabling access to the processing power of multiple clusters through the use of the [http://www.gridway.org GridWay] scheduling framework.  GridWay enables the development of scientific workflows that leverage all computing resources available to the researcher and that can be controlled through a single management interface. This feature puts state-of-the-art technology in the hands of the research community.
 
Enhancements to Cheaha in general, and its scheduling framework in particular, are intended to remain transparent to the user community. Cheaha is first and foremost a resource for predictable and dependable computation. In this spirit, Cheaha can continue to be viewed as a traditional HPC cluster that supports job management via [http://gridengine.org Sun Grid Engine (SGE)]. Users who have no need for or interest in maximized access to computational resources can continue using the familiar SGE scheduling framework to manage compute jobs on Cheaha, with the familiar restriction that SGE-managed jobs can only leverage the processing power of Cheaha's local compute pool. That is, these jobs, as in the past, cannot leverage cycles available on other clusters.
 
=== SGE ===
 
Cheaha provides access to its local compute pool via the SGE scheduler.  This arrangement is identical to the existing HPC clusters on campus and mirrors the long-established configuration of Cheaha. Researchers experienced with other SGE-based clusters should find no difficulty leveraging this feature.  For more information on getting started with SGE on Cheaha please see the [http://me.eng.uab.edu/wiki/index.php?title=Cheaha cluster resources page].
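A minimal SGE job script looks like the following sketch. The <nowiki>#$</nowiki> lines are scheduler directives read at submission time; the job name and runtime limit shown are illustrative, not Cheaha-specific settings.

```shell
#!/bin/bash
# Minimal SGE job script sketch; #$ lines are directives, plain comments to bash.
#$ -N hello_job         # job name (illustrative)
#$ -cwd                 # run in the submission directory
#$ -j y                 # merge stderr into stdout
#$ -l h_rt=00:10:00     # request a 10-minute hard runtime limit
echo "Job running on host: $(hostname)"
```

Such a script would typically be submitted with <nowiki>qsub job.sh</nowiki> and monitored with <nowiki>qstat</nowiki>.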
 
=== GridWay ===
 
Cheaha provides enhanced scientific workflow management and development capabilities via the [http://www.gridway.org GridWay] scheduling framework.  GridWay enables the orchestration of scientific workflows across multiple clusters. The pool of resources available as part of Phase 1 includes Cheaha, Olympus, Everest, and Ferrum.  This pool can be monitored by any user of Cheaha by executing the <nowiki>gwhost</nowiki> command when logged into Cheaha.
 
The GridWay framework provides two interfaces.  A scheduler interface similar to SGE is recommended for initial exploration and ordinary use. The scheduler activity can be monitored with the <nowiki>gwps</nowiki> command. Job submission and monitoring commands initiate and control jobs described in a job description file.  Outside of slightly different commands, the job description file operates as a template where specific fields are populated to affect the operation of the scheduler. These templates are less ambiguous than traditional SGE job scripts and can provide a direct migration path from SGE. A more subtle difference is that an explicit (though automated) job staging step is involved in order to start jobs. This can require more explicit handling of input and output files than is ordinarily required by SGE.
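As an illustration of the template style, a hypothetical GridWay job description might look like the sketch below (the file name, executable, and output paths are assumptions for illustration, not a Cheaha-specific configuration):

```
# hostname.jt -- hypothetical GridWay job template
EXECUTABLE  = /bin/hostname
STDOUT_FILE = hostname.out.${JOB_ID}
STDERR_FILE = hostname.err.${JOB_ID}
```

A template like this would typically be submitted with <nowiki>gwsubmit -t hostname.jt</nowiki> and monitored with <nowiki>gwps</nowiki>.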
 
Additionally, very powerful programmatic control is available via the [http://en.wikipedia.org/wiki/DRMAA DRMAA] API. DRMAA enables the development of advanced scientific workflows that can leverage any number of computational resources. GridWay provides bindings to many popular programming languages, including C, Java, Perl, Python, and Ruby.
 
Both the traditional scheduler-based interface and the DRMAA API have been explored by development groups on campus during the pilot evaluation phase of GridWay.
* UAB IT, UAB's School of Public Health [http://www.ssg.uab.edu/ Section on Statistical Genetics (SSG)], and  CIS established an [http://projects.uabgrid.uab.edu/r-group interest group for R] which leveraged the scheduler-based interface to improve the performance of select [http://en.wikipedia.org/wiki/R_(programming_language) R language] statistical analysis workflows.
* The CIS Collaborative Computing Laboratory has heavily leveraged the Java DRMAA API to develop [http://www.cis.uab.edu/ccl/index.php/DynamicBLAST DynamicBLAST], a scientific workflow to orchestrate and maximize the performance of [http://en.wikipedia.org/wiki/BLAST BLAST] across multiple resources.
 
=== GridWay Adoption ===
 
Adoption of GridWay is encouraged and future compute capacity enhancements will leverage the inherent flexibilities of GridWay. The nature of any new technology, however, implies a learning curve.  The learning curve need not be steep and direct migration of basic SGE scripts is possible.  Additionally, all Cheaha accounts are configured to support the use of GridWay as an alternative scheduler, empowering the adventurous.
 
Some important points are worth considering when evaluating adoption of GridWay:
 
* GridWay cannot perform magic.  If you ordinarily do not have access to other clusters or your code does not (or will not) run on a targeted cluster, GridWay cannot solve these problems for you.  You must ensure your codes run on all compute resources you intend to include in your scheduling pool prior to submitting jobs to those resources. 
* Migration of MPI jobs to a multi-cluster environment may involve additional effort. If you simply use MPI to coordinate the workers (rather than for low-latency peer communication), you should generally be able to structure your job to work across cluster boundaries.  Otherwise, additional effort may be required to divide your data into smaller work units. It should be noted, however, that MPI itself cannot be used to communicate across cluster boundaries. Jobs distributed across cluster boundaries that leverage MPI internally must be sub-divided to run within isolated communication domains.
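The sub-division described in the second point can be as simple as splitting an input list into fixed-size work units, each of which becomes an independent job confined to one cluster's communication domain. A minimal sketch with illustrative file names:

```shell
# Split a 100-line input list into four independent 25-line work units.
work=$(mktemp -d)
cd "$work"
seq 1 100 > input.txt
split -l 25 input.txt chunk_    # yields chunk_aa .. chunk_ad
ls chunk_*
```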
 
The UABgrid User Community stands ready to help you with GridWay adoption.  If you are interested in exploring or adapting your workflows to use GridWay please subscribe to the [mailto:uabgrid-user-subscribe@vo.uabgrid.uab.edu UABgrid-User list] and ask all questions there.  Please do not submit GridWay-related questions via the ordinary cluster support channels. Additional on-line documentation will be developed to provide migration examples to help you further explore the power of GridWay.
 
== Support ==
 
Operational support for Cheaha is provided by the High Performance Computing Support group in UAB IT.  For questions regarding Cheaha or to initiate support requests, please send your request to [mailto:support@vo.uabgrid.uab.edu support@vo.uabgrid.uab.edu].

Latest revision as of 20:13, 31 August 2022