Cheaha GettingStarted deprecated: Difference between revisions
Jpr@uab.edu (talk | contribs) (→Access: Change access request to support list instead of ME form.) |
Jpr@uab.edu (talk | contribs) (→Hardware: Rewrite to address current description of hardware and reference to main cheaha page) |
Revision as of 15:40, 15 July 2011
Information about the history and future plans for Cheaha is available on the Cheaha UABGrid Documentation page.
Access
To request an account on Cheaha, please submit an authorization request to the IT Research Computing staff. Please include some background information about the work you plan to do on the cluster and the group you work with, i.e. your lab or affiliation.
Usage of Cheaha is governed by UAB's Acceptable Use Policy (AUP) for computer resources.
The official DNS name of Cheaha's frontend machine is cheaha.uabgrid.uab.edu. If you want to refer to the machine as cheaha, you'll have to either add
search uabgrid.uab.edu
to /etc/resolv.conf (you'll need administrator access to edit this file), or add
Host cheaha
    Hostname cheaha.uabgrid.uab.edu
to your ~/.ssh/config file.
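The ssh alias can also be added from the command line. The following is a minimal sketch, assuming bash and standard grep; `add_ssh_alias` is a hypothetical helper name, and the demonstration writes to a temporary file instead of your real ~/.ssh/config.

```shell
#!/bin/bash
# add_ssh_alias (hypothetical helper): append a Host stanza to an ssh
# config file unless a stanza for that alias is already present.
add_ssh_alias() {
    local config="$1" alias="$2" fqdn="$3"
    mkdir -p "$(dirname "$config")"
    touch "$config"
    chmod 600 "$config"
    # Skip the append if a "Host <alias>" line already exists.
    if grep -qE "^[[:space:]]*Host[[:space:]]+${alias}[[:space:]]*$" "$config"; then
        return 0
    fi
    printf 'Host %s\n    Hostname %s\n' "$alias" "$fqdn" >> "$config"
}

# Demonstration against a throwaway file.
demo_config="$(mktemp)"
add_ssh_alias "$demo_config" cheaha cheaha.uabgrid.uab.edu
add_ssh_alias "$demo_config" cheaha cheaha.uabgrid.uab.edu   # second call is a no-op
cat "$demo_config"
```

To apply it for real, pass "$HOME/.ssh/config" as the first argument; after that, `ssh cheaha` should resolve to the full host name.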
Hardware
Cheaha.uabgrid is made up of a head node (Dell PowerEdge 2950) with 16 GB of RAM and two quad-core Intel Xeon 3 GHz processors, and 24 compute nodes.
There are 896 cores available for batch computing. They are a collection of three generations of hardware: 576 2.8 GHz cores across 48 nodes, each with 48 GB RAM and interconnected with QDR InfiniBand; 192 3.0 GHz cores across 24 nodes, each with 16 GB RAM and interconnected with DDR InfiniBand; and 128 1.8 GHz cores across 64 nodes, each with 2 GB RAM and interconnected with 1 Gb/s Ethernet.
Cheaha has a 240 TB Lustre high performance file system attached via InfiniBand and GigE (depending on the node's connectivity) for use as a scratch file system. An additional 40 TB are available for general research storage.
Further details on the hardware are available on the main Cheaha page.
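The core counts quoted above can be cross-checked with a little shell arithmetic. This is only an illustrative sketch: the labels gen3, gen2 and gen1 are made up here to tag the three hardware generations, and the cores-per-node figures are derived from the numbers above, not stated on this page.

```shell
#!/bin/bash
# Each input line: <label> <total cores> <nodes>
# The labels are hypothetical tags for the three hardware generations.
total=0
while read -r label cores nodes; do
    echo "$label: $cores cores / $nodes nodes = $((cores / nodes)) cores per node"
    total=$((total + cores))
done <<'EOF'
gen3 576 48
gen2 192 24
gen1 128 64
EOF
echo "total cores: $total"
```

The per-node figure is useful when choosing the slot count for a single-node (smp) job, since a job cannot use more cores on one node than that node has.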
Cluster Software
- Rocks 5.3
- CentOS 5.4 x86_64
- Grid Engine 6.2u4
- Globus 4.0.8
- Gridway 5.4.0
Storage
Home directories
Your home directory on Cheaha is NFS-mounted to the compute nodes as /home/$USER or $HOME. It is acceptable to use your home directory as a location to store job scripts, custom code, and libraries.
The home directory must not be used to store large amounts of data.
Scratch
Research Computing policy requires that all bulky input and output must be located on the scratch space. The home directory is intended to store your job scripts, log files, libraries and other supporting files.
Cheaha has two types of scratch space, network mounted and local.
- Network scratch ($UABGRID_SCRATCH) is available on the head node and each compute node. This storage is a Lustre high performance file system providing roughly 40TB of storage. This should be your job's primary working directory, unless the job would benefit from local scratch (see below).
- Local scratch is physically located on each compute node and is not accessible to the other nodes (including the head node). This space is useful if the job performs a lot of file I/O. Most of the jobs that run on our clusters do not fall into this category. Because local scratch is inaccessible outside the job, you must move any data between local scratch and your network-accessible scratch within the job itself. For example, step 1 in the job could be to copy the input from $UABGRID_SCRATCH to /scratch/$USER, step 2 to execute the code, and step 3 to move the data back to $UABGRID_SCRATCH.
Important Information:
- Scratch space (network and local) is not backed up.
- Research Computing expects each user to keep their scratch areas clean. The clusters are not to be used for archiving data.
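One way to keep a scratch area clean is to review files that have not been touched for a while. The sketch below is an assumption-laden example, not a Research Computing tool: `stale_files` is a hypothetical helper name, the 30-day threshold is arbitrary, and `touch -d` with a relative date relies on GNU coreutils. It only lists files; deleting them is left to you.

```shell
#!/bin/bash
# stale_files (hypothetical helper): list regular files under a directory
# whose modification time is more than N days in the past (default 30).
stale_files() {
    local dir="$1" days="${2:-30}"
    find "$dir" -type f -mtime +"$days" -print
}

# Demonstration in a temporary directory standing in for $UABGRID_SCRATCH.
demo_dir="$(mktemp -d)"
touch "$demo_dir/fresh.out"
touch -d '90 days ago' "$demo_dir/old.out"   # GNU touch syntax
stale_files "$demo_dir" 30
```

On the cluster you would call `stale_files "$UABGRID_SCRATCH" 30` and archive or remove whatever it reports.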
Network Scratch
Network scratch is available using the environment variable $UABGRID_SCRATCH or directly at /lustre/scratch/$USER.
It is advisable to use the environment variable whenever possible rather than the hard-coded path.
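A script can prefer the environment variable and fall back to the literal path only when the variable is unset. This is a small sketch; `scratch_dir` is a hypothetical function name, and the fallback simply reuses the /lustre/scratch/$USER path given above.

```shell
#!/bin/bash
# scratch_dir (hypothetical helper): print the network scratch directory,
# preferring $UABGRID_SCRATCH and falling back to the documented path.
scratch_dir() {
    printf '%s\n' "${UABGRID_SCRATCH:-/lustre/scratch/${USER:-$(id -un)}}"
}

echo "Working directory for bulky data: $(scratch_dir)"
```

Keeping the lookup in one place means a job script continues to work if the scratch file system is ever mounted somewhere else and the variable is updated.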
Local Scratch
Each compute node has a local scratch directory, /scratch. If your job performs a lot of file I/O, it may run faster (and possibly more reliably) using /scratch/$USER rather than reading and writing to your network-mounted scratch directory. Approximately 40GB of local scratch space is available on each compute node.
The following is a typical sequence of events within a job script using local scratch:
- Create a directory for your job under /scratch/$USER
mkdir -p /scratch/$USER/$JOB_ID
- Copy the data from $UABGRID_SCRATCH to /scratch/$USER
cp -a $UABGRID_SCRATCH/GeneData /scratch/$USER/$JOB_ID/
- Run the application
geneapp -S 1 -D 10 < /scratch/$USER/$JOB_ID/GeneData > /scratch/$USER/$JOB_ID/geneapp.out
- Delete anything that you don't want to move back to network scratch (for example the copy of the input data)
rm -rf /scratch/$USER/$JOB_ID/GeneData
- Move the data that you want to keep from local to network scratch
mv /scratch/$USER/$JOB_ID $UABGRID_SCRATCH/
The following is an example of what the code might look like in a job script (remember if the job is an array job, you may need to use /scratch/$USER/$JOB_ID/$SGE_TASK_ID to prevent multiple tasks from overwriting each other):
    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #
    #$ -N scratch_example
    #$ -pe smp 1
    #$ -l h_rt=00:20:00,s_rt=00:18:00,vf=2G
    #$ -j y
    #
    #$ -M YOUR_EMAIL_ADDRESS
    #$ -m eas
    #
    module load R/R-2.9.0
    if [ ! -d /scratch/$USER/$JOB_ID ]; then
      mkdir -p /scratch/$USER/$JOB_ID
      chmod 700 /scratch/$USER
    fi
    cp -a $UABGRID_SCRATCH/GeneData /scratch/$USER/$JOB_ID/
    $HOME/bin/geneapp -S 1 -D 10 < /scratch/$USER/$JOB_ID/GeneData > /scratch/$USER/$JOB_ID/geneapp.out
    rm -rf /scratch/$USER/$JOB_ID/GeneData
    mv /scratch/$USER/$JOB_ID $UABGRID_SCRATCH/
By default, the $USER environment variable contains your login ID and the grid engine will populate $JOB_ID and $SGE_TASK_ID variables with the correct job and task IDs.
The mkdir command creates the full directory path (the -p switch is important). The chmod ensures that other users are not able to view files under this directory.
Please make sure to clean up the scratch space. This space is not to be used as a long term storage device for data and is subject to being erased without notice if the file systems fill up.
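For array jobs, the same staging pattern can be wrapped in a function that uses a per-task directory and cleans up local scratch even when a step fails. The following is a sketch under stated assumptions: `stage_and_run` is a hypothetical helper, `wc -l` stands in for the real application, GeneData is treated as a single file, and the results/ layout on network scratch is invented for the example. The demonstration uses temporary directories so it can be dry-run outside Grid Engine.

```shell
#!/bin/bash
# stage_and_run (hypothetical helper): copy input to per-task local scratch,
# run a stand-in application, move the output back, and always clean up.
stage_and_run() {
    local local_root="$1" net_root="$2" job_id="$3" task_id="$4"
    local work="$local_root/$job_id/$task_id"
    local out="$net_root/results/$job_id/$task_id"   # assumed layout

    mkdir -p "$work" "$out" || return 1
    chmod 700 "$local_root"

    (
        # Remove the task's local directory when this subshell exits,
        # whether the steps below succeed or fail.
        trap 'rm -rf "$work"' EXIT
        cp -a "$net_root/GeneData" "$work/" || exit 1
        wc -l < "$work/GeneData" > "$work/app.out" || exit 1   # stand-in for the application
        mv "$work/app.out" "$out/"
    )
}

# Dry run with temporary directories in place of /scratch/$USER and $UABGRID_SCRATCH.
demo_net="$(mktemp -d)"
demo_local="$(mktemp -d)"
printf 'a\nb\nc\n' > "$demo_net/GeneData"
stage_and_run "$demo_local" "$demo_net" "${JOB_ID:-0}" "${SGE_TASK_ID:-1}"
cat "$demo_net/results/${JOB_ID:-0}/${SGE_TASK_ID:-1}/app.out"
```

Inside a real job script the call would be `stage_and_run "/scratch/$USER" "$UABGRID_SCRATCH" "$JOB_ID" "$SGE_TASK_ID"`, so that each task of the array works in its own directory.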
Project Storage
Cheaha has a location where shared data can be stored, called $UABGRID_PROJECT.
This is helpful if a team of researchers must access the same data. Please open a help desk ticket to request a project directory under $UABGRID_PROJECT.
Environment Modules
Environment Modules is installed on Cheaha and should be used when constructing your job scripts if an applicable module file exists. Using the module command you can easily configure your environment for specific software packages without having to know the specific environment variables and values to set. Modules allows you to dynamically configure your environment without having to log out and log back in for the changes to take effect.
If you find that specific software does not have a module, please submit a helpdesk ticket to request the module.
Note: If you are using LAM MPI for parallel jobs, you must load the LAM module in both your job script and your profile. For example, assume we want to use LAM-MPI compiled for GNU:
- for BASH users add this to your ~/.bashrc and your job script, or for CSH users add this to your ~/.cshrc and your job script
module load lammpi/lam-7.1-gnu
- Cheaha supports bash completion for the module command. For example, type 'module' and press the TAB key twice to see a list of options:
module TAB TAB
    add          display      initlist     keyword      refresh      switch       use
    apropos      help         initprepend  list         rm           unload       whatis
    avail        initadd      initrm       load         show         unuse
    clear        initclear    initswitch   purge        swap         update
- To see the list of available modulefiles on the cluster, run the module avail command (note the example list below may not be complete!) or module load followed by two tab key presses:
module avail
    R/R-2.11.1                 cufflinks/cufflinks-0.9    intel/intel-compilers  mvapich-intel              rna_pipeline/rna_pipeline-0.31
    R/R-2.6.2                  eigenstrat/eigenstrat      jags/jags-1.0-gnu      mvapich2-gnu               rna_pipeline/rna_pipeline-0.5.0
    R/R-2.7.2                  eigenstrat/eigenstrat-2.0  lammpi/lam-7.1-gnu     namd/namd-2.6              s.a.g.e./sage-6.0
    R/R-2.8.1                  ent/ent-1.0.2              lammpi/lam-7.1-intel   namd/namd-2.7              samtools/samtools
    R/R-2.9.0                  fastphase/fastphase-1.4    mach/mach              openmpi/openmpi-1.2-gnu    samtools/samtools-0.1
    R/R-2.9.2                  fftw/fftw3-gnu             macs/macs              openmpi/openmpi-1.2-intel  shrimp/shrimp-1.2
    RAxML/RAxML-7.2.6          fftw/fftw3-intel           macs/macs-1.3.6        openmpi/openmpi-gnu        shrimp/shrimp-1.3
    VEGAS/VEGAS-0.8            freesurfer/freesurfer-4.5  maq/maq-0.7            openmpi/openmpi-intel      spparks/spparks
    amber/amber-10.0-intel     fregene/fregene-2008       marthlab/gigabayes     openmpi-gnu                structure/structure-2.2
    amber/amber-11-intel       fsl/fsl-4.1.6              marthlab/mosaik        openmpi-intel              tau/tau
    apbs/apbs-1.0              genn/genn                  marthlab/pyrobayes     paraview/paraview-3.4      tau/tau-2.18.2p2
    atlas/atlas                gromacs/gromacs-4-gnu      mathworks/R2009a       paraview/paraview-3.6      tau/tau-lam-intel
    birdsuite/birdsuite-1.5.3  gromacs/gromacs-4-intel    mathworks/R2009b       pdt/pdt                    tophat/tophat
    birdsuite/birdsuite-1.5.5  hapgen/hapgen              mathworks/R2010a       pdt/pdt-3.14               tophat/tophat-1.0.8
    bowtie/bowtie              hapgen/hapgen-1.3.0        mpich/mpich-1.2-gnu    phase/phase                tophat/tophat-1.1
    bowtie/bowtie-0.10         haskell/ghc                mpich/mpich-1.2-intel  plink/plink                vmd/vmd
    bowtie/bowtie-0.12         illuminus/illuminus        mpich/mpich2-gnu       plink/plink-1.05           vmd/vmd-1.8.6
    bowtie/bowtie-0.9          impute/impute              mrbayes/mrbayes-gnu    plink/plink-1.06
    chase                      impute/impute-2.0.3        mrbayes/mrbayes-intel  plink/plink-1.07
    cufflinks/cufflinks        impute/impute-2.1.0        mvapich-gnu            python/python-2.6
Some software packages have multiple module files, for example:
- plink/plink
- plink/plink-1.05
- plink/plink-1.06
In this case, the plink/plink module will always load the latest version, so loading this module is equivalent to loading plink/plink-1.06. If you always want to use the latest version, use this approach. If you want to use a specific version, use the module file containing the appropriate version number.
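If you want to pin a job script to the version that is currently the newest, it helps to sort the versioned module names numerically, since plain alphabetical order puts 0.9 after 0.12. The sketch below is only an illustration of that ordering: `latest_modulefile` is a hypothetical helper, `sort -V` is a GNU coreutils extension, and which version the unversioned module actually points to is decided by the administrators.

```shell
#!/bin/bash
# latest_modulefile (hypothetical helper): read module names on stdin and
# print the highest-versioned entry of the form <package>/<package>-<version>.
latest_modulefile() {
    local package="$1"
    grep -E "^${package}/${package}-[0-9]" | sort -V | tail -n 1
}

printf '%s\n' plink/plink plink/plink-1.05 plink/plink-1.06 | latest_modulefile plink
printf '%s\n' bowtie/bowtie bowtie/bowtie-0.10 bowtie/bowtie-0.12 bowtie/bowtie-0.9 | latest_modulefile bowtie
```

Writing the resulting name into the job script (for example `module load plink/plink-1.06`) keeps the job reproducible when a newer version is installed later.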
Some modules, when loaded, will actually load other modules. For example, the gromacs/gromacs-4-intel module will also load openmpi/openmpi-intel and fftw/fftw3-intel.
- To load a module, e.g. for a Gromacs job, use the following module load command in your job script:
module load gromacs/gromacs-4-intel
- To see a list of the modules that you currently have loaded use the module list command
module list
    Currently Loaded Modulefiles:
      1) fftw/fftw3-intel   2) openmpi/openmpi-intel   3) gromacs/gromacs-4-intel
- A module can be removed from your environment by using the module unload command:
module unload gromacs/gromacs-4-intel
module list
    No Modulefiles Currently Loaded.
- The definition of a module can also be viewed using the module show command, revealing what a specific module will do to your environment:
module show gromacs/gromacs-4-intel
    -------------------------------------------------------------------
    /etc/modulefiles/gromacs/gromacs-4-intel:

    module-whatis    Sets up gromacs-intel v4.0.2 in your enviornment
    module load      fftw/fftw3-intel
    module load      openmpi/openmpi-intel
    prepend-path     PATH /opt/uabeng/gromacs/intel/4/bin/
    prepend-path     LD_LIBRARY_PATH /opt/uabeng/gromacs/intel/4/lib
    prepend-path     MANPATH /opt/uabeng/gromacs/intel/4/man
    -------------------------------------------------------------------
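The prepend-path entries in the module definition amount to putting a directory at the front of a colon-separated environment variable. The sketch below emulates that effect in plain bash to show what loading the module changes; `prepend_path` is a hypothetical function written for this example and is not part of Environment Modules.

```shell
#!/bin/bash
# prepend_path (hypothetical helper): put a directory at the front of a
# colon-separated environment variable, as a prepend-path directive would.
prepend_path() {
    local var="$1" dir="$2"
    local current="${!var}"          # indirect expansion: value of the named variable
    if [ -z "$current" ]; then
        export "$var=$dir"
    else
        export "$var=$dir:$current"
    fi
}

prepend_path PATH /opt/uabeng/gromacs/intel/4/bin/
prepend_path LD_LIBRARY_PATH /opt/uabeng/gromacs/intel/4/lib
prepend_path MANPATH /opt/uabeng/gromacs/intel/4/man

# The first PATH entry is now the Gromacs bin directory.
echo "$PATH" | tr ':' '\n' | head -n 1
```

In practice you should still use `module load`, because `module unload` can then undo the change cleanly, which this sketch does not attempt.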
Installed software
We try to install local software in /opt, /opt/uabeng and /share/apps. However, please do not depend on a particular piece of software being in a specific directory, as we may need to move things around at some point.
In most cases, the description of each software package was copied from the author's web site and represents their own work.
If you don't find a particular package listed on this page, please open a help desk ticket to request the software.
If a module file is available for the software, it is recommended to use the module file in your job script and/or shell profile.
Software (Link to home page) | Version | Software Installation-Directory | Information |
---|---|---|---|
Amber | 10 | /opt/uabeng/amber10/intel | "Amber" refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programs which includes source code and demos.
Amber is compiled using Intel compilers and uses OpenMPI for the parallel binaries. The following Modules files should be loaded for this package (the amber module will automatically load the openmpi module): For Intel: module load amber/amber-10-intel Use the openmpi parallel environment in your job script (example for a 4 slot job) #$ -pe openmpi 4 |
APBS | 1.0.0 | /share/apps/apbs/apbs-1.0.0-amd64 | APBS - Adaptive Poisson-Boltzmann Solver APBS is a software package for the numerical solution of the Poisson-Boltzmann equation (PBE), one of the most popular continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media.
Submit APBS jobs via the Grid Engine and do not run them on the head node! module load apbs/apbs-1.0 |
Atlas | 3.8.3 | /usr/lib64/atlas | The ATLAS (Automatically Tuned Linear Algebra Software) project is an
ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK. module load atlas/atlas |
Biopython | 1.51 | Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
The Biopython packages along with its dependencies (Numpy, python-reportlab, Flex, etc...) are all installed in the default location for Python site-packages, so you should not need to modify any environment variables to use this package. | |
Birdsuite | 1.5.3 | /share/apps/birdsuite/1.5.3 | The Birdsuite is a fully open-source set of tools to detect and report SNP genotypes, common Copy-Number Polymorphisms (CNPs), and novel, rare, or de novo CNVs in samples processed with the Affymetrix platform. While most of the components of the suite can be run individually (for instance, to only do SNP genotyping), the Birdsuite is especially intended for integrated analysis of SNPs and CNVs. Support for chips and platforms other than the Affymetrix SNP 6.0 is currently limited, but we are currently working on creating the supporting files for other common genotyping platforms.
An example job submission script can be found here (copy this to your job directory and make sure to edit the email address!) /share/apps/example-scripts/birdsuite-job.qsub The following Modules files should be loaded for this package: module load birdsuite/birdsuite-1.5 |
boost | 1.33.1 | /usr/lib
/usr/lib64 |
Boost provides free peer-reviewed portable C++ source libraries.
The Boost team emphasize libraries that work well with the C++ Standard Library. Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications. The Boost license encourages both commercial and non-commercial use. Both 32bit and 64bit versions of Boost C++ libraries are provided under /usr/lib and /usr/lib64 |
Bowtie | 0.10.1 | /share/apps/bowtie/bowtie-0.10.1 | Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). It supports alignment policies equivalent to Maq and SOAP but is substantially faster.
A Bowtie tutorial is available here: http://bowtie-bio.sourceforge.net/tutorial.shtml The following Modules files should be loaded for this package: module load bowtie/bowtie-0.10 |
eigenstrat | 3.0 | /share/apps/eigenstrat | EIGENSTRAT also provides a decent FAQ on their website, click here.
"The EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification method (Price et al. 2006). The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation; the resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes." The following Modules file should be loaded for this package: module load eigenstrat/eigenstrat |
fastPHASE | 1.4.0 | /share/apps/fastPHASE/1.4 | The program fastPHASE implements methods for estimating haplotypes and missing genotypes from population SNP genotype data.
The following Modules files should be loaded for this package: module load fastphase/fastphase-1.4 |
FFTW | 3.1.2 | /opt/uabeng/fftw3/gnu
/opt/uabeng/fftw3/intel |
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
The following Modules files should be loaded for this package: For GNU: module load fftw/fftw3-gnu For Intel: module load fftw/fftw3-intel |
Gromacs | 4.0.5 | /opt/uabeng/gromacs/gnu/4
/opt/uabeng/gromacs/intel/4 |
GROMACS is a versatile package to perform molecular dynamics and is primarily designed for biochemical molecules like proteins and lipids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.
Gromacs is compiled using Intel and GNU compilers using FFTW3, BLAS, and LAPACK and OpenMPI for the parallel binaries. Single and double precision binaries are included (double precision binaries have a _d suffix). The following Modules files should be loaded for this package (module will automatically load any prerequisite modules): For GNU: module load gromacs/gromacs-4-gnu For Intel: module load gromacs/gromacs-4-intel Use the openmpi parallel environment in your job script (example for a 4 slot job) #$ -pe openmpi 4 |
GSL | 1.10 | /usr/lib
/usr/lib64 /usr/include/gsl |
The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.
The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite. |
HAPGEN | 1.3.0 | /share/apps/hapgen/1.3.0 | HAPGEN is a program that simulates case control datasets at SNP markers and can output data in the FILE FORMAT used by IMPUTE, SNPTEST and GTOOL. The approach can handle markers in LD and can simulate datasets over large regions such as whole chromosomes. Hapgen simulates haplotypes by conditioning on a set of population haplotypes and an estimate of the fine-scale recombination rate across the region.
Command line syntax example for HAPGEN can be found by clicking here. The HAPGEN environment module can be loaded as follows module load hapgen/hapgen |
IMPUTE v2 | 2.0.3 | /share/apps/impute/2.0.3 | IMPUTE v2 is a new genotype imputation algorithm based on ideas described in Howie et al. (2009).
For the specific version module load impute/impute-2.0.3 Or to use the latest module load impute/impute Examples for IMPUTE v2 are provided in the $IMPUTEHOME/Example directory. |
JAGS | 1.0.3 | /share/apps/jags/jags-1.0.3/gnu | JAGS (Just Another Gibbs Sampler) is a Bayesian hierarchical model analysis program using Markov Chain Monte Carlo (MCMC) simulation. It is similar to BUGS but will compile on Linux systems.
Click here for a good description of JAGS and how it differs from BUGS. The JAGS environment module can be loaded as follows module load jags/jags-1.0-gnu
|
Java JDK | 1.5.0_10 | /usr/java/jdk1.5.0_10 | JDK (Java Development Kit) and Runtime from Sun |
JRE | 1.6.0_04 | /usr/java/jre1.6.0_04 | Java Runtime |
Intel | 10.1.015 | /opt/intel/cce
/opt/intel/fce /opt/intel/mkl |
Intel C, C++ and Fortran compilers along with the Intel Math Kernel Libraries
The following Modules file should be loaded for this package: module load intel/intel-compilers-10.1 |
LAM-MPI | 7.1.4 | /opt/uabeng/lam/gnu
/opt/uabeng/lam/intel |
LAM/MPI is now in a maintenance mode. Bug fixes and critical patches are still being applied, but little real "new" work is happening in LAM/MPI. This is a direct result of the LAM/MPI Team spending the vast majority of their time working on our next-generation MPI implementation -- Open MPI.
Although LAM is not going to go away any time soon (we certainly would not abandon our user base!) -- the web pages, user lists, and all the other resources will continue to be available indefinitely -- we would encourage all users to try migrating to Open MPI. Since it's an MPI implementation, you should be able to simply recompile and re-link your applications to Open MPI -- they should "just work." Open MPI contains many features and performance enhancements that are not available in LAM/MPI.
For GNU: module load lammpi/lam-7.1-gnu For Intel: module load lammpi/lam-7.1-intel In order to use LAM-MPI you must load the module in your ~/.bashrc script along with your job submit script. Add the following to your ~/.bashrc (replace -intel with -gnu if using GNU): For Bash Users edit ~/.bashrc: module load lammpi/lam-7.1-intel For Csh Users edit ~/.cshrc: module load lammpi/lam-7.1-intel Use the lam_loose_rsh parallel environment in your job script (example for a 4 slot job) #$ -pe lam_loose_rsh 4 |
MACS | 1.3.6 | /share/apps/macs/1.3.6 | Next generation parallel sequencing technologies made chromatin immunoprecipitation followed by sequencing (ChIP-Seq) a popular strategy to study genome-wide protein-DNA interactions, while creating challenges for analysis algorithms. We present Model-based Analysis of ChIP-Seq (MACS) on short reads sequencers such as Genome Analyzer (Illumina / Solexa). MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, is publicly available open source, and can be used for ChIP-Seq with or without control samples.
To load MACS into your environment, use the following module command: module load macs/macs |
Maq | 0.7.1 | /share/apps/maq/0.7.1 | Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. It is particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data.
See the Maq documentation page for usage: http://maq.sourceforge.net/maq-man.shtml The following Modules files should be loaded for this package: module load maq/maq-0.7 |
MPICH | 1.2.7p1 | /opt/mpich/gnu
/opt/mpich/intel |
GNU and Intel compiled versions of MPICH are installed under this directory
The following Modules file should be loaded to use mpich * GNU version of mpich module load mpich/mpich-1.2-gnu * Intel version of mpich module load mpich/mpich-1.2-intel Use the mpich parallel environment in your job script (example for a 4 slot job) #$ -pe mpich 4 |
NAMD | 2.6 | /share/apps/namd/2.6 | NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet.
The following Modules files should be loaded for this package: module load namd/namd-2.6 |
OpenMPI | 1.3.3 | /opt/uabeng/openmpi/gnu
/opt/uabeng/openmpi/intel |
The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.
The following Modules files should be loaded for this package:
For GNU:
    module load openmpi/openmpi-gnu
For Intel:
    module load openmpi/openmpi-intel
Use the openmpi parallel environment in your job script (example for a 4 slot job):
    #$ -pe openmpi 4
To enable verbose Grid Engine logging for OpenMPI, add --mca pls_gridengine_verbose 1 to the mpirun command in the job script, for example:
    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #
    #$ -N j_openmpi_hello
    #$ -pe openmpi 4
    #$ -l h_rt=00:20:00,s_rt=0:18:00
    #$ -j y
    #
    #$ -M USERID@uab.edu
    #$ -m eas
    #
    # Load the appropriate module files
    . /etc/profile.d/modules.sh
    module load openmpi/openmpi-gnu
    #$ -V
    mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hello_world_gnu_openmpi
|
PHASE | 2.1.1 | /share/apps/PHASE/2.1.1 | PHASE is software for haplotype reconstruction, and recombination rate estimation from population data. The software implements methods for estimating haplotypes from population genotype data described in:
The following Modules files should be loaded for this package: module load phase/phase |
PLINK | 1.06 | /share/apps/plink/1.06 | PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analysis in a computationally efficient manner.
The PLINK web site also has a tutorial section that users should read through. Please see this page for PLINK citing instructions. To load PLINK into your environment, use the following module command: module load plink/plink The following commands are available
Running gplink: You should NOT run gplink from the cheaha login node (head node), only from the compute nodes using the qrsh command. The qrsh command will provide a shell on a compute node complete with X forwarding. For example:
    [jsmith@cheaha ~]$ qrsh
    Rocks Compute Node
    Rocks 5.1 (V.I)
    Profile built 13:06 21-Nov-2008
    Kickstarted 13:13 21-Nov-2008
    [jsmith@compute-0-10 ~]$ module load plink/plink
    [jsmith@compute-0-10 ~]$ gplink
You should see the gPLINK window open. If you get an error similar to "No X11 DISPLAY variable was set", make sure your initial connection to Cheaha had X forwarding enabled.
If you want to use the PLINK R plugin functionality, please see this page http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml for instructions. You'll need to install the Rserve package to use the plugin, for example:
    install.packages("Rserve")
|
pvm | 3.4.5 | /usr/bin/pvm | PVM3 (Parallel Virtual Machine) is a library and daemon that allows
distributed processing environments to be constructed on heterogeneous machines and architectures. |
R | 2.7.2
2.8.1 2.9.0 2.9.2 2.11.1 |
/share/apps/R/2.7.2/gnu
/share/apps/R/2.8.1/gnu /share/apps/R/2.9.0/gnu /share/apps/R/2.9.2/gnu /share/apps/R/2.11.1/gnu |
R is a free software environment for statistical computing and graphics. Please refer to the following page for additional instructions for running R on Cheaha Running R Jobs on a Rocks Cluster.
The following Modules files should be loaded for this package: module load R/R-2.7.2 For other versions, simply replace the version number module load R/R-2.11.1 The following libraries are available, additional libraries should be installed by the user under ~/R_exlibs
Sample R Grid Engine Job Script
This is an example of a serial (i.e. non-parallel) R job that has a 2 hour run time limit and requests 256M of RAM:
    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #
    #$ -j y
    #$ -N rtestjob
    # Use '#$ -m n' instead to disable all email for this job
    #$ -m eas
    #$ -M YOUR_EMAIL_ADDRESS
    #$ -l h_rt=2:00:00,s_rt=1:55:00
    #$ -l vf=256M
    . /etc/profile.d/modules.sh
    module load R/R-2.7.2
    #$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD
    R CMD BATCH rscript.R
|
s.a.g.e. | 6.0.0 | /share/apps/s.a.g.e./SAGE_6.0.0_Linux64 | S.A.G.E. - Statistical Analysis for Genetic Epidemiology contains programs for use in the genetic analysis of family, pedigree and individual data.
Note: This software is NOT the same as the SAGE listed below! Make sure that every publication which presents results from using S.A.G.E. carries an appropriate acknowledgement such as: '(Some of) The results of this paper were obtained by using the program package S.A.G.E., which is supported by a U.S. Public Health Service Resource Grant (1 P41 RR03655) from the National Center for Research Resources' - (it is important that the grant numbers appear under 'acknowledgments'). Send bibliographic information about every paper in which S.A.G.E. is used (author(s), title, journal, volume and page numbers; a reprint will do provided it has the necessary information on it) to: R.C. Elston Department of Epidemiology and Biostatistics Case Western Reserve University Wolstein Research Building 2103 Cornell Road Cleveland, Ohio 44106-7281 The recommended way of referencing the S.A.G.E. programs is as follows: S.A.G.E. [2009]. Statistical Analysis for Genetic Epidemiology 6.0 Computer program package available from the Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland. Demo data files are available under /share/apps/s.a.g.e./SAGE_6.0.0_Linux64/demo/data_files To load the S.A.G.E. environment, use module load s.a.g.e./sage-6.0 |
SHRiMP | 1.3.2 | /share/apps/shrimp/SHRiMP_1_3_2 | SHRiMP is a software package for aligning genomic reads against a target genome. It was primarily developed with the multitudinous short reads of next generation sequencing machines in mind, as well as Applied Biosystem's colourspace genomic representation.
The following Modules files should be loaded for this package: module load shrimp/shrimp-1.3 |
SPRNG | 2.0a | /share/apps/sprng/2.0a | Scalable Parallel Pseudo Random Number Generators Library
The following Modules files should be loaded for this package: module load sprng/sprng-2 |
Subversion | 1.4.2 | /usr/bin/svn | Subversion is a concurrent version control system which enables one
or more users to collaborate in developing and maintaining a hierarchy of files and directories while keeping a history of all changes. Subversion only stores the differences between versions, instead of every complete file. Subversion is intended to be a compelling replacement for CVS. |
STRAT | 1.1 | /share/apps/STRAT/1.1 | STRAT is a companion program to structure. This is a structured association method, for use in association mapping, enabling valid case-control studies even in the presence of population structure. |
Structure | 2.2.2 | /share/apps/structure/2.2.2 | Structure is software for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly-used genetic markers, including SNPS, microsatellites, RFLPs and AFLPs. |
TopHat | 1.0.8 | /share/apps/tophat/1.0.8 | TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
TopHat is a collaborative effort between the University of Maryland Center for Bioinformatics and Computational Biology and the University of California, Berkeley Departments of Mathematics and Molecular and Cell Biology. A TopHat tutorial is available here: http://tophat.cbcb.umd.edu/tutorial.html The following Modules file should be loaded for this package (the tophat module also loads the bowtie module): module load tophat/tophat |
VMD | 1.8.6 | /share/apps/vmd/vmd-1.8.6 | VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
You'll need to use X forwarding to launch VMD (for example, on a Windows machine, X-Win32). The following Modules files should be loaded for this package: module load vmd/vmd-1.8.6 |
Sample Job Scripts
The following are sample job scripts. Edit them for your environment before use: replace YOUR_EMAIL_ADDRESS with your real email address, set h_rt to an appropriate runtime limit, and modify the job name and any other parameters as needed.
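The placeholder substitution can be done by hand in an editor, or with a small shell helper. The following is a minimal sketch only: the function name set_job_email is hypothetical and not part of Cheaha or Grid Engine, and it assumes the address contains no characters that are special to sed (such as | or &).

```shell
#!/bin/bash
# Hypothetical helper (not provided by the cluster): replace the
# YOUR_EMAIL_ADDRESS placeholder in a job script with a real address.
# Usage: set_job_email SCRIPT EMAIL
set_job_email() {
    local script="$1"
    local email="$2"
    local tmp status

    tmp=$(mktemp) || return 1

    # '|' is used as the sed delimiter; the address is assumed not to
    # contain '|' or '&', which sed would treat specially.
    # The result is written back with cat so the script keeps its permissions.
    sed "s|YOUR_EMAIL_ADDRESS|${email}|g" "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?

    rm -f "$tmp"
    return $status
}

# Example (commented out):
# set_job_email helloworld.qsub jsmith@example.edu
```

The runtime limit and job name still need to be reviewed by hand, since appropriate values depend on the job.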
Hello World
Hello World is the classic example used throughout programming. We don't want to buck the system, so we'll use it as well to demonstrate a simple parallel Grid Engine job script. This example also shows how to compile the code and submit the job script to Grid Engine.
- First, create a directory for the Hello World jobs
$ mkdir -p ~/jobs/helloworld
$ cd ~/jobs/helloworld
- Create the Hello World code written in C. This MPI-enabled Hello World includes a 3 minute sleep to ensure the job runs for several minutes; a normal Hello World example would finish in a matter of seconds.
$ vi helloworld-mpi.c
#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    int i, j;
    float f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    printf("Hello World from Node %d.\n", node);

    /* Keep the job running long enough to be seen in qstat */
    sleep(180);

    /* Busy work */
    for (j = 0; j <= 100000; j++)
        for (i = 0; i <= 100000; i++)
            f = i*2.718281828*i + i + i*3.141592654;

    MPI_Finalize();
    return 0;
}
- Compile the code. First purge any modules you may have loaded, then load the module for OpenMPI GNU. The mpicc command compiles the code and produces a binary named helloworld_gnu_openmpi
$ module purge
$ module load openmpi/openmpi-gnu
$ mpicc helloworld-mpi.c -o helloworld_gnu_openmpi
- Create the Grid Engine job script that will request 8 cpu slots and a maximum runtime of 10 minutes
$ vi helloworld.qsub
#$ -S /bin/bash
#$ -cwd
#
#$ -N HelloWorld
#$ -pe openmpi 8
#$ -l h_rt=00:10:00,s_rt=0:08:00
#$ -j y
#
#$ -M YOUR_EMAIL_ADDRESS
#$ -m eas
#
# Load the appropriate module files
module load openmpi/openmpi-gnu
#$ -V

mpirun -np $NSLOTS helloworld_gnu_openmpi
- Submit the job to Grid Engine and check the status using qstat
$ qsub helloworld.qsub
Your job 11613 ("HelloWorld") has been submitted

$ qstat -u $USER
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------
  11613 8.79717 HelloWorld jsmith r     03/13/2009 09:24:35 all.q@compute-0-3.local      8
- When the job completes, you should have output files named HelloWorld.o* and HelloWorld.po* (replace the asterisk with the job ID, for example HelloWorld.o11613). Because the job script sets -j y, the .o file holds both the standard output and the standard error of the job; the .po file holds output from the parallel environment's start and stop procedures. The sample output below appears to come from a variant of the program that also prints the rank count and host name, so its wording differs from the printf in the listing above.
$ cat HelloWorld.o11613
Hello world! I'm 0 of 8 on compute-0-3.local
Hello world! I'm 1 of 8 on compute-0-3.local
Hello world! I'm 4 of 8 on compute-0-3.local
Hello world! I'm 6 of 8 on compute-0-6.local
Hello world! I'm 5 of 8 on compute-0-3.local
Hello world! I'm 7 of 8 on compute-0-6.local
Hello world! I'm 2 of 8 on compute-0-3.local
Hello world! I'm 3 of 8 on compute-0-3.local
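Since the output file names include the job ID, it can be handy to capture the ID from the qsub confirmation message instead of copying it by hand. The following is a minimal sketch, assuming the message format shown above ("Your job <ID> (...) has been submitted"); the function name job_id_from_qsub is hypothetical, and the pattern deliberately ignores the different message printed for array jobs.

```shell
#!/bin/bash
# Hypothetical helper: read qsub's confirmation message on stdin and
# print the numeric job ID (the third word of the "Your job ..." line).
job_id_from_qsub() {
    awk '/^Your job / { print $3; exit }'
}

# Demonstration using the confirmation line from the example above.
# On the cluster this would instead be: qsub helloworld.qsub | job_id_from_qsub
msg='Your job 11613 ("HelloWorld") has been submitted'
job_id=$(printf '%s\n' "$msg" | job_id_from_qsub)

echo "Output file: HelloWorld.o${job_id}"
```

The captured ID can then be used to build the output file name once the job has finished, for example cat HelloWorld.o${job_id}.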
Gromacs
#!/bin/bash
#$ -S /bin/bash
#
# Request the maximum runtime for the job
#$ -l h_rt=2:00:00,s_rt=1:55:00
#
# Request the maximum memory needed for each slot / processor core
#$ -l vf=256M
#
# Send mail only when the job ends
#$ -m e
#
# Execute from the current working directory
#$ -cwd
#
#$ -j y
#
# Job Name and email
#$ -N G-4CPU-intel
#$ -M YOUR_EMAIL_ADDRESS
#
# Use OpenMPI parallel environment and 4 slots
#$ -pe openmpi 4
#
# Load the appropriate module(s)
. /etc/profile.d/modules.sh
module load gromacs/gromacs-4-intel
#
#$ -V
#
# Single precision
MDRUN=mdrun_mpi

# The $NSLOTS variable is set automatically by SGE to match the number of
# slots requested
export MYFILE=production-Npt-323K_${NSLOTS}CPU

cd ~/jobs/gromacs

mpirun -np $NSLOTS $MDRUN -v -np $NSLOTS -s $MYFILE -o $MYFILE -c $MYFILE -x $MYFILE -e $MYFILE -g ${MYFILE}.log
R
If you are using LAM MPI for parallel jobs, you must add the following to your ~/.bashrc or ~/.cshrc file.
module load lammpi/lam-7.1-gnu
The following is an example job script that uses an array of 1000 tasks (-t 1-1000). Each task has a maximum runtime of 2 hours and will use no more than 256 MB of RAM (h_rt=2:00:00,vf=256M).
The array is also throttled to run only 32 concurrent tasks at any time (-tc 32). This feature is not available on coosa.
More R examples are available here: Running R Jobs on a Rocks Cluster
Create a working directory and the job submission script
$ mkdir -p ~/jobs/ArrayExample
$ cd ~/jobs/ArrayExample
$ vi R-example-array-job.qsub
#!/bin/bash
#$ -S /bin/bash
#
# Request the maximum runtime for the job
#$ -l h_rt=2:00:00,s_rt=1:55:00
#
# Request the maximum memory needed for each slot / processor core
#$ -l vf=256M
#
#$ -M YOUR_EMAIL_ADDRESS
# Email me only when tasks abort, use '#$ -m n' to disable all email for this job
#$ -m a
#$ -cwd
#$ -j y
#
# Job Name
#$ -N ArrayExample
#
#$ -t 1-1000
#$ -tc 32
#
# Per-task output files, written to the same rep<N> directory the task runs in.
# These directories must already exist when the job is submitted.
#$ -e $HOME/jobs/ArrayExample/rep$TASK_ID/$JOB_NAME.e$JOB_ID.$TASK_ID
#$ -o $HOME/jobs/ArrayExample/rep$TASK_ID/$JOB_NAME.o$JOB_ID.$TASK_ID

. /etc/profile.d/modules.sh
module load R/R-2.9.0

#$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD

cd ~/jobs/ArrayExample/rep$SGE_TASK_ID
R CMD BATCH rscript.R
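The job script above changes into ~/jobs/ArrayExample/rep$SGE_TASK_ID before running R, so one directory per task has to exist before the job is submitted. The following is a minimal sketch of creating them; the function name make_task_dirs is hypothetical, and it only creates the directories. Placing rscript.R and any input data in each directory is left to you.

```shell
#!/bin/bash
# Hypothetical helper: create rep1 .. repN under BASE so that every
# array task has a directory to run in.
# Usage: make_task_dirs BASE COUNT
make_task_dirs() {
    local base="$1"
    local count="$2"
    local i=1

    while [ "$i" -le "$count" ]; do
        mkdir -p "$base/rep$i" || return 1
        i=$((i + 1))
    done
}

# Example matching the array size used above (commented out):
# make_task_dirs "$HOME/jobs/ArrayExample" 1000
```

The count passed to the helper should match the upper bound given to -t in the job script.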
Submit the job to the Grid Engine and check the status of the job using the qstat command
$ qsub R-example-array-job.qsub
$ qstat