Cheaha GettingStarted deprecated

Revision as of 15:26, 15 July 2011 by Jpr@uab.edu (talk | contribs) (Initial import from ME wiki. Taken from edit 17:17, 11 Nov 2010 by Mhanby. http://me.eng.uab.edu/wiki/index.php?title=Cheaha)


Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/


Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.


As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.

Thank you,

The Research Computing Team

Information about the history and future plans for Cheaha is available on the Cheaha UABGrid Documentation page.

Access

To request an account on Cheaha, please submit an authorization request to the IT Research Computing staff.

Usage of Cheaha is governed by UAB's Acceptable Use Policy (AUP) for computer resources.

The official DNS name of Cheaha's frontend machine is cheaha.uabgrid.uab.edu. If you want to refer to the machine as cheaha, you'll have to either add

search uabgrid.uab.edu

to /etc/resolv.conf (you'll need administrator access to edit this file), or add

Host cheaha
 Hostname cheaha.uabgrid.uab.edu

to your ~/.ssh/config file.
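The SSH alias above can also be added from a script. The following is a minimal sketch, assuming a bash shell; the function name add_cheaha_alias and its optional path argument are illustrative and not part of any Cheaha tooling. It appends the Host block only if no "Host cheaha" entry already exists.

```shell
#!/bin/bash
# Append the "Host cheaha" alias to an SSH client config file, once.
# The function name and the optional path argument are illustrative only.
add_cheaha_alias() {
  local cfg="${1:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$cfg")"
  touch "$cfg"
  # Skip the append if an alias with this name is already present.
  if grep -qE '^[[:space:]]*Host[[:space:]]+cheaha[[:space:]]*$' "$cfg"; then
    return 0
  fi
  printf 'Host cheaha\n Hostname cheaha.uabgrid.uab.edu\n' >> "$cfg"
  # SSH expects the client config to be writable only by its owner.
  chmod 600 "$cfg"
}
```

Calling add_cheaha_alias with no argument edits ~/.ssh/config, after which "ssh cheaha" should resolve to the full host name.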

Hardware

Cheaha.uabgrid is made up of a head node (Dell PowerEdge 2950) with 16GB of RAM and two quad core Intel Xeon 3GHz processors and 24 compute nodes.

The compute nodes are Dell m600 blades that also have 16GB of RAM and two quad core Intel Xeon 3GHz processors providing a total of 192 3GHz cores for computing.

Cheaha also has the older Verari compute nodes from cheaha.ac.uab.edu attached. These compute nodes contain two Opteron 242s each with 2 GB of RAM.

Cheaha has a 40 TB Lustre high performance file system attached via Infiniband (new compute nodes) and GigE (older nodes) for use as a scratch file system.

Cluster Software

  • Rocks 5.3
  • CentOS 5.4 x86_64
  • Grid Engine 6.2u4
  • Globus 4.0.8
  • Gridway 5.4.0

Storage

Home directories

Your home directory on Cheaha is NFS-mounted to the compute nodes as /home/$USER or $HOME. It is acceptable to use your home directory as a location to store job scripts, custom code, and libraries.

The home directory must not be used to store large amounts of data.

Scratch

Research Computing policy requires that all bulky input and output must be located on the scratch space. The home directory is intended to store your job scripts, log files, libraries and other supporting files.

Cheaha has two types of scratch space, network mounted and local.

  • Network scratch ($UABGRID_SCRATCH) is available on the head node and each compute node. This storage is a Lustre high performance file system providing roughly 40TB of storage. This should be your job's primary working directory, unless the job would benefit from local scratch (see below).
  • Local scratch is physically located on each compute node and is not accessible to the other nodes (including the head node). This space is useful if the job performs a lot of file I/O. Most of the jobs that run on our clusters do not fall into this category. Because local scratch is inaccessible outside the job, you must move any data between local scratch and your network-accessible scratch within the job itself. For example, step 1 of the job could copy the input from $UABGRID_SCRATCH to /scratch/$USER, step 2 could run the code, and step 3 could move the results back to $UABGRID_SCRATCH.

Important Information:

  • Scratch space (network and local) is not backed up.
  • Research Computing expects each user to keep their scratch areas clean. The clusters are not to be used for archiving data.
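One way to find candidates for cleanup is to list files that have not been modified recently. This is a sketch only: list_stale_files is a hypothetical helper, and the 30-day default is an assumption rather than a Research Computing policy.

```shell
#!/bin/bash
# List regular files under a directory that have not been modified in N days.
# list_stale_files is a hypothetical helper, not a cluster-provided command.
list_stale_files() {
  local dir="$1"
  local days="${2:-30}"   # assumed default; pick a threshold that suits your data
  find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_stale_files "$UABGRID_SCRATCH" 30 prints the files to review before deleting or moving them elsewhere.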

Network Scratch

Network scratch is available using the environment variable $UABGRID_SCRATCH or directly by /lustre/scratch/$USER

It is advisable to use the environment variable whenever possible rather than the hard coded path.
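A script can prefer the environment variable and still work if it happens to be unset. The sketch below assumes bash; network_scratch is an illustrative name, and the fallback is the direct path quoted above.

```shell
#!/bin/bash
# Print the network scratch directory, preferring the environment variable.
# network_scratch is an illustrative name; the fallback is the path quoted above.
network_scratch() {
  local user="${USER:-$(id -un)}"
  printf '%s\n' "${UABGRID_SCRATCH:-/lustre/scratch/$user}"
}
```

A job script could then use workdir="$(network_scratch)" instead of spelling out the path.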

Local Scratch

Each compute node has a local scratch directory, /scratch. If your job performs a lot of file I/O, it may run faster (and possibly more stably) using /scratch/$USER than reading and writing on your network-mounted scratch directory. Approximately 40GB of scratch space is available on each compute node.

The following is a typical sequence of events within a job script using local scratch:

  1. Create a per-job directory under /scratch/$USER
    mkdir -p /scratch/$USER/$JOB_ID
  2. Copy the data from $UABGRID_SCRATCH to the job directory
    cp -a $UABGRID_SCRATCH/GeneData /scratch/$USER/$JOB_ID/
  3. Run the application
    geneapp -S 1 -D 10 < /scratch/$USER/$JOB_ID/GeneData > /scratch/$USER/$JOB_ID/geneapp.out
  4. Delete anything that you don't want to move back to network scratch (for example the copy of the input data)
    rm -rf /scratch/$USER/$JOB_ID/GeneData
  5. Move the data that you want to keep from local to network scratch
    mv /scratch/$USER/$JOB_ID $UABGRID_SCRATCH/

The following is an example of what the code might look like in a job script (remember if the job is an array job, you may need to use /scratch/$USER/$JOB_ID/$SGE_TASK_ID to prevent multiple tasks from overwriting each other):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#
#$ -N scratch_example
#$ -pe smp 1
#$ -l h_rt=00:20:00,s_rt=00:18:00,vf=2G
#$ -j y
#
#$ -M YOUR_EMAIL_ADDRESS
#$ -m eas
#
module load R/R-2.9.0

if [ ! -d /scratch/$USER/$JOB_ID ]; then 
  mkdir -p /scratch/$USER/$JOB_ID
  chmod 700 /scratch/$USER
fi

cp -a $UABGRID_SCRATCH/GeneData /scratch/$USER/$JOB_ID/

$HOME/bin/geneapp -S 1 -D 10 < /scratch/$USER/$JOB_ID/GeneData > /scratch/$USER/$JOB_ID/geneapp.out

rm -rf /scratch/$USER/$JOB_ID/GeneData
mv /scratch/$USER/$JOB_ID $UABGRID_SCRATCH/

By default, the $USER environment variable contains your login ID and the grid engine will populate $JOB_ID and $SGE_TASK_ID variables with the correct job and task IDs.

The mkdir command creates the full directory path (the -p switch is important). The chmod ensures that other users are not able to view files under this directory.

Please make sure to clean up the scratch space. This space is not to be used as a long term storage device for data and is subject to being erased without notice if the file systems fill up.
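For array jobs, the per-task directory can be computed in one place. The following is a sketch under stated assumptions: job_scratch_dir and the SCRATCH_ROOT override are hypothetical names introduced here so the function can be tried off-cluster, and the check for the value "undefined" reflects how Grid Engine is understood to set SGE_TASK_ID for non-array jobs.

```shell
#!/bin/bash
# Build a per-job (and, for array jobs, per-task) local scratch path.
# job_scratch_dir and SCRATCH_ROOT are hypothetical names for illustration.
job_scratch_dir() {
  local root="${SCRATCH_ROOT:-/scratch}"
  local user="${USER:-$(id -un)}"
  local dir="$root/$user/${JOB_ID:-manual}"
  # Assumption: Grid Engine sets SGE_TASK_ID to "undefined" for non-array jobs.
  if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    dir="$dir/$SGE_TASK_ID"
  fi
  printf '%s\n' "$dir"
}
```

In a job script, workdir="$(job_scratch_dir)" followed by mkdir -p "$workdir" gives each task its own directory, so concurrent tasks do not overwrite each other's files.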

Project Storage

Cheaha has a location where shared data can be stored called $UABGRID_PROJECT

This is helpful if a team of researchers must access the same data. Please open a help desk ticket to request a project directory under $UABGRID_PROJECT.

Environment Modules

Environment Modules is installed on Cheaha and should be used when constructing your job scripts if an applicable module file exists. Using the module command, you can easily configure your environment for specific software packages without having to know the specific environment variables and values to set. Modules lets you configure your environment dynamically, without having to log out and log back in for the changes to take effect.

If you find that specific software does not have a module, please submit a helpdesk ticket to request the module.

Note: If you are using LAM MPI for parallel jobs, you must load the LAM module in both your job script and your profile. For example, assume we want to use LAM-MPI compiled for GNU:

  • for BASH users add this to your ~/.bashrc and your job script, or for CSH users add this to your ~/.cshrc and your job script
module load lammpi/lam-7.1-gnu
  • Cheaha supports bash completion for the module command. For example, type 'module' and press the TAB key twice to see a list of options:
module TAB TAB

add          display      initlist     keyword      refresh      switch       use          
apropos      help         initprepend  list         rm           unload       whatis       
avail        initadd      initrm       load         show         unuse        
clear        initclear    initswitch   purge        swap         update
  • To see the list of available modulefiles on the cluster, run the module avail command (note the example list below may not be complete!) or module load followed by two tab key presses:
module avail
 
R/R-2.11.1                      cufflinks/cufflinks-0.9         intel/intel-compilers           mvapich-intel                   rna_pipeline/rna_pipeline-0.31
R/R-2.6.2                       eigenstrat/eigenstrat           jags/jags-1.0-gnu               mvapich2-gnu                    rna_pipeline/rna_pipeline-0.5.0
R/R-2.7.2                       eigenstrat/eigenstrat-2.0       lammpi/lam-7.1-gnu              namd/namd-2.6                   s.a.g.e./sage-6.0
R/R-2.8.1                       ent/ent-1.0.2                   lammpi/lam-7.1-intel            namd/namd-2.7                   samtools/samtools
R/R-2.9.0                       fastphase/fastphase-1.4         mach/mach                       openmpi/openmpi-1.2-gnu         samtools/samtools-0.1
R/R-2.9.2                       fftw/fftw3-gnu                  macs/macs                       openmpi/openmpi-1.2-intel       shrimp/shrimp-1.2
RAxML/RAxML-7.2.6               fftw/fftw3-intel                macs/macs-1.3.6                 openmpi/openmpi-gnu             shrimp/shrimp-1.3
VEGAS/VEGAS-0.8                 freesurfer/freesurfer-4.5       maq/maq-0.7                     openmpi/openmpi-intel           spparks/spparks
amber/amber-10.0-intel          fregene/fregene-2008            marthlab/gigabayes              openmpi-gnu                     structure/structure-2.2
amber/amber-11-intel            fsl/fsl-4.1.6                   marthlab/mosaik                 openmpi-intel                   tau/tau
apbs/apbs-1.0                   genn/genn                       marthlab/pyrobayes              paraview/paraview-3.4           tau/tau-2.18.2p2
atlas/atlas                     gromacs/gromacs-4-gnu           mathworks/R2009a                paraview/paraview-3.6           tau/tau-lam-intel
birdsuite/birdsuite-1.5.3       gromacs/gromacs-4-intel         mathworks/R2009b                pdt/pdt                         tophat/tophat
birdsuite/birdsuite-1.5.5       hapgen/hapgen                   mathworks/R2010a                pdt/pdt-3.14                    tophat/tophat-1.0.8
bowtie/bowtie                   hapgen/hapgen-1.3.0             mpich/mpich-1.2-gnu             phase/phase                     tophat/tophat-1.1
bowtie/bowtie-0.10              haskell/ghc                     mpich/mpich-1.2-intel           plink/plink                     vmd/vmd
bowtie/bowtie-0.12              illuminus/illuminus             mpich/mpich2-gnu                plink/plink-1.05                vmd/vmd-1.8.6
bowtie/bowtie-0.9               impute/impute                   mrbayes/mrbayes-gnu             plink/plink-1.06
chase                           impute/impute-2.0.3             mrbayes/mrbayes-intel           plink/plink-1.07
cufflinks/cufflinks             impute/impute-2.1.0             mvapich-gnu                     python/python-2.6

Some software packages have multiple module files, for example:

  • plink/plink
  • plink/plink-1.05
  • plink/plink-1.06

In this case, the plink/plink module will always load the latest version, so loading this module is equivalent to loading plink/plink-1.06. If you always want to use the latest version, use this approach. If you want to use a specific version, use the module file containing the appropriate version number.

Some modules, when loaded, will actually load other modules. For example, the gromacs/gromacs-4-intel module will also load openmpi/openmpi-intel and fftw/fftw3-intel.

  • To load a module (for example, for a Gromacs job), use the following module load command in your job script:
module load gromacs/gromacs-4-intel
  • To see a list of the modules that you currently have loaded use the module list command
module list
 
Currently Loaded Modulefiles:
 1) fftw/fftw3-intel            2) openmpi/openmpi-intel   3) gromacs/gromacs-4-intel
  • A module can be removed from your environment by using the module unload command:
module unload gromacs/gromacs-4-intel

module list

No Modulefiles Currently Loaded.
  • The definition of a module can also be viewed using the module show command, revealing what a specific module will do to your environment:
module show gromacs/gromacs-4-intel

-------------------------------------------------------------------
/etc/modulefiles/gromacs/gromacs-4-intel:

module-whatis	 Sets up gromacs-intel v4.0.2 in your enviornment 
module		 load fftw/fftw3-intel 
module		 load openmpi/openmpi-intel 
prepend-path	 PATH /opt/uabeng/gromacs/intel/4/bin/ 
prepend-path	 LD_LIBRARY_PATH /opt/uabeng/gromacs/intel/4/lib 
prepend-path	 MANPATH /opt/uabeng/gromacs/intel/4/man 
-------------------------------------------------------------------
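Job scripts sometimes run in shells where the module command has not been initialized. The sketch below shows one defensive pattern, assuming bash; try_module_load is an illustrative wrapper and not part of Environment Modules, and /etc/profile.d/modules.sh is the init script sourced by the job scripts on this page.

```shell
#!/bin/bash
# Load a module only when the module command is usable in this shell.
# try_module_load is an illustrative wrapper, not part of Environment Modules.
try_module_load() {
  local name="$1"
  # Non-interactive shells may not have the module function defined yet.
  if ! type module >/dev/null 2>&1 && [ -r /etc/profile.d/modules.sh ]; then
    . /etc/profile.d/modules.sh
  fi
  if ! type module >/dev/null 2>&1; then
    echo "module command not available; cannot load $name" >&2
    return 1
  fi
  module load "$name"
}
```

A job script could call try_module_load gromacs/gromacs-4-intel || exit 1 so that the job stops early instead of failing later with a missing binary.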

Installed software

We try to install local software in /opt, /opt/uabeng and /share/apps. However, please do not depend on a particular piece of software being in a specific directory, as we may need to move things around at some point.

In most cases, the description of each software package was copied from the author's web site and represents their own work.

If you don't find a particular package listed on this page, please open a help desk ticket to request the software.

If a module file is available for the software, it is recommended to use the module file in your job script and/or shell profile.

Each entry below lists the software name (linked to its home page), version, installation directory, and information.
Amber 10 /opt/uabeng/amber10/intel "Amber" refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programs which includes source code and demos.

Amber is compiled using Intel compilers and uses OpenMPI for the parallel binaries.

The following Modules files should be loaded for this package (the amber module will automatically load the openmpi module):

For Intel:

module load amber/amber-10-intel

Use the openmpi parallel environment in your job script (example for a 4 slot job)

#$ -pe openmpi 4
APBS 1.0.0 /share/apps/apbs/apbs-1.0.0-amd64 APBS - Adaptive Poisson-Boltzmann Solver APBS is a software package for the numerical solution of the Poisson-Boltzmann equation (PBE), one of the most popular continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media.

Submit APBS jobs via the Grid Engine and do not run them on the head node!

module load apbs/apbs-1.0 
Atlas 3.8.3 /usr/lib64/atlas The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK.

module load atlas/atlas 
Biopython 1.51 Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.

The Biopython packages along with its dependencies (Numpy, python-reportlab, Flex, etc...) are all installed in the default location for Python site-packages, so you should not need to modify any environment variables to use this package.

Birdsuite 1.5.3 /share/apps/birdsuite/1.5.3 The Birdsuite is a fully open-source set of tools to detect and report SNP genotypes, common Copy-Number Polymorphisms (CNPs), and novel, rare, or de novo CNVs in samples processed with the Affymetrix platform. While most of the components of the suite can be run individually (for instance, to only do SNP genotyping), the Birdsuite is especially intended for integrated analysis of SNPs and CNVs. Support for chips and platforms other than the Affymetrix SNP 6.0 is currently limited, but we are currently working on creating the supporting files for other common genotyping platforms.

An example job submission script can be found here (copy this to your job directory and make sure to edit the email address!)

/share/apps/example-scripts/birdsuite-job.qsub

The following Modules files should be loaded for this package:

module load birdsuite/birdsuite-1.5
boost 1.33.1 /usr/lib

/usr/lib64

Boost provides free peer-reviewed portable C++ source libraries.

The Boost team emphasize libraries that work well with the C++ Standard Library. Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications. The Boost license encourages both commercial and non-commercial use.

Both 32bit and 64bit versions of Boost C++ libraries are provided under /usr/lib and /usr/lib64

Bowtie 0.10.1 /share/apps/bowtie/bowtie-0.10.1 Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). It supports alignment policies equivalent to Maq and SOAP but is substantially faster.

A Bowtie tutorial is available here: http://bowtie-bio.sourceforge.net/tutorial.shtml

The following Modules files should be loaded for this package:

module load bowtie/bowtie-0.10
eigenstrat 3.0 /share/apps/eigenstrat EIGENSTRAT also provides a decent FAQ on their website, click here.

"The EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification method (Price et al. 2006). The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation; the resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes."

The following Modules file should be loaded for this package:

module load eigenstrat/eigenstrat
fastPHASE 1.4.0 /share/apps/fastPHASE/1.4 The program fastPHASE implements methods for estimating haplotypes and missing genotypes from population SNP genotype data.

The following Modules files should be loaded for this package:

module load fastphase/fastphase-1.4
FFTW 3.1.2 /opt/uabeng/fftw3/gnu

/opt/uabeng/fftw3/intel

FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

The following Modules files should be loaded for this package:

For GNU:

module load fftw/fftw3-gnu

For Intel:

module load fftw/fftw3-intel
Gromacs 4.0.5 /opt/uabeng/gromacs/gnu/4

/opt/uabeng/gromacs/intel/4

GROMACS is a versatile package to perform molecular dynamics and is primarily designed for biochemical molecules like proteins and lipids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.

Gromacs is compiled using Intel and GNU compilers using FFTW3, BLAS, and LAPACK and OpenMPI for the parallel binaries. Single and double precision binaries are included (double precision binaries have a _d suffix).

The following Modules files should be loaded for this package (module will automatically load any prerequisite modules):

For GNU:

module load gromacs/gromacs-4-gnu

For Intel:

module load gromacs/gromacs-4-intel

Use the openmpi parallel environment in your job script (example for a 4 slot job)

#$ -pe openmpi 4
GSL 1.10 /usr/lib

/usr/lib64

/usr/include/gsl

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

HAPGEN 1.3.0 /share/apps/hapgen/1.3.0 HAPGEN is a program that simulates case control datasets at SNP markers and can output data in the FILE FORMAT used by IMPUTE, SNPTEST and GTOOL. The approach can handle markers in LD and can simulate datasets over large regions such as whole chromosomes. Hapgen simulates haplotypes by conditioning on a set of population haplotypes and an estimate of the fine-scale recombination rate across the region.

Command line syntax example for HAPGEN can be found by clicking here.

The HAPGEN environment module can be loaded as follows

module load hapgen/hapgen
IMPUTE v2 2.0.3 /share/apps/impute/2.0.3 IMPUTE v2 is a new genotype imputation algorithm based on ideas described in Howie et al. (2009).

For the specific version

module load impute/impute-2.0.3

Or to use the latest

module load impute/impute

Examples for IMPUTE v2 are provided in the $IMPUTEHOME/Example directory.

JAGS 1.0.3 /share/apps/jags/jags-1.0.3/gnu JAGS (Just Another Gibbs Sampler) is a Bayesian hierarchical model analysis program using Markov Chain Monte Carlo (MCMC) simulation. It is similar to BUGS but will compile on Linux systems.

Click here for a good description of JAGS and how it differs from BUGS.

The JAGS environment module can be loaded as follows

module load jags/jags-1.0-gnu


Java JDK 1.5.0_10 /usr/java/jdk1.5.0_10 JDK (Java Developers Kit) and Runtime from Sun
JRE 1.6.0_04 /usr/java/jre1.6.0_04 Java Runtime
Intel 10.1.015 /opt/intel/cce

/opt/intel/fce

/opt/intel/mkl

Intel C, C++ and Fortran compilers along with the Intel Math Kernel Libraries

The following Modules file should be loaded for this package:

module load intel/intel-compilers-10.1
LAM-MPI 7.1.4 /opt/uabeng/lam/gnu

/opt/uabeng/lam/intel

LAM/MPI is now in a maintenance mode. Bug fixes and critical patches are still being applied, but little real "new" work is happening in LAM/MPI. This is a direct result of the LAM/MPI Team spending the vast majority of their time working on our next-generation MPI implementation -- Open MPI.

Although LAM is not going to go away any time soon (we certainly would not abandon our user base!) -- the web pages, user lists, and all the other resources will continue to be available indefinitely -- we would encourage all users to try migrating to Open MPI. Since it's an MPI implementation, you should be able to simply recompile and re-link your applications to Open MPI -- they should "just work." Open MPI contains many features and performance enhancements that are not available in LAM/MPI.


The following Modules files should be loaded for this package (for LAM, you must load this module in your profile script and your job script):

For GNU:

module load lammpi/lam-7.1-gnu

For Intel:

module load lammpi/lam-7.1-intel

In order to use LAM-MPI you must load the module in your ~/.bashrc script along with your job submit script. Add the following to your ~/.bashrc (replace -intel with -gnu if using GNU):

For Bash Users edit ~/.bashrc:

module load lammpi/lam-7.1-intel

For Csh Users edit ~/.cshrc:

module load lammpi/lam-7.1-intel

Use the lam_loose_rsh parallel environment in your job script (example for a 4 slot job)

#$ -pe lam_loose_rsh 4
MACS 1.3.6 /share/apps/macs/1.3.6 Next generation parallel sequencing technologies made chromatin immunoprecipitation followed by sequencing (ChIP-Seq) a popular strategy to study genome-wide protein-DNA interactions, while creating challenges for analysis algorithms. We present Model-based Analysis of ChIP-Seq (MACS) on short reads sequencers such as Genome Analyzer (Illumina / Solexa). MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, is publicly available open source, and can be used for ChIP-Seq with or without control samples.

To load MACS into your environment, use the following module command:

module load macs/macs
Maq 0.7.1 /share/apps/maq/0.7.1 Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. It is particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data.

See the Maq documentation page for usage: http://maq.sourceforge.net/maq-man.shtml

The following Modules files should be loaded for this package:

module load maq/maq-0.7
MPICH 1.2.7p1 /opt/mpich/gnu

/opt/mpich/intel

GNU and Intel compiled versions of MPICH are installed under this directory

The following Modules file should be loaded to use mpich

  • GNU version of mpich
module load mpich/mpich-1.2-gnu
  • Intel version of mpich
module load mpich/mpich-1.2-intel

Use the mpich parallel environment in your job script (example for a 4 slot job)

#$ -pe mpich 4
NAMD 2.6 /share/apps/namd/2.6 NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet.

The following Modules files should be loaded for this package:

module load namd/namd-2.6
OpenMPI 1.3.3 /opt/uabeng/openmpi/gnu

/opt/uabeng/openmpi/intel

The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.

The following Modules files should be loaded for this package:

For GNU:

module load openmpi/openmpi-gnu

For Intel:

module load openmpi/openmpi-intel

Use the openmpi parallel environment in your job script (example for a 4 slot job)

#$ -pe openmpi 4

To enable verbose Grid Engine logging for OpenMPI, add --mca pls_gridengine_verbose 1 to the mpirun command in the job script, for example:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#
#$ -N j_openmpi_hello
#$ -pe openmpi 4
#$ -l h_rt=00:20:00,s_rt=0:18:00
#$ -j y
#
#$ -M USERID@uab.edu
#$ -m eas
#
# Load the appropriate module files
. /etc/profile.d/modules.sh
module load openmpi/openmpi-gnu

#$ -V

mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hello_world_gnu_openmpi
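Because the parallel environment name and slot count are the main things that change between MPI job scripts on this page, a small generator can keep them consistent. This is a sketch, not a site-provided tool: write_pe_job and its four arguments are hypothetical, the run time limits are copied from the example above, and the mpirun line follows the OpenMPI example and may need adjusting for other MPI implementations.

```shell
#!/bin/bash
# Print a minimal Grid Engine job script for a parallel environment.
# write_pe_job and its arguments are illustrative; adjust limits to your job.
write_pe_job() {
  local pe="$1" slots="$2" module_name="$3" program="$4"
  # Unquoted heredoc: \$ keeps a literal dollar sign in the generated script.
  cat <<EOF
#!/bin/bash
#\$ -S /bin/bash
#\$ -cwd
#\$ -pe $pe $slots
#\$ -l h_rt=00:20:00,s_rt=00:18:00
#\$ -j y
. /etc/profile.d/modules.sh
module load $module_name
mpirun -np \$NSLOTS $program
EOF
}
```

For example, write_pe_job openmpi 4 openmpi/openmpi-gnu ./hello_world_gnu_openmpi > hello.qsub writes a script that can be reviewed and then submitted.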

PHASE 2.1.1 /share/apps/PHASE/2.1.1 PHASE is software for haplotype reconstruction, and recombination rate estimation from population data. The software implements methods for estimating haplotypes from population genotype data described in:
  • Stephens, M., and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162-1169.
  • Stephens, M., Smith, N., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, 978--989.
  • Stephens, M., and Scheet, P. (2005). Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation. American Journal of Human Genetics, 76:449-462.


The software also incorporates methods for estimating recombination rates, and identifying recombination hotspots:

  • Crawford et al (2004). Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics


Documentation on the usage of PHASE can be downloaded here.

The following Modules files should be loaded for this package:

module load phase/phase
PLINK 1.06 /share/apps/plink/1.06 PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analysis in a computationally efficient manner.

The PLINK web site also has a tutorial section that users should read through.

Please see this page for PLINK citing instructions.

To load PLINK into your environment, use the following module command:

module load plink/plink

The following commands are available

  • plink - The plink executable is the primary binary for this software. Click here for the command line reference.
  • gplink - This is a java based GUI for PLINK that provides the following functionality:
    • is a GUI that allows construction of many common PLINK operations
    • provides a simple project management tool and analysis log
    • allows for data and computation to be on a separate server (via SSH)
    • facilitates integration with Haploview

Running gplink: You should NOT run gplink from the cheaha login node (head node), only from the compute nodes using the qrsh command. The qrsh command will provide a shell on a compute node complete with X forwarding. For example:

[jsmith@cheaha ~]$ qrsh

Rocks Compute Node
Rocks 5.1 (V.I)
Profile built 13:06 21-Nov-2008

Kickstarted 13:13 21-Nov-2008

[jsmith@compute-0-10 ~]$ module load plink/plink

[jsmith@compute-0-10 ~]$ gplink

You should see the gPLINK window open. If you get an error similar to "No X11 DISPLAY variable was set", make sure your initial connection to Cheaha had X forwarding enabled.
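A quick check before launching a graphical program can make the X forwarding problem easier to diagnose. The sketch below assumes bash; require_display is an illustrative helper name, and ssh -X is the standard OpenSSH option for enabling X11 forwarding.

```shell
#!/bin/bash
# Fail early with a hint when no X11 display is available.
# require_display is an illustrative helper name.
require_display() {
  if [ -z "${DISPLAY:-}" ]; then
    echo "DISPLAY is not set; reconnect with X forwarding (for example: ssh -X cheaha.uabgrid.uab.edu)" >&2
    return 1
  fi
}
```

On a compute node, require_display && gplink starts gPLINK only when a display has been forwarded.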

If you want to use the PLINK R plugin functionality, please see this page http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml for instructions. You'll need to install the Rserve package to use the plugin, for example:

install.packages("Rserve")
pvm 3.4.5 /usr/bin/pvm PVM3 (Parallel Virtual Machine) is a library and daemon that allows distributed processing environments to be constructed on heterogeneous machines and architectures.

R 2.7.2, 2.8.1, 2.9.0, 2.9.2, 2.11.1 /share/apps/R/2.7.2/gnu

/share/apps/R/2.8.1/gnu

/share/apps/R/2.9.0/gnu

/share/apps/R/2.9.2/gnu

/share/apps/R/2.11.1/gnu

R is a free software environment for statistical computing and graphics. For additional instructions on running R on Cheaha, please refer to the following page: Running R Jobs on a Rocks Cluster.

The following Modules files should be loaded for this package:

module load R/R-2.7.2

For other versions, simply replace the version number

module load R/R-2.11.1

The following libraries are available; additional libraries should be installed by the user under ~/R_exlibs:

  • /share/apps/R/R-X.X.X/gnu/lib/R/library
    • The default libraries that come with R
    • Rmpi
    • Snow
  • /share/apps/R/R-X.X.X/gnu/lib/R/bioc
    • BioConductor libraries (default package set using getBioC)

Sample R Grid Engine Job Script

This is an example of a serial (i.e., non-parallel) R job with a 2-hour run time limit that requests 256M of RAM:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#
#$ -j y
#$ -N rtestjob
# Use '#$ -m n' instead to disable all email for this job
#$ -m eas
#$ -M YOUR_EMAIL_ADDRESS
#$ -l h_rt=2:00:00,s_rt=1:55:00
#$ -l vf=256M
. /etc/profile.d/modules.sh
module load R/R-2.7.2

#$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD

R CMD BATCH rscript.R
s.a.g.e. 6.0.0 /share/apps/s.a.g.e./SAGE_6.0.0_Linux64 S.A.G.E. - Statistical Analysis for Genetic Epidemiology contains programs for use in the genetic analysis of family, pedigree and individual data.

Note: This software is NOT the same as the SAGE listed below!

Make sure that every publication which presents results from using
S.A.G.E. carries an appropriate acknowledgement such as: 

'(Some of) The results of this paper were obtained by using the program
package S.A.G.E., which is supported by a U.S. Public Health Service
Resource Grant (1 P41 RR03655) from the National Center for Research
Resources'  - (it is important that the grant numbers appear under
'acknowledgments'). 

Send bibliographic information about every paper in which S.A.G.E. is
used (author(s), title, journal, volume and page numbers; a reprint will
do provided it has the necessary information on it) to: 

R.C. Elston 
Department of Epidemiology and Biostatistics 
Case Western Reserve University 
Wolstein Research Building 
2103 Cornell Road 
Cleveland, Ohio  44106-7281 

The recommended way of referencing the S.A.G.E. programs is as follows: 

S.A.G.E. [2009]. Statistical Analysis for Genetic Epidemiology 6.0 
Computer program package available from the Department of Epidemiology and 
Biostatistics, Case Western Reserve University, Cleveland. 

Demo data files are available under /share/apps/s.a.g.e./SAGE_6.0.0_Linux64/demo/data_files

To load the S.A.G.E. environment, use

module load s.a.g.e./sage-6.0
SHRiMP 1.3.2 /share/apps/shrimp/SHRiMP_1_3_2 SHRiMP is a software package for aligning genomic reads against a target genome. It was primarily developed with the multitudinous short reads of next generation sequencing machines in mind, as well as Applied Biosystem's colourspace genomic representation.

The following Modules files should be loaded for this package:

module load shrimp/shrimp-1.3
SPRNG 2.0a /share/apps/sprng/2.0a Scalable Parallel Pseudo Random Number Generators Library

The following Modules files should be loaded for this package:

module load sprng/sprng-2
Subversion 1.4.2 /usr/bin/svn Subversion is a concurrent version control system which enables one or more users to collaborate in developing and maintaining a hierarchy of files and directories while keeping a history of all changes. Subversion only stores the differences between versions, instead of every complete file. Subversion is intended to be a compelling replacement for CVS.

STRAT 1.1 /share/apps/STRAT/1.1 STRAT is a companion program to structure. This is a structured association method, for use in association mapping, enabling valid case-control studies even in the presence of population structure.
Structure 2.2.2 /share/apps/structure/2.2.2 Structure is software for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly-used genetic markers, including SNPS, microsatellites, RFLPs and AFLPs.
TopHat 1.0.8 /share/apps/tophat/1.0.8 TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

TopHat is a collaborative effort between the University of Maryland Center for Bioinformatics and Computational Biology and the University of California, Berkeley Departments of Mathematics and Molecular and Cell Biology.

A TopHat tutorial is available here: http://tophat.cbcb.umd.edu/tutorial.html

The following Modules files should be loaded for this package (the tophat module also loads the bowtie module):

module load tophat/tophat
VMD 1.8.6 /share/apps/vmd/vmd-1.8.6 VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.

You'll need X forwarding and a local X server to launch VMD (for example, X-Win32 on a Windows machine).

The following Modules files should be loaded for this package:

module load vmd/vmd-1.8.6

Sample Job Scripts

The following are sample job scripts. Please edit them for your environment (i.e. replace YOUR_EMAIL_ADDRESS with your real email address), set h_rt to an appropriate runtime limit, and modify the job name and any other parameters.
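
A quick check before submitting can catch a script that was copied without editing. The sketch below assumes the job scripts follow the layout used on this page; check_job_script is a hypothetical helper name, not part of Grid Engine.

```shell
# check_job_script FILE
# Hypothetical helper: rejects a job script that still contains the
# YOUR_EMAIL_ADDRESS placeholder or does not request an h_rt limit.
check_job_script() {
    script="$1"
    if grep -q 'YOUR_EMAIL_ADDRESS' "$script"; then
        echo "$script: replace YOUR_EMAIL_ADDRESS with a real address" >&2
        return 1
    fi
    if ! grep -q '^#\$ .*h_rt=' "$script"; then
        echo "$script: no h_rt runtime limit requested" >&2
        return 1
    fi
    return 0
}
```

It could be combined with submission, for example: check_job_script myjob.qsub && qsub myjob.qsub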

Hello World

Hello World is the classic example used throughout programming. We don't want to buck the system, so we'll use it as well to demonstrate a simple parallel Grid Engine job script. This example also shows how to compile the code and submit the job script to Grid Engine.

  • First, create a directory for the Hello World jobs
$ mkdir -p ~/jobs/helloworld
$ cd ~/jobs/helloworld
  • Create the Hello World code written in C. This MPI-enabled Hello World includes a 3 minute sleep to ensure the job runs for several minutes; a normal Hello World example would finish in a matter of seconds.
$ vi helloworld-mpi.c
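#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <mpi.h>

int main(int argc, char **argv)
{
   int rank, size, len;
   char host[MPI_MAX_PROCESSOR_NAME];

   int i, j;
   float f;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Get_processor_name(host, &len);

   /* Report this process's rank, the total number of ranks and its host */
   printf("Hello world! I'm %d of %d on %s\n", rank, size, host);
   sleep(180);
   /* Busy loop so the job also uses some CPU time */
   for (j=0; j<=100000; j++)
      for(i=0; i<=100000; i++)
          f=i*2.718281828*i+i+i*3.141592654;

   MPI_Finalize();
   return 0;
}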
  • Compile the code. First purge any modules you may have loaded, then load the module for OpenMPI GNU. The mpicc command compiles the code and produces a binary named helloworld_gnu_openmpi
$ module purge
$ module load openmpi/openmpi-gnu

$ mpicc helloworld-mpi.c -o helloworld_gnu_openmpi
  • Create the Grid Engine job script that will request 8 cpu slots and a maximum runtime of 10 minutes
$ vi helloworld.qsub
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#
#$ -N HelloWorld
#$ -pe openmpi 8
#$ -l h_rt=00:10:00,s_rt=0:08:00
#$ -j y
#
#$ -M YOUR_EMAIL_ADDRESS
#$ -m eas
#
# Load the appropriate module files
. /etc/profile.d/modules.sh
module load openmpi/openmpi-gnu
#$ -V

# $NSLOTS is set by Grid Engine to the number of slots granted (8 here)
mpirun -np $NSLOTS ./helloworld_gnu_openmpi
  • Submit the job to Grid Engine and check the status using qstat
$ qsub helloworld.qsub

Your job 11613 ("HelloWorld") has been submitted

$ qstat -u $USER

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  11613 8.79717 HelloWorld jsmith       r     03/13/2009 09:24:35 all.q@compute-0-3.local            8        
  • When the job completes, you should have output files named HelloWorld.o* and HelloWorld.po* (replace the asterisk with the job ID, for example HelloWorld.o11613). The .o file holds the job's standard output and, because the script sets -j y, its standard error as well; the .po file holds output from the parallel environment's start and stop procedures.
$ cat HelloWorld.o11613

Hello world! I'm 0 of 8 on compute-0-3.local
Hello world! I'm 1 of 8 on compute-0-3.local
Hello world! I'm 4 of 8 on compute-0-3.local
Hello world! I'm 6 of 8 on compute-0-6.local
Hello world! I'm 5 of 8 on compute-0-3.local
Hello world! I'm 7 of 8 on compute-0-6.local
Hello world! I'm 2 of 8 on compute-0-3.local
Hello world! I'm 3 of 8 on compute-0-3.local
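
To see how the ranks were spread across compute nodes, the output file can be summarized per host. This is a minimal sketch that assumes the output lines look like the ones above, with the host name as the last field; ranks_per_host is a hypothetical helper name.

```shell
# ranks_per_host FILE
# Hypothetical helper: counts "Hello world!" lines per host name
# (the last field of each line) and prints "host count", sorted by host.
ranks_per_host() {
    awk '/^Hello world!/ { count[$NF]++ }
         END { for (host in count) print host, count[host] }' "$1" | sort
}
```

For the output shown above, ranks_per_host HelloWorld.o11613 reports 6 ranks on compute-0-3.local and 2 on compute-0-6.local.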

Gromacs

#!/bin/bash
#$ -S /bin/bash
#
# Request the maximum runtime for the job
#$ -l h_rt=2:00:00,s_rt=1:55:00
#
# Request the maximum memory needed for each slot / processor core
#$ -l vf=256M
#
# Send mail only when the job ends
#$ -m e
#
# Execute from the current working directory
#$ -cwd
#
#$ -j y
#
# Job Name and email
#$ -N G-4CPU-intel
#$ -M YOUR_EMAIL_ADDRESS
#
# Use OpenMPI parallel environment and 4 slots
#$ -pe openmpi 4
#
# Load the appropriate module(s)
. /etc/profile.d/modules.sh
module load gromacs/gromacs-4-intel
#
#$ -V
#
# Single precision
MDRUN=mdrun_mpi

# The $NSLOTS variable is set automatically by SGE to match the number of
# slots requested
export MYFILE=production-Npt-323K_${NSLOTS}CPU

cd ~/jobs/gromacs

mpirun -np $NSLOTS $MDRUN -v -np $NSLOTS -s $MYFILE -o $MYFILE -c $MYFILE -x $MYFILE -e $MYFILE -g ${MYFILE}.log

R

If you are using LAM MPI for parallel jobs, you must add the following line to your ~/.bashrc or ~/.cshrc file.

module load lammpi/lam-7.1-gnu

The following is an example job script that runs an array of 1000 tasks (-t 1-1000). Each task has a maximum runtime of 2 hours and uses no more than 256 MB of RAM (h_rt=2:00:00,vf=256M).

The array is also throttled to run at most 32 concurrent tasks at any time (-tc 32). This feature is not available on coosa.

More R examples are available here: Running R Jobs on a Rocks Cluster

Create a working directory and the job submission script

$ mkdir -p ~/jobs/ArrayExample
$ cd ~/jobs/ArrayExample
$ vi R-example-array-job.qsub
#!/bin/bash
#$ -S /bin/bash
#
# Request the maximum runtime for the job
#$ -l h_rt=2:00:00,s_rt=1:55:00
#
# Request the maximum memory needed for each slot / processor core
#$ -l vf=256M
#
#$ -M YOUR_EMAIL_ADDRESS
# Email me only when tasks abort, use '#$ -m n' to disable all email for this job
#$ -m a
#$ -cwd
#$ -j y
#
# Job Name
#$ -N ArrayExample
#
#$ -t 1-1000
#$ -tc 32
#
#$ -e $HOME/jobs/ArrayExample/rep$TASK_ID/$JOB_NAME.e$JOB_ID.$TASK_ID
#$ -o $HOME/jobs/ArrayExample/rep$TASK_ID/$JOB_NAME.o$JOB_ID.$TASK_ID

. /etc/profile.d/modules.sh
module load R/R-2.9.0

#$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD

cd ~/jobs/ArrayExample/rep$SGE_TASK_ID
R CMD BATCH rscript.R
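
Each task in the script above changes into its own rep directory and runs rscript.R there, so those directories need to exist before the job starts. The sketch below is one way to prepare them. It assumes every task runs a copy of the same R script, which may not match your workflow, and make_task_dirs is a hypothetical helper name.

```shell
# make_task_dirs BASE COUNT RSCRIPT
# Hypothetical helper: creates BASE/rep1 .. BASE/repCOUNT and copies the
# given R script into each directory as rscript.R.
make_task_dirs() {
    base="$1"
    count="$2"
    rscript="$3"
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "$base/rep$i" || return 1
        cp "$rscript" "$base/rep$i/rscript.R" || return 1
        i=$((i + 1))
    done
}
```

For the array above, this could be called as make_task_dirs ~/jobs/ArrayExample 1000 rscript.R before submitting.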

Submit the job to the Grid Engine and check the status of the job using the qstat command

$ qsub R-example-array-job.qsub
$ qstat