Galaxy DNA-Seq Tutorial

From UABgrid Documentation

Revision as of 15:56, 15 September 2011 by Ozborn@uab.edu (Talk | contribs)


Linking to data

Link in the Mark Pritchard Vaccinia virus data set.

  • Start with a blank history: there should be no numbered items in the history pane on the right-hand side. If there are, create a new history.
  • Select "Shared Data" from the top of the screen to bring up the Shared Data screen
  • Select Data Library
  • Select "Pritchard Vaccinia WR FastQ Files" from the alphabetically sorted list

DataLibraryVacWRFastQ.jpg


  • Click on the subfolder top box to select all 6 files
  • Select "import to current history"
  • Click on "Analyze Data" from the upper main menu. It should bring up the main page and your history pane should now look like the image below.

HistoryPane.jpg

  • Notice that with just 3 viruses there is already over 4 GB of data!


Formatting and Grooming Data

  • Your data is often NOT what you expect

FastQ Format

  • FastQ format can mean many things; see the Wikipedia entry on FastQ format
  • The label "Illumina FastQ format" is not informative either, because the format has changed over time
  • Click on the pencil icon of one of the virus datasets to pull up its attributes; your screen should look a bit like this:

DataType.jpg

  • In Galaxy, the data type a tool expects must match EXACTLY the data type shown in your history pane; otherwise that dataset will not appear in the tool's drop-down menu for data selection.
  • Galaxy requires that everything be in Sanger format before it can be used. If you know your data is in Sanger format, select fastqsanger for your data type. If it is not in that format, select fastq and run the FastQ Groomer.
  • FastQ groomer is very useful, but slow
  • Not a bad idea to run it if you are dealing with a new machine and have the time. Costly in terms of space.
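
As a rough illustration of why the encoding matters, the sketch below guesses the quality offset from the range of quality characters in a file. It is not what the FastQ Groomer does internally; the function name, the return labels, and the cutoffs (ASCII 59 as the lowest character possible with a +64 offset, ASCII 74 as the usual ceiling for Illumina output written in Sanger/Phred+33 encoding) are assumptions for this example.

```python
def guess_quality_offset(quality_lines):
    """Guess the quality offset from the quality characters observed.

    Returns "phred33" (Sanger), "phred64" (older Illumina/Solexa),
    or "ambiguous" when the observed range fits both.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur with a +64 offset.
        return "phred33"
    if highest > 74:
        # Assumption: Phred+33 data from a sequencer rarely goes above 'J'.
        return "phred64"
    return "ambiguous"


if __name__ == "__main__":
    print(guess_quality_offset(["IIII#5"]))   # contains '#', so Phred+33
    print(guess_quality_offset(["hhhhgg"]))   # high characters, so Phred+64
```

A guess of "ambiguous" on a small sample is common; looking at more reads usually settles it.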

Reference Genome

  • Ensure the correct reference genome is selected
  • WARNING: seeing a reference genome in the list doesn't mean it is actually installed and available for use
  • The current list of genomes available at UAB is here
  • Contact the UAB Galaxy team or me if you need a genome added (similar procedure for Penn State)
  • Used for a variety of tools, it can be overridden earlier
  • Any problems with CMV-21950R_3_1.fastq ?

Assessing the quality of the data

We will use a number of different tools from the "NGS: QC and manipulation" drop down menu. Try processing the FastQ files with:

  • FastQC
  • Compute Quality Statistics


Running FastQC

FastQC gives an attractive visual output and will flag potential problems.

  • Select NGS: QC and manipulation -> FastQC

FastQCRunSettings.jpg

  • Run it on all the fastq files
  • Remember, you are on a cluster, so these jobs can run in parallel

FastQC Results

Take a look at the other areas that show up as quality issues.

  • FastQC flags potential problems
  • Are there any problems for us?

CMV VAC WR 3 2 base quality.jpg

  • It is imperative to understand your organism and its biology to interpret the data

Running Quality Statistics and Boxplot

  • Galaxy has its own set of tools for computing quality statistics (using R)
  • Generates raw statistics in tabular format which can then be used for a pretty box plot
  • Select NGS: QC and manipulation -> Compute Quality Statistics for CMV-21950R_3_1.fastq
  • Run it to compute the tabular file
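
To make the idea of a per-position quality table concrete, here is a minimal sketch that computes the minimum, median, and maximum score at each read position. It is not the Galaxy tool itself (which reports more columns); the function name and the default Sanger offset of 33 are assumptions for this example.

```python
from statistics import median


def per_position_quality(quality_lines, offset=33):
    """Return one (position, min, median, max) tuple per read position.

    quality_lines: the quality strings (4th line of each FastQ record).
    offset: ASCII offset of the encoding; 33 assumes Sanger format.
    """
    columns = {}
    for line in quality_lines:
        for pos, ch in enumerate(line.strip(), start=1):
            columns.setdefault(pos, []).append(ord(ch) - offset)
    return [
        (pos, min(scores), median(scores), max(scores))
        for pos, scores in sorted(columns.items())
    ]


if __name__ == "__main__":
    # Print a small tab-separated table, one row per read position.
    for row in per_position_quality(["II#", "I5#"]):
        print("\t".join(str(value) for value in row))
```

A table like this is what a box and whiskers plot summarizes: one box per column (read position).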

RunComputeQualityStats.jpg

  • Why wait? Jobs can be queued, so go ahead and queue up a box and whiskers plot
  • "NGS: QC and manipulation" -> Draw Quality Score Boxplot
  • Can run it on "Data 13" (or whatever) even if Data 13 doesn't exist yet / isn't ready

DrawQualityScoreBoxplot.jpg

Quality Statistics and Boxplot Results

  • Should have a tabbed result file that is easy to manipulate in Galaxy

QualityStatsResults.jpg

  • Should get a nice box and whiskers plot

Cmv21950r 3 1 original boxplot.jpg

  • Any problems with this data? How does it compare to the FastQC output?

Trimming Reads

  • About a quarter of the base calls at the last read position are of poor quality
  • Trim off the 3' end

FastQTrimmerRun.jpg

  • There are other ways to clean up the data; reads can also be filtered by other criteria
  • What to keep and what to throw away depends on your requirements
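
Trimming the 3' end amounts to cutting the same number of characters off the end of both the sequence and the quality string. A minimal sketch, assuming each read is held as a (header, sequence, plus line, quality) tuple; the function name and record layout are illustrative and not taken from the Galaxy trimmer:

```python
def trim_three_prime(record, n_bases=1):
    """Remove n_bases from the 3' end of one FastQ record.

    record: (header, sequence, plus_line, quality) tuple of strings.
    """
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    if n_bases <= 0:
        return record
    # Sequence and quality must stay the same length after trimming.
    return (header, seq[:-n_bases], plus, qual[:-n_bases])


if __name__ == "__main__":
    read = ("@read1", "ACGTN", "+", "IIII#")
    print(trim_three_prime(read))
```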

Results

  • Take the trimmed data, rerun Compute Quality Statistics, and draw the boxplot again

Cmv21950r 3 1 trimmed boxplot.jpg

  • Results should look like above if everything was done correctly
  • You may want to rename your trimmed result file to something more useful, such as "Cmv21950r 3 1 trimmed"

Short read alignment to reference genome using BWA

  • BWA is the algorithm of choice for DNA-Seq with Illumina data
  • CASAVA 1.8 may do as well for SNPs, but BWA handles indels better
  • BWA will align all the short reads to our reference genome
  • Select NGS Mapping -> Map with BWA for Illumina

Bwa run cmv 227r.jpg

  • Do this for all 3 genomes; each run will use 2 FastQ files

Samtools

  • Essential toolset, performs a variety of functions
  • Format conversion
  • Viewing BAM files
  • Flagstats
  • Generation of Pileup

BAM Conversion

  • The output of many mapping programs (including BWA) is in SAM format and must in many cases be converted to the binary format (BAM) for downstream analysis
  • The conversion is lossless.
  • Because it is a binary file, the BAM file is significantly smaller than the SAM file
  • Convert all 3 SAM files to BAM

Sam2bam.jpg

  • In this case we did not do any filtering on the SAM file prior to conversion to BAM
  • When high-quality results are needed, a filtering step is often added to remove reads which:
  • Fail quality control
  • Are optical duplicates
  • Are not properly paired
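
The three criteria above map onto bits of the FLAG field (column 2) of each SAM record. The sketch below shows one way such a filter could work on SAM text lines; it is not the Galaxy filter tool, the function names are hypothetical, and it assumes duplicates have already been marked by an earlier step (the 0x400 bit covers both PCR and optical duplicates).

```python
# FLAG bits defined by the SAM specification.
PROPER_PAIR = 0x2
QC_FAIL = 0x200
DUPLICATE = 0x400


def passes_filter(sam_line):
    """True if the read passed QC, is not a duplicate, and is properly paired."""
    flag = int(sam_line.split("\t")[1])
    if flag & (QC_FAIL | DUPLICATE):
        return False
    return bool(flag & PROPER_PAIR)


def filter_sam(lines):
    """Keep header lines (starting with '@') and reads that pass the filter."""
    return [ln for ln in lines if ln.startswith("@") or passes_filter(ln)]


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.0",
        "r1\t99\tref\t1\t60\t4M\t=\t10\t13\tACGT\tIIII",    # properly paired
        "r2\t1123\tref\t1\t60\t4M\t=\t10\t13\tACGT\tIIII",  # marked duplicate
        "r3\t65\tref\t1\t60\t4M\t=\t10\t13\tACGT\tIIII",    # not properly paired
    ]
    for line in filter_sam(sam):
        print(line)
```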

Additionally, a sorting step may be done; this is important if you are going to examine the BAM file in a browser like IGV.

  • See Shared Data -> Published Workflows -> "FastQ to High Quality, Filtered, Headered, Sorted BAM" or
  • Click on this workflow

Flagstat

  • Examines the flag integer in the BAM File
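
The flag integer is a bit field: each bit answers one yes/no question about the read. The sketch below decodes a flag into named properties and tallies them over several reads, which is roughly the kind of summary Flagstat reports. The bit values follow the SAM specification; the short names and function names are chosen for this example.

```python
from collections import Counter

# (bit, short name) pairs; bit values are from the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag integer."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def tally_flags(flags):
    """Count how many reads have each property set."""
    counts = Counter()
    for flag in flags:
        counts.update(decode_flag(flag))
    return counts


if __name__ == "__main__":
    print(decode_flag(99))
    print(tally_flags([99, 147, 4]))
```

For example, a flag of 99 is 1 + 2 + 32 + 64: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of the pair.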

SamBamFileFormatFlags.jpg

Flagstat.jpg


Pileup

  • Can think of the reads "piling up" on each other
  • Each base pair location on the genome is assigned a representative base
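
A toy version of the idea, assuming ungapped reads given as (1-based start position, sequence) pairs: the reads are stacked by position, and each position gets its read depth and most frequent base. The simple majority vote is a stand-in chosen for this example; Samtools itself uses a statistical model, and real alignments also need indel and base quality handling.

```python
from collections import Counter, defaultdict


def simple_pileup(alignments):
    """Stack ungapped reads and summarize each reference position.

    alignments: iterable of (start, sequence) pairs, start is 1-based.
    Returns {position: (depth, most_common_base)}.
    """
    columns = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return {
        pos: (sum(counts.values()), counts.most_common(1)[0][0])
        for pos, counts in sorted(columns.items())
    }


if __name__ == "__main__":
    reads = [(1, "ACGT"), (2, "CGTA"), (3, "GAAC")]
    for pos, (depth, base) in simple_pileup(reads).items():
        print(pos, depth, base)
```

In this example position 4 is covered by three reads showing T, T, and A, so the representative base is T with a depth of 3; the disagreeing A is the kind of signal a variant caller examines.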

GeneratePileup.jpg

Alternatives to Samtools for Variant Detection

  • GATK - Emerging as the new best practice to call variants, not integrated into Galaxy just yet.
  • PATRIC - Whole genome annotation (once you have a sequence) for microbial genomes
  • MAQ and others - not used too much anymore

SNPEff - Variation Summation

Summarizing variants and effects of SNPs.

  • The reference genome must be in SNPEff, which needs a GFF3 file of your genome annotation
  • This is true for most genomes (even Vaccinia WR, now)
  • SNPEff looks at heterozygous and homozygous SNPs, MNPs, etc.
  • Plots coverage, indel sizes

Running SNPEff

  • Select SNP:Effect -> SNPEff

SNPEff.jpg

  • Run on any pileup file


Results from SNPEff

  • Only as good as your genome annotation file
  • Introns in viruses?

SNPEffCoverage.jpg

SNPEffResults.jpg


Locating Relevant SNPs versus Control

FilterRunVacWR.jpg

PileupFiltering100Reads.jpg


De novo assembly (time permitting)

Viewing results in IGV