Revision as of 19:43, 15 September 2011

Galaxy DNA-Seq Tutorial

Linking to data

Link in the Mark Pritchard Vaccinia virus data set.

Start with a blank history: there should be no numbered items in the history pane on the right-hand side. If there are, create a new history.
Select "Shared Data" from the top of the screen to bring up the Shared Data screen
Select Data Library
Select "Pritchard Vaccinia WR FastQ Files" from the alphabetically sorted list

Click on the subfolder top box to select all 6 files
Select "import to current history"
Click on "Analyze Data" from the upper main menu. It should bring up the main page and your history pane should now look like the image below.

Notice that with just 3 viruses there are already over 4 GB of data!

Formatting and Grooming Data

Your data is often NOT what you expect

FastQ Format

FastQ format can mean many things; see the Wikipedia entry on FastQ format
"Illumina FastQ format" is also not informative, because the quality encoding has changed between versions
Click on the pencil icon on one of the virus datasets to pull up its attributes; your screen should look a bit like this:

In Galaxy the data type a tool expects must match EXACTLY the data type of the item in your history pane; otherwise that item will not appear in the tool's drop-down menu for data selection.
Galaxy requires that everything be in Sanger format to be used. If you know your data is in Sanger format, select fastqsanger for your data type. If it is not in that format, select fastq and run the FastQ Groomer.
FastQ groomer is very useful, but slow
Not a bad idea to run it if you are dealing with a new machine and have the time. Costly in terms of space.
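
If you are unsure which encoding a file uses, the range of characters in the quality lines gives a strong hint. The sketch below is a heuristic only, not what the FastQ Groomer does internally. It assumes simple 4-line FastQ records, and the function names are made up for illustration. Characters below ';' (ASCII 59) can only occur with the Sanger offset of 33.

```python
def quality_char_range(fastq_lines):
    """Return (lowest, highest) ASCII code seen in the quality lines.

    Assumes plain 4-line FastQ records (header, sequence, '+', quality).
    """
    lo, hi = 255, 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # every 4th line holds the quality string
            codes = [ord(c) for c in line.strip()]
            if codes:
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    return lo, hi


def guess_encoding(fastq_lines):
    """Heuristic guess at the Galaxy data type; not a substitute for grooming."""
    lo, hi = quality_char_range(fastq_lines)
    if lo > hi:
        return "unknown"  # no quality characters were seen
    if lo < 59:
        return "fastqsanger"  # offset 33: offset-64 files cannot go this low
    if hi > 74:
        return "fastqillumina or fastqsolexa"  # offset 64 is far more likely
    return "ambiguous"  # range fits both offsets; inspect more reads
```

Run it on the first few thousand lines of a file rather than the whole thing; a small sample is usually enough to see a low-scoring base.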

Reference Genome

Ensure the correct reference genome is selected
WARNING - seeing a reference genome in the list doesn't mean it is actually installed and available for use
The current list of genomes available at UAB is here
Contact the UAB Galaxy team or me if you need a genome added (similar procedure for Penn State)
Used for a variety of tools, it can be overridden earlier
Any problems with CMV-21950R_3_1.fastq ?

Assessing the quality of the data

We will use a number of different tools from the "NGS: QC and manipulation" drop down menu. Try processing the FastQ files with:

FastQC
Compute Quality Statistics

Running FastQC

FastQC gives an attractive visual output and will flag potential problems.

Select NGS: QC and manipulation -> FastQC

Run it on all the fastq files
Remember, you are on a cluster, so the files can be processed in parallel

FastQC Results

Take a look at the other areas that show up as quality issues.

FastQC flags potential problems
Are there any problems for us?

It is imperative to understand your organism and its biology to interpret the data

Running Quality Statistics and Boxplot

Galaxy has its own set of tools for computing quality statistics (using R)
Generates raw statistics in tabular format which can then be used for a pretty box plot
Select NGS: QC and manipulation -> Compute Quality Statistics for CMV-21950R_3_1.fastq
Run it to compute the tabular file

Why wait? Jobs can be queued, so go ahead and queue up a box and whiskers plot
"NGS: QC and manipulation" -> Draw Quality Score Boxplot
Can run it on "Data 13" (or whatever) even if Data 13 doesn't exist yet / isn't ready
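
To make the tabular output less of a black box, here is a rough sketch of per-position quality statistics. It is not the Galaxy tool's implementation (which reports more columns); it assumes 4-line FastQ records in Sanger encoding (offset 33), and the function name is hypothetical.

```python
from statistics import median


def per_position_quality(fastq_lines, offset=33):
    """Return one row per read position: (position, count, min, max, mean, median)."""
    columns = []  # columns[i] collects the scores seen at read position i
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # only the 4th line of each record holds qualities
            continue
        for pos, ch in enumerate(line.strip()):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ord(ch) - offset)
    rows = []
    for pos, scores in enumerate(columns, start=1):
        rows.append((pos, len(scores), min(scores), max(scores),
                     sum(scores) / len(scores), median(scores)))
    return rows
```

Each row corresponds to one box in the box and whiskers plot: a position near the 3' end with a low minimum and median is a candidate for trimming.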

Quality Statistics and Boxplot Results

Should have a tabbed result file that is easy to manipulate in Galaxy

Should get a nice box and whiskers plot

Problems with this data? How does it compare to the FastQC results?

Trimming Reads

At the last base position, about a quarter of the reads are of poor quality
Trim off the 3' end

There are other ways to clean up; reads can also be filtered by other criteria
What to keep and what to throw away depends on your requirements
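
As an illustration of what trimming the 3' end means, the sketch below shows two simple strategies: dropping a fixed number of bases, and dropping trailing bases below a quality threshold. It is a minimal example assuming Sanger encoding (offset 33), not the code of the Galaxy trimming tool; the function names and the default threshold of 20 are assumptions chosen for illustration.

```python
def trim_fixed(seq, qual, n=1):
    """Remove the last n bases (and their qualities) from the 3' end."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def trim_by_quality(seq, qual, min_score=20, offset=33):
    """Remove trailing bases whose quality score is below min_score."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_score:
        end -= 1
    return seq[:end], qual[:end]
```

The fixed trim treats every read the same way, while the quality-based trim only shortens reads that actually end badly; which one to prefer depends on your requirements.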

Results

Short read alignment to reference genome using BWA

BAM Conversion

The output of many mapping programs is in SAM format, a text format, and must be converted to the binary BAM format. The conversion is lossless.
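
Outside Galaxy, the same conversion is usually done with samtools on the command line. The sketch below wraps that call in Python, assuming samtools is installed and on your PATH. The helper names and file names are hypothetical, and very old samtools versions may additionally need the -S flag to read SAM input.

```python
import subprocess


def sam_to_bam_command(sam_path, bam_path):
    """Build the samtools command line that writes sam_path as BAM to bam_path."""
    # "view -b" asks for BAM output; "-o" names the output file.
    return ["samtools", "view", "-b", "-o", bam_path, sam_path]


def sam_to_bam(sam_path, bam_path):
    """Run the conversion; raises CalledProcessError if samtools fails."""
    subprocess.run(sam_to_bam_command(sam_path, bam_path), check=True)
```

For example, sam_to_bam("aligned.sam", "aligned.bam") would produce a BAM file holding the same alignments as the SAM input.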

Variant Analysis

Samtools

Essential toolset, performs a variety of functions including generation of a "pileup" file.

Other Variant Detection Approaches

GATK - Emerging as the new best practice for calling variants, but not integrated into Galaxy just yet.
SNPEff
PATRIC, annotation sites



