UAB Galaxy RNA Seq Step by Step Tutorial
Revision as of 19:18, 14 September 2011
Transcriptome analysis via RNA-Seq
There are several types of RNA-Seq: transcriptome, splice-variant/TSS/UTR analysis, microRNA-Seq, etc. This tutorial focuses on a two-condition, one-replicate transcriptome analysis in mouse.
Background
Web Resources
- Jeremy Goecks' Galaxy RNAseq tutorial http://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise
- Tophat short-read RNA mapper http://tophat.cbcb.umd.edu/
- Cufflinks transcript assembler http://cufflinks.cbcb.umd.edu/
- Tophat-cuff[links,compare,diff] tutorial (command-line) http://cufflinks.cbcb.umd.edu/tutorial.html
- IGV (Broad Integrated Genome Viewer) http://www.broadinstitute.org/software/igv/download
- IGB (Integrated Genome Browser) http://bioviz.org/igb/download.shtml
- UCSC Genome Browser http://genome.ucsc.edu/
File Types
- FASTQ - text file; like a FASTA but with extra lines for read quality scores for each base. This is what you usually get from the sequencing center.
- BAM - Binary compressed Sequence Alignment/Map format - Holds every short read that was mapped, including each read's location and alignment, plus a header listing the names and lengths of the reference sequences (not the reference sequence itself). It can also contain the "left-over" reads from the source FASTQs that did NOT align to the reference genome! These files must be visualized with a tool: IGV, IGB, or sometimes the UCSC Genome Browser.
- BED - text file; a list of locations, in genomic coordinates.
- GTF (Gene Transfer Format) - text file; contains gene/transcript annotations for a sequence/genome, also in genomic coordinates.
- FASTA - text file; contains a list of named sequences. Can be used to hold the sequence of a reference genome to which RNAseq reads will be mapped.
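To make the difference between FASTQ and FASTA concrete, here is a minimal Python sketch that reads FASTQ records. The function name `read_fastq` and the sample record are made up for illustration, and the sketch assumes the common layout of exactly four lines per read (header, sequence, `+` separator, quality string).

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        # Each base must have exactly one quality character
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, qual

# Illustrative record, not real sequencing data
example = "@read1\nACGT\n+\nIIII\n"
records = list(read_fastq(io.StringIO(example)))
print(records)
```

For a real file you would pass an open file handle (or `gzip.open(path, "rt")` for a .fastq.gz file) instead of the `io.StringIO` wrapper.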
Upload data
For this tutorial, we will have 2 mouse samples, sequenced with paired-end reads on an Illumina machine. That gives us 4 FASTQ files to upload (forward and reverse sequences for each sample).
FTP site for compressed fastq.gz files
- control_mm9_chr15_Plekhh2-PigF_forward.fastq.gz
- control_mm9_chr15_Plekhh2-PigF_reverse.fastq.gz
- drugged_mm9_chr15_Plekhh2-PigF_forward.fastq.gz
- drugged_mm9_chr15_Plekhh2-PigF_reverse.fastq.gz
Or they can be found inside Galaxy at Shared Data / Data Libraries / Tutorial Data Sets / RNAseq Tutorial
Set filetype and genome
To start, you must move the data (FASTQ) from the sequencing center into the Galaxy instance. Be sure to specify the filetype ("fastqsanger" for UAB and HudsonAlpha; the default "fastq" will not work) and the organism that was sequenced ("Genome database"); in this tutorial, that is "mm9" for mouse.
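The "fastqsanger" filetype tells Galaxy that the quality characters use the Sanger encoding, in which the Phred score is the character's ASCII code minus 33. The following Python sketch shows that decoding; the function names and the quality string are illustrative only.

```python
def sanger_qualities(quality_string):
    """Convert a Sanger (Phred+33) quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]

def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)

# Illustrative quality string, not taken from the tutorial data
scores = sanger_qualities("I5!")
print(scores)
print([error_probability(q) for q in scores])
```

Other FASTQ variants use a different offset, which is why downstream tools need the filetype set correctly before they can interpret the scores.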
Check quality of READ data
At this step, we check the quality of sequencing. This says nothing about the quality of the sample, or whether it was the right sample. We'll check that later.
[NGS: QC and manipulation > FASTQ QC > Fastqc]
If some of the reports are not green, discuss this with your sequencing center.
In some cases, you may need to trim bases off the beginning or ends of the sequence. See the tools in NGS: QC and manipulation > FASTX-TOOLKIT FOR FASTQ DATA.
Align using TopHat: inputs
For the moment, TopHat is the only NGS aligner for transcript data - it's the only one that handles splicing. In addition, it can be set to detect indels relative to the reference genome.
We'll run TopHat once for each sample (twice, in this case).
Menu: [NGS: RNA Analysis > Tophat for Illumina ]
- Reads (FASTQs)
- FASTQs: Because we have paired reads, we need to provide 2 FASTQ files: the forward and reverse reads. To get a place to specify the 2nd FASTQ file, set the "Is this library mate-paired?" pulldown to "Paired-end"; a second pulldown will then appear for the 2nd FASTQ.
- Mean Inner Distance between Mate Pairs is a value you must get from your sequencing center. It is the mean length of the sequenced fragments minus the bases actually read from both ends (fragment length minus twice the read length). For many RNAseq experiments, it is around 150-175.
- Genome (Built-in or FASTA from history)
- We must provide it with the reference genome to align to. In our case, this is mouse, and we'll use the already installed "mm9" genome build. If you have a genome that is not on the list, you can either have us add it, or you can upload a FASTA file of the genome into your history and point TopHat at that.
- TopHat settings to use
- Defaults
- This is the easy thing to do, but, as the manual said, "There is no such thing (yet) as an automated gearshift in splice junction identification. It is all like stick-shift driving in San Francisco. In other words, running this tool with default parameters will probably not give you meaningful results."
- Full parameter list
- The most common things to change:
- Allow indel search (from NO to YES)
- Minimum isoform fraction (to 0 if looking for rare isoforms)
- Maximum/minimum intron length (defaults are for mammals; other critters will do better with stricter settings)
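As a quick check on the "Mean Inner Distance between Mate Pairs" value described in the inputs above, the arithmetic can be sketched in Python. The function name and the numbers are illustrative assumptions; your sequencing center should supply the real fragment length, and it should be the insert size without adapters.

```python
def mean_inner_distance(mean_fragment_length, read_length):
    """Inner distance = fragment length minus the bases covered by the two mates."""
    return mean_fragment_length - 2 * read_length

# Illustrative values: 300 bp fragments sequenced with 2 x 75 bp reads
inner = mean_inner_distance(300, 75)
print(inner)
```

A negative result simply means the two mates overlap in the middle of the fragment.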
TopHat output
- accepted_hits
- Two binary files: .BAM (data) and .BAI (index)
- These are the actual paired reads mapped to their position on the genome, and split across exon junctions. This can be visualized in IGV or IGB, but you must download both .BAM and .BAI files to the same directory.
- splice_junctions
- BED file (list of genomic locations, no sequence) listing all the places TopHat had to split a read into two pieces to span an exon junction. This can be visualized at UCSC or in IGV, etc.
- deletions (if indel search is on)
- insertions (if indel search is on)
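If you want to inspect a BED output such as splice_junctions outside of a genome browser, the Python sketch below parses the first six standard BED columns (chrom, start, end, name, score, strand). BED starts are 0-based and ends are exclusive, so the length of an interval is end minus start. The `BedInterval` type, the `parse_bed_line` helper, and the sample line are hypothetical and not taken from real TopHat output.

```python
from collections import namedtuple

BedInterval = namedtuple("BedInterval", "chrom start end name score strand")

def parse_bed_line(line):
    """Parse the first six tab-separated BED columns; optional ones become None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need at least chrom, start and end")
    # Pad so that missing optional columns (name, score, strand) are None
    fields = fields[:6] + [None] * (6 - len(fields[:6]))
    chrom, start, end, name, score, strand = fields
    # Score is kept as text because tools use this column in different ways
    return BedInterval(chrom, int(start), int(end), name, score, strand)

# Made-up example line for illustration
interval = parse_bed_line("chr15\t100\t250\tjunction1\t12\t+\n")
print(interval, interval.end - interval.start)
```

Lines beginning with "track" or "browser" are display headers rather than intervals, so skip them before calling the parser.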
QC alignment
- NGS: SAM Tools > flagstat
- %mapped
- visualize in IGV, IGB or UCSC
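The %mapped figure comes down to counting alignment records whose SAM flag does not have the "unmapped" bit (0x4) set. The Python sketch below shows roughly that calculation on a list of flag values; the function name and the example flags are illustrative, and in practice you would read the number straight from the flagstat report.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "this read is unmapped"

def percent_mapped(flags):
    """Return the percentage of records whose unmapped bit is not set."""
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED_FLAG)
    return 100.0 * mapped / len(flags)

# Illustrative flags: one mapped pair (99, 147) and one unmapped pair (77, 141)
example_flags = [99, 147, 77, 141]
pct = percent_mapped(example_flags)
print(pct)
```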
Construct transcripts : Cufflinks
- De novo vs existing annotation
- UAB Modified Reference annotations
Compare transcript levels: cuffdiff/cuffcompare