GalaxyFilterCommonSnps
Latest revision as of 22:27, 19 December 2011
Protocol for Filtering common SNPs from a set of alignments
Galaxy supports set operations on single columns. Thus, I build an index column for each sample, formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt below).
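As a rough illustration of what the indexAlt key looks like outside Galaxy, here is a minimal Python sketch. It assumes a tab-separated VCF data line whose first five fields are CHROM, POS, ID, REF and ALT; the function name `index_alt` is hypothetical and is not part of Galaxy or of the workflows described on this page.

```python
def index_alt(vcf_line: str) -> str:
    """Build the "chr:pos:ref:alt" key from one tab-separated VCF data line.

    Assumption: the first five fields are CHROM, POS, ID, REF, ALT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    return ":".join((chrom, pos, ref, alt))


# Example using the first fields of the sample row shown in this protocol.
line = "chrI\t2323\t.\tC\tT\t471.72\t."
print(index_alt(line))  # chrI:2323:C:T
```

Two samples carry the same variant exactly when their keys are equal strings, which is what lets single-column set operations stand in for a multi-column comparison.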
Tools used
- Text Manipulation
- Compute
- Cut
- Concatenate Datasets tail-to-head
- Filter data on any column
- Join, Subtract and Group
- Group data by a column
- Compare two Datasets
Step by Step
- For each sample
- Create the BAM files (usually with BWA + GATK realigner)
- Create VCF of variant SNPs (mpileup or GATK)
- Run snpEff, compute the "indexAlt" column and extract that index to its own file
- I use the workflow SOP: index snpEffect with Sample name, which also computes several other files and indices needed for building a SNP vs. Samples grid.
chrLAB 0 . chrLAB JH03_B8M2 0 . JH03_B8M2 JH03_B8M2 JH03_B8M2 JH03_B8M2 JH03_B8M2 chrLAB:0:chrLAB chrLAB:0:chrLAB:JH03_B8M2
chrI 2323 . C T 471.72 . AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|) GT:AD:DP:GQ:PL 0/1:175,58:234:99:502,0,6142 DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|) 0/1 chrI:2323:C chrI:2323:C:T
- Concatenate idxAlt files from all samples into one file
chrLAB:0:chrLAB:JH01_B8M1
chrI:2323:C:T
chrI:2331:A:C
chrI:3981:A:T
...
- Group on c1, computing count(c1)
- this produces one line for every SNP in any sample, with a count of how many samples it appears in
- Filter to select only records where count()=num_samples
2micron:265:G:A 10
chrI:100399:G:C 10
chrI:101282:C:A 10
- For each sample, remove the common SNP rows
- I use the workflow SOP: VCF_fullidx remove common SNPs (https://galaxy.uabgrid.uab.edu/u/curtish/w/sop-vcffullidx-remove-common-snps)
- Text > Compare two data sets
- Compare: snpEff+fullIdx [sample] (list of all variants for sample)
- Using: c14 (the idxAlt)
- Against: idxAlt with count=num_samples (common variants)
- and column: c1 (the idxAlt)
- Trim off the index columns to get back to a VCF
- Text > cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
- or any other subset of fields you want to report on
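The group, filter, compare and cut steps above can be sketched in plain Python as a sanity check on what the Galaxy tools should produce. This is only an illustration under stated assumptions, not the workflow itself: the function names `common_keys` and `remove_common` are hypothetical, each key is counted at most once per sample, the idxAlt is taken to be column c14, and the Compare step is assumed to keep the sample rows that do not match the common list.

```python
from collections import Counter


def common_keys(samples):
    """Return the idxAlt keys that appear in every sample.

    `samples` maps a sample name to its list of idxAlt keys. This mirrors
    "group on c1, count(c1), keep count = num_samples".
    """
    counts = Counter()
    for keys in samples.values():
        counts.update(set(keys))  # count each key at most once per sample
    return {key for key, n in counts.items() if n == len(samples)}


def remove_common(rows, common, key_col=13, keep_cols=11):
    """Drop rows whose idxAlt (c14, zero-based index 13) is common.

    Each surviving row is cut back to its first `keep_cols` columns
    (c1..c11), like the final Cut step.
    """
    return [row[:keep_cols] for row in rows if row[key_col] not in common]


# Toy data: two samples sharing one variant (values are illustrative only).
samples = {
    "sampleA": ["chrI:2323:C:T", "chrI:2331:A:C"],
    "sampleB": ["chrI:2323:C:T", "chrI:3981:A:T"],
}
common = common_keys(samples)
print(sorted(common))  # ['chrI:2323:C:T']

# Two 14-column rows for sampleA; columns 6..13 are placeholders.
rows = [
    ["chrI", "2323", ".", "C", "T"] + ["x"] * 8 + ["chrI:2323:C:T"],
    ["chrI", "2331", ".", "A", "C"] + ["x"] * 8 + ["chrI:2331:A:C"],
]
filtered = remove_common(rows, common)
print(len(filtered), filtered[0][:5])  # 1 ['chrI', '2331', '.', 'A', 'C']
```

Because each sample's header row carries the sample name in its key, it can never reach count = num_samples and so is never treated as common.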