GalaxyFilterCommonSnps

Latest revision as of 22:27, 19 December 2011

Protocol for Filtering common SNPs from a set of alignments

Galaxy supports set operations on single columns. Thus, I build an index column for each sample, formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt in the steps below).

Tools used

  • Text Manipulation
    • Compute
    • Cut
    • Concatenate Datasets tail-to-head
    • Filter data on any column
  • Join, Subtract and Group
    • Group data by a column
    • Compare two Datasets


Step by Step

  1. For each sample
    1. Create the BAM files (usually with BWA + GATK realigner)
    2. Create a VCF of variant SNPs (mpileup or GATK)
    3. Run snpEff, compute the "indexAlt" (idxAlt) column, and extract that index to its own file
chrLAB	0	.	chrLAB	JH03_B8M2	0	.	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	chrLAB:0:chrLAB	chrLAB:0:chrLAB:JH03_B8M2
chrI	2323	.	C	T	471.72	.	AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	GT:AD:DP:GQ:PL	0/1:175,58:234:99:502,0,6142	DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	0/1	chrI:2323:C	chrI:2323:C:T
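For readers who want to check the index columns outside Galaxy, here is a minimal Python sketch of how the last two columns in the sample row above can be derived from the first five VCF fields. The function name `add_index_columns` is hypothetical and not part of the Galaxy workflow; it assumes the row has already been split on tabs.

```python
def add_index_columns(row):
    """Append 'chr:pos:ref' and 'chr:pos:ref:alt' keys to a tab-split VCF row.

    Hypothetical helper for illustration; in Galaxy this is done with the
    Compute tool rather than with a script.
    """
    chrom, pos, _id, ref, alt = row[:5]
    idx = f"{chrom}:{pos}:{ref}"      # position + reference allele
    idx_alt = f"{idx}:{alt}"          # the idxAlt used for set operations
    return row + [idx, idx_alt]


row = "chrI\t2323\t.\tC\tT\t471.72".split("\t")
print(add_index_columns(row)[-1])
```

Including the alternate allele in the key means two samples only match at a position when they carry the same variant, not merely a variant at the same site.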
  2. Concatenate idxAlt files from all samples into one file
chrLAB:0:chrLAB:JH01_B8M1
chrI:2323:C:T
chrI:2331:A:C
chrI:3981:A:T
...
  3. Group on c1, computing count(c1)
    • this produces one line for every SNP in any sample, with a count of how many samples it appears in
  4. Filter to select only records where count()=num_samples
2micron:265:G:A	10
chrI:100399:G:C	10
chrI:101282:C:A	10
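The concatenate, group, and filter steps amount to counting how many samples each idxAlt appears in and keeping the keys seen in all of them. A rough Python equivalent, assuming each sample's idxAlt file has been read into a list of strings (the function name `common_keys` is hypothetical), could look like this:

```python
from collections import Counter


def common_keys(per_sample_keys):
    """Return the idxAlt keys that occur in every sample.

    per_sample_keys: one list of 'chr:pos:ref:alt' strings per sample.
    """
    num_samples = len(per_sample_keys)
    counts = Counter()
    for keys in per_sample_keys:
        # set() counts a key once per sample. The Galaxy Group step counts
        # rows instead, so the two differ only if a sample repeats a key.
        counts.update(set(keys))
    return {key for key, n in counts.items() if n == num_samples}


samples = [
    ["chrI:2323:C:T", "chrI:2331:A:C"],
    ["chrI:2323:C:T"],
    ["chrI:2323:C:T", "chrI:3981:A:T"],
]
print(common_keys(samples))
```

In the sample output above, the count of 10 corresponds to `num_samples` for a run with ten samples.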
  5. For each sample, remove the common SNP rows
    1. Join, Subtract and Group > Compare two Datasets
      • Compare: snpEff+fullIdx [sample] (list of all variants for sample)
      • Using: c14 (the idxAlt)
      • Against: idxAlt with count=num_samples (common variants)
      • and column: c1 (the idxAlt)
      • To find: Non Matching rows of 1st dataset (keeps only the variants that are not common to all samples)
    2. Trim off the index columns to get back to a VCF
      • Text Manipulation > Cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
      • or any other subset of fields you want to report on
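The final compare-and-cut step can be sketched in Python as a single pass over a sample's rows. This is an illustration under assumptions, not the Galaxy implementation: `remove_common` is a hypothetical name, rows are assumed to be tab-split lists with the idxAlt in column 14 (zero-based index 13), and `common` is the set of keys found in every sample.

```python
def remove_common(rows, common, key_col=13, keep_cols=11):
    """Drop rows whose idxAlt is in `common`, then keep the first columns.

    key_col=13 is c14 in Galaxy's 1-based numbering; keep_cols=11 mirrors
    the cut of c1..c11 that strips the helper index columns.
    """
    return [row[:keep_cols] for row in rows if row[key_col] not in common]


# Two 14-column rows; the filler "x" values stand in for the real VCF fields.
shared = ["chrI", "2323", ".", "C", "T"] + ["x"] * 8 + ["chrI:2323:C:T"]
unique = ["chrI", "2331", ".", "A", "C"] + ["x"] * 8 + ["chrI:2331:A:C"]

kept = remove_common([shared, unique], {"chrI:2323:C:T"})
print(len(kept), kept[0][:5])
```

Changing `keep_cols` (or replacing the slice with an explicit column list) corresponds to choosing a different subset of fields to report.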