GalaxyFilterCommonSnps

Protocol for Filtering common SNPs from a set of alignments

Galaxy supports set operations on single columns, so I build an index column for each sample formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt below).
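As a rough illustration of what the index column contains, here is a minimal Python sketch that builds the "chr:pos:ref:alt" key from one tab-separated VCF data line. The function name `index_alt` is hypothetical; in the protocol itself the column is computed inside Galaxy, not with this code.

```python
def index_alt(vcf_line):
    """Build the "chr:pos:ref:alt" key from one tab-separated VCF data line.

    Assumes the standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    return ":".join((chrom, pos, ref, alt))


# First seven columns of an example SNP row (INFO and later columns omitted).
row = "chrI\t2323\t.\tC\tT\t471.72\t."
print(index_alt(row))  # chrI:2323:C:T
```

Because the key is a single string, two variants match only if chromosome, position, reference and alternate allele all agree, which is what makes single-column set operations sufficient.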

= Tools used =


 * Text Manipulation
 * Compute
 * Cut
 * Concatenate Datasets tail-to-head
 * Filter data on any column
 * Join, Subtract and Group
 * Group data by a column
 * Compare two Datasets

= Step by Step =

Example rows from the intermediate files (tab-separated):

Indexed snpEff file for one sample (14 columns; c13 is "chr:pos:ref", c14 is the idxAlt). The first row carries the sample name (JH03_B8M2), the second is a SNP:

 chrLAB	0	.	chrLAB	JH03_B8M2	0	.	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	chrLAB:0:chrLAB	chrLAB:0:chrLAB:JH03_B8M2
 chrI	2323	.	C	T	471.72	.	AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	GT:AD:DP:GQ:PL	0/1:175,58:234:99:502,0,6142	DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	0/1	chrI:2323:C	chrI:2323:C:T

idxAlt file extracted for one sample (one idxAlt per line):

 chrLAB:0:chrLAB:JH01_B8M1
 chrI:2323:C:T
 chrI:2331:A:C
 chrI:3981:A:T
 ...

Grouped output (idxAlt, number of samples it appears in):

 2micron:265:G:A	10
 chrI:100399:G:C	10
 chrI:101282:C:A	10

 * 1) For each sample:
   * 1) Create the BAM files (usually with BWA + GATK realigner)
   * 2) Create a VCF of variant SNPs (mpileup or GATK)
   * 3) Run snpEff, compute the "indexAlt" column, and extract that index to its own file
     * I use the workflow "SOP: index snpEffect with Sample name", which also computes several other files and indices needed for building a SNP vs Samples grid.
 * 2) Concatenate the idxAlt files from all samples into one file
 * 3) Group on c1, computing count(c1)
   * This produces one line for every SNP found in any sample, with a count of how many samples it appears in
 * 4) Filter to select only records where count=num_samples
 * 5) For each sample, remove the common SNP rows
   * I use the workflow "SOP: VCF_fullidx remove common SNPs"
   * 1) Text > Compare two data sets
     * Compare: snpEff+fullIdx [sample] (list of all variants for sample)
     * Using: c14 (the idxAlt)
     * Against: idxAlt with count=num_samples (common variants)
     * and column: c1 (the idxAlt)
   * 2) Trim off the index columns to get back to a VCF
     * Text > cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
     * or any other subset of fields you want to report on
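
The set logic behind the steps above (concatenate, group and count, filter on count=num_samples, then remove matching rows and cut) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Galaxy workflows themselves: the function names, the sample names `s1` to `s3`, and the `make_row` helper that fakes a 14-column row are all hypothetical, and only the idxAlt values are taken from the examples on this page.

```python
from collections import Counter


def common_keys(samples):
    """Return the idxAlt keys that appear in every sample.

    `samples` maps a sample name to its list of idxAlt keys. This mirrors
    concatenating the idxAlt files, grouping on c1 with count(c1), and
    keeping records where count == num_samples.
    """
    counts = Counter()
    for keys in samples.values():
        counts.update(set(keys))  # count each key at most once per sample
    num_samples = len(samples)
    return {key for key, count in counts.items() if count == num_samples}


def remove_common(rows, common, key_col=13, keep_cols=11):
    """Drop rows whose idxAlt (c14, zero-based index 13) is common,
    then keep only c1..c11, as the final cut step does."""
    return [row[:keep_cols] for row in rows if row[key_col] not in common]


def make_row(key):
    """Hypothetical helper: fake a 14-column row whose c14 is the idxAlt."""
    chrom, pos, ref, alt = key.split(":")
    return [chrom, pos, ".", ref, alt] + ["."] * 8 + [key]


# Made-up toy data: three samples sharing one SNP.
samples = {
    "s1": ["chrI:2323:C:T", "chrI:2331:A:C"],
    "s2": ["chrI:2323:C:T", "chrI:3981:A:T"],
    "s3": ["chrI:2323:C:T"],
}
common = common_keys(samples)
print(common)  # {'chrI:2323:C:T'}

s1_rows = [make_row(key) for key in samples["s1"]]
for row in remove_common(s1_rows, common):
    print("\t".join(row))
```

Note that the sample-name rows (such as chrLAB:0:chrLAB:JH01_B8M1) have a different idxAlt in every sample, so they never reach count=num_samples and are kept.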