GalaxyFilterCommonSnps
Revision as of 22:27, 19 December 2011 by Curtish@uab.edu (talk | contribs)
Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/
Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.
As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.
Thank you,
The Research Computing Team
Protocol for Filtering common SNPs from a set of alignments
Galaxy supports set operations on single columns only. I therefore build an index column for each sample, formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt below).
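As a rough illustration of what the indexAlt looks like outside Galaxy, the following minimal Python sketch builds the "chr:pos:ref:alt" key from one VCF data line. It assumes tab-separated input with the standard VCF column order (CHROM, POS, ID, REF, ALT, ...); the function name index_alt is hypothetical and not part of any Galaxy tool or workflow.

```python
def index_alt(vcf_line: str) -> str:
    """Build a "chr:pos:ref:alt" key from one tab-separated VCF data line.

    Assumes standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    return f"{chrom}:{pos}:{ref}:{alt}"


# Illustrative line with a shortened INFO field
line = "chrI\t2323\t.\tC\tT\t471.72\t.\tDP=234\tGT\t0/1"
print(index_alt(line))  # chrI:2323:C:T
```

Because the whole variant is packed into one string, a single-column comparison is enough to decide whether two samples share the same SNP.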
Tools used
- Text Manipulation
- Compute
- Cut
- Concatenate Datasets tail-to-head
- Filter data on any column
- Join, Subtract and Group
- Group data by a column
- Compare two Datasets
Step by Step
- For each sample
- Create the BAM files (usually with BWA + GATK realigner)
- Create VCF of variant SNPs (mpileup or GATK)
- Run snpEff, compute the "indexAlt" column and extract that index to its own file
- I use the workflow "SOP: index snpEffect with Sample name", which also computes several other files and indices needed for building a SNP vs. Samples grid.
chrLAB 0 . chrLAB JH03_B8M2 0 . JH03_B8M2 JH03_B8M2 JH03_B8M2 JH03_B8M2 JH03_B8M2 chrLAB:0:chrLAB chrLAB:0:chrLAB:JH03_B8M2
chrI 2323 . C T 471.72 . AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|) GT:AD:DP:GQ:PL 0/1:175,58:234:99:502,0,6142 DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|) 0/1 chrI:2323:C chrI:2323:C:T
- Concatenate the idxAlt files from all samples into one file
chrLAB:0:chrLAB:JH01_B8M1
chrI:2323:C:T
chrI:2331:A:C
chrI:3981:A:T
...
- Group on c1, computing count(c1)
- this produces one line for every SNP in any sample, with a count of how many samples it appears in
- Filter to select only records where count()=num_samples
2micron:265:G:A 10
chrI:100399:G:C 10
chrI:101282:C:A 10
- For each sample, remove the common SNP rows
- I use the workflow SOP: VCF_fullidx remove common SNPs
- Text > Compare two data sets
- Compare: snpEff+fullIdx [sample] (list of all variants for sample)
- Using: c14 (the idxAlt)
- Against: idxAlt with count=num_samples (common variants)
- and column: c1 (the idxAlt)
- Trim off the index columns to get back to a VCF
- Text > cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
- or any other subset of fields you want to report on
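For readers who want to check the logic outside Galaxy, the group, filter, compare and cut steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Galaxy workflows themselves: rows are lists of 14 column values with the idxAlt in c14 (list index 13), the helper names common_keys, remove_common and toy_row are hypothetical, and each key is counted at most once per sample.

```python
from collections import Counter


def common_keys(per_sample_keys):
    """Return the idxAlt keys that occur in every sample.

    Mirrors "Group on c1, count(c1)" followed by "count = num_samples".
    """
    counts = Counter()
    for keys in per_sample_keys:
        counts.update(set(keys))  # count each key at most once per sample
    num_samples = len(per_sample_keys)
    return {key for key, n in counts.items() if n == num_samples}


def remove_common(rows, common, key_col=13, keep_cols=11):
    """Drop rows whose idxAlt (c14) is common, then keep columns c1..c11."""
    return [row[:keep_cols] for row in rows if row[key_col] not in common]


def toy_row(chrom, pos, ref, alt):
    """Build a placeholder 14-column row; only c1, c2, c4, c5 and c14 are real."""
    row = [chrom, pos, ".", ref, alt] + ["."] * 8  # columns c1..c13
    row.append(f"{chrom}:{pos}:{ref}:{alt}")       # column c14, the idxAlt
    return row


sample_a = [toy_row("chrI", "2323", "C", "T"), toy_row("chrI", "2331", "A", "C")]
sample_b = [toy_row("chrI", "2323", "C", "T"), toy_row("chrI", "3981", "A", "T")]

common = common_keys([[r[13] for r in sample_a], [r[13] for r in sample_b]])
filtered_a = remove_common(sample_a, common)

print(sorted(common))                        # ['chrI:2323:C:T']
print([row[:5] for row in filtered_a])       # [['chrI', '2331', '.', 'A', 'C']]
```

Note that the per-sample header row carries the sample name in its key (for example chrLAB:0:chrLAB:JH03_B8M2), so it can never reach a count of num_samples and always survives the filter.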