GalaxyFilterCommonSnps

From Cheaha
Revision as of 22:27, 19 December 2011 by Curtish@uab.edu (talk | contribs)



Protocol for Filtering common SNPs from a set of alignments

Galaxy's set operations work on a single column. I therefore build an index column for each sample, formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt in the steps below).
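As a rough illustration of what the index column contains, the sketch below builds the "chr:pos:ref:alt" key from one tab-separated VCF data line. It is not the Galaxy Compute tool itself; the function name index_alt is hypothetical, and the sample line is an abbreviated version of the chrI record shown later on this page.

```python
def index_alt(vcf_line: str) -> str:
    """Return the "chr:pos:ref:alt" key for one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    return ":".join((chrom, pos, ref, alt))


# Abbreviated record (INFO and genotype fields shortened for readability).
line = "chrI\t2323\t.\tC\tT\t471.72\t.\tDP=234\tGT\t0/1"
print(index_alt(line))  # chrI:2323:C:T
```

Because the key is a single string, any tool that compares or groups on one column can treat it as the identity of the variant.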

Tools used

  • Text Manipulation
    • Compute
    • Cut
    • Concatenate Datasets tail-to-head
    • Filter data on any column
  • Join, Subtract and Group
    • Group data by a column
    • Compare two Datasets


Step by Step

  1. For each sample
    1. Create the BAM files (usually with BWA + GATK realigner)
    2. Create VCF of variant SNPs (mpileup or GATK)
    3. Run snpEff, compute the "indexAlt" (idxAlt) column, and extract that index to its own file
chrLAB	0	.	chrLAB	JH03_B8M2	0	.	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	chrLAB:0:chrLAB	chrLAB:0:chrLAB:JH03_B8M2
chrI	2323	.	C	T	471.72	.	AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	GT:AD:DP:GQ:PL	0/1:175,58:234:99:502,0,6142	DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	0/1	chrI:2323:C	chrI:2323:C:T
  2. Concatenate the idxAlt files from all samples into one file
chrLAB:0:chrLAB:JH01_B8M1
chrI:2323:C:T
chrI:2331:A:C
chrI:3981:A:T
...
  3. Group on c1, computing count(c1)
    • this produces one line for every SNP in any sample, with a count of how many samples it appears in
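  4. Filter to keep only the records where the count equals num_samples, i.e. the SNPs present in every sample (for example, condition c2==10 for the 10 samples in the output below)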
2micron:265:G:A	10
chrI:100399:G:C	10
chrI:101282:C:A	10
  5. For each sample, remove the common SNP rows
    1. Join, Subtract and Group > Compare two Datasets
      • Compare: snpEff+fullIdx [sample] (list of all variants for sample)
      • Using: c14 (the idxAlt)
      • Against: idxAlt with count=num_samples (common variants)
      • and column: c1 (the idxAlt)
      • To find: Non Matching rows of 1st dataset (keeps only the variants that are not common)
    2. Trim off the index columns to get back to a VCF
      • Text Manipulation > Cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
      • or any other subset of fields you want to report on
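
For readers who want to check the logic outside Galaxy, the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Galaxy tools themselves: the function names common_keys and remove_common are hypothetical, each sample is assumed to be already parsed into lists of fields with the idxAlt in column c14, and each key is counted once per sample (Galaxy's Group tool counts rows, which gives the same result when a sample has no duplicate keys).

```python
from collections import Counter


def common_keys(samples):
    """samples: dict mapping sample name -> list of idxAlt keys.

    Mirrors concatenate + group/count + filter: returns the keys
    whose count equals the number of samples.
    """
    counts = Counter()
    for keys in samples.values():
        counts.update(set(keys))  # count each key once per sample
    num_samples = len(samples)
    return {key for key, n in counts.items() if n == num_samples}


def remove_common(rows, common, key_col=13, keep_cols=11):
    """Drop rows whose idxAlt (c14, zero-based index 13) is common,
    then keep only the first 11 columns (c1..c11)."""
    return [row[:keep_cols] for row in rows if row[key_col] not in common]


# Tiny made-up example with two samples.
samples = {
    "s1": ["chrI:2323:C:T", "chrI:2331:A:C"],
    "s2": ["chrI:2331:A:C", "chrI:3981:A:T"],
}
common = common_keys(samples)
print(common)  # {'chrI:2331:A:C'}

# Build 14-column rows for s1: 5 VCF fields, 8 placeholders, then the idxAlt.
s1_rows = [
    ["chrI", "2323", ".", "C", "T"] + ["."] * 8 + ["chrI:2323:C:T"],
    ["chrI", "2331", ".", "A", "C"] + ["."] * 8 + ["chrI:2331:A:C"],
]
unique_rows = remove_common(s1_rows, common)
print(len(unique_rows), unique_rows[0][:5])  # 1 ['chrI', '2323', '.', 'C', 'T']
```

Keeping the common set in memory makes the per-sample pass a simple membership test, which corresponds to the "Non Matching rows" comparison in Galaxy.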