Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/

Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.

As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.

Thank you,

The Research Computing Team

Protocol for Filtering common SNPs from a set of alignments

Galaxy supports set operations on single columns. Thus, I build an index column for each sample formatted as "chr:pos:ref:alt", which I refer to as the indexAlt (abbreviated idxAlt below).

Tools used

Text Manipulation
- Compute
- Cut
- Concatenate Datasets tail-to-head
- Filter data on any column
Join, Subtract and Group
- Group data by a column
- Compare two Datasets

Step by Step

For each sample
1. Create the BAM files (usually with BWA + GATK realigner)
2. Create a VCF of variant SNPs (mpileup or GATK)
3. Run snpEff, compute the "indexAlt" column, and extract that index to its own file
  - I use the workflow SOP: index snpEffect with Sample name, which also computes several other files and indices needed for building a SNP vs Samples grid.

chrLAB	0	.	chrLAB	JH03_B8M2	0	.	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	JH03_B8M2	chrLAB:0:chrLAB	chrLAB:0:chrLAB:JH03_B8M2
chrI	2323	.	C	T	471.72	.	AC=1;AF=0.50;AN=2;BaseQRankSum=0.330;DP=234;Dels=0.00;FS=21.822;HRun=2;HaplotypeScore=4.4329;MQ=44.55;MQ0=0;MQRankSum=-10.441;QD=2.02;ReadPosRankSum=0.083;EFF=DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	GT:AD:DP:GQ:PL	0/1:175,58:234:99:502,0,6142	DOWNSTREAM(LOW|||YAL067C|CALC_BIOTYPE||YAL067C|),DOWNSTREAM(LOW|||YAL068W-A|CALC_BIOTYPE||YAL068W-A|),DOWNSTREAM(LOW|||YAL069W|CALC_BIOTYPE||YAL069W|),UPSTREAM(LOW|||YAL067W-A|CALC_BIOTYPE||YAL067W-A|),UPSTREAM(LOW|||YAL068C|CALC_BIOTYPE||YAL068C|)	0/1	chrI:2323:C	chrI:2323:C:T
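
For readers who want to check the index outside Galaxy, the last column of the sample row above can be reproduced with a few lines of Python. This is only a minimal sketch, not the Galaxy workflow itself: the function name index_alt is hypothetical, and it assumes the standard VCF column order (CHROM, POS, ID, REF, ALT) in a tab-separated line.

```python
def index_alt(vcf_line: str) -> str:
    """Build the 'chr:pos:ref:alt' key from one tab-separated VCF data line.

    Assumes the standard VCF layout: c1=CHROM, c2=POS, c4=REF, c5=ALT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    return f"{chrom}:{pos}:{ref}:{alt}"


# Shortened, made-up version of the sample row (INFO and genotype fields trimmed).
row = "chrI\t2323\t.\tC\tT\t471.72\t.\tDP=234\tGT\t0/1"
key = index_alt(row)
print(key)
```

Because the key includes the alternate allele, two samples that differ from the reference at the same position but with different alleles get different keys and are not treated as sharing a SNP.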

Concatenate the idxAlt files from all samples into one file

chrLAB:0:chrLAB:JH01_B8M1
chrI:2323:C:T
chrI:2331:A:C
chrI:3981:A:T
...

Group on c1, computing count(c1)
- this produces one line for every SNP in any sample, with a count of how many samples it appears in
Filter to select only records where count()=num_samples

2micron:265:G:A	10
chrI:100399:G:C	10
chrI:101282:C:A	10
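
The concatenate, group and filter steps above amount to counting how many samples each key occurs in and keeping the keys whose count equals the number of samples. The following Python sketch shows that logic under stated assumptions: common_keys is a hypothetical helper, each sample's idxAlt file is represented as a list of strings, and each key is assumed to appear at most once per sample (as in the Galaxy version, a key duplicated within one sample would inflate its count).

```python
from collections import Counter


def common_keys(per_sample_keys, num_samples=None):
    """Return the set of idxAlt keys that occur in every sample.

    per_sample_keys: one list of 'chr:pos:ref:alt' strings per sample.
    num_samples: defaults to the number of lists supplied.
    """
    if num_samples is None:
        num_samples = len(per_sample_keys)
    counts = Counter()
    for keys in per_sample_keys:
        counts.update(keys)  # same effect as concatenating, then count(c1)
    # Equivalent of the filter step: keep rows where the count equals num_samples.
    return {key for key, n in counts.items() if n == num_samples}


# Made-up example with two samples; the first entry of each list mimics
# the per-sample marker line, which is unique and so never counts as common.
sample_a = ["chrLAB:0:chrLAB:sampleA", "chrI:2323:C:T", "chrI:2331:A:C"]
sample_b = ["chrLAB:0:chrLAB:sampleB", "chrI:2323:C:T", "chrI:3981:A:T"]
common = common_keys([sample_a, sample_b])
print(sorted(common))
```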

For each sample, remove the common SNP rows
- I use the workflow SOP: VCF_fullidx remove common SNPs
1. Join, Subtract and Group > Compare two Datasets
  - Compare: snpEff+fullIdx [sample] (list of all variants for sample)
  - Using: c14 (the idxAlt)
  - Against: idxAlt with count=num_samples (common variants)
  - and column: c1 (the idxAlt)
2. Trim off the index columns to get back to a VCF
  - Text Manipulation > Cut > "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11"
  - or any other subset of fields you want to report on
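
The removal step can be sketched in Python as an anti-join followed by a column cut. This is an illustration rather than the workflow's implementation: remove_common and make_row are hypothetical names, the rows are assumed to be tab-separated with the idxAlt in column 14 as in the sample row earlier, and the sketch keeps only the rows that do not match the common list (in the Galaxy Compare tool this presumably corresponds to selecting the non-matching rows of the first dataset).

```python
def remove_common(rows, common, key_col=14, keep_cols=11):
    """Drop rows whose idxAlt is in `common`, then keep the first `keep_cols` columns.

    key_col and keep_cols are 1-based counts, matching Galaxy's c14 and c1..c11.
    """
    kept = []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if fields[key_col - 1] in common:
            continue  # SNP shared by every sample: filter it out
        kept.append("\t".join(fields[:keep_cols]))
    return kept


def make_row(chrom, pos, ref, alt):
    """Build a made-up 14-column row shaped like the snpEff+fullIdx output."""
    fields = [chrom, pos, ".", ref, alt, "50", ".", "DP=10", "GT", "0/1",
              "EFF", "0/1", f"{chrom}:{pos}:{ref}", f"{chrom}:{pos}:{ref}:{alt}"]
    return "\t".join(fields)


rows = [make_row("chrI", "2323", "C", "T"), make_row("chrI", "2331", "A", "C")]
filtered = remove_common(rows, common={"chrI:2323:C:T"})
print(len(filtered))
```

Changing keep_cols (or replacing the slice with an explicit list of column positions) corresponds to choosing a different subset of fields in the Cut step.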
