Latest revision as of 20:19, 18 December 2012

Public Genomes and indexes for NGS

cheaha:/lustre/project/public_datasets/ngs

CCTS BMI, in conjunction with anyone who wishes to participate, maintains a public directory of genomic sequences and the indexes built from them. These are intended for a variety of NGS alignment and analysis programs, both from the command line and within the Galaxy environment.

See also NgsCcts for programs that build and use these indices and databases.

WARNING: these datasets are currently under construction. Contact curtish A uab.edu with questions.

Directory Structure

GIT:/??/scripts
/lustre/project/public_datasets/ngs

 * curator: ngs-ccts group

Notes on what CCTS BMI wants to populate

Order chroms
Galaxy Rsync
- see http://wiki.g2.bx.psu.edu/Admin/Data%20Integration

# list available dbkeys
rsync --list-only --dirs -lptgoDzvP datacache.g2.bx.psu.edu::indexes
# list available dbkeys for "microbes"
rsync --list-only --dirs -lptgoDzvP datacache.g2.bx.psu.edu::indexes/microbes/
# list available .loc files
rsync --list-only --dirs -lptgoDzvP datacache.g2.bx.psu.edu::location

# list mm9 dir/file tree
rsync --list-only -azvP rsync://datacache.g2.bx.psu.edu/indexes/mm9 /lustre/project/public_datasets/ngs/galaxy-datacache/indexes
# pull down mm9 (all)
rsync -azvP rsync://datacache.g2.bx.psu.edu/indexes/mm9 /lustre/project/public_datasets/ngs/galaxy-datacache/indexes
# pull down mm9 w/o alignment files
rsync --exclude align/ -azvP rsync://datacache.g2.bx.psu.edu/indexes/mm9 /lustre/project/public_datasets/ngs/galaxy-datacache/indexes

# pull down all loc files
rsync -azvP rsync://datacache.g2.bx.psu.edu/location/ /lustre/project/public_datasets/ngs/galaxy-datacache/location/

UCSC
- Hg19 – chromFa.tar.gz
- Mm9, Mm10
- Rn4, Rn5
- Zv9=DanRer7
Ensembl (need chr); gtf + fasta
- TreeShrew (don’t add chr!)
- SacSer2, SacSer3 (so it will match SNPeff – ignore SM/RM – alphabetic order)
- GRCh37
- Celegans (rel66) WS220.release66 (ce10.66)
NCBI
- VacvWR
- Mycoplasma genomes
- HCMV

why UCSC ?
- Download all chrom, including UNKNOWN chroms
- unknown chroms missing from Illumina
Ensembl missing “chr” prefix
GATK chromosome order
- "new" GATK order
  - [dkcrossm] For humans, if you use UCSC's genome, then mitochondria goes first, then numerically (chrMt, chr1, chr2...chrX, chrY). If you use another human genome (from 1000 genome project), then it goes in numerical order with mitochondria after the chrY (chr1, chr2, chr3...chrY, chrMt).
    - For non-humans, it doesn’t seem to matter. I've tried it both ways. In a C. elegans study, I had mitochondria first. In a yeast study, I had mitochondria last, and they both worked. So, I guess non-human chromosome order isn't critical.
  - [dkcrossm] It seems that GATK has reversed their policy again on the chromosome order for humans (it doesn’t say anything about the other species chromosome orders, so that is still left up in the air). I’ve copied their section on human chromosome order for your reference:
  - If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references' canonical orderings. These are defined by historical karyotyping of largest to smallest chromosomes, followed by the X, Y, and MT. The order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
- old Order GATK Human
  - Mitochondria, 1…26, X, Y, Unknown by chrom + num order
BWA .62 version
Bowtie 2 (though not in galaxy)
SNPeff2.1 SNPeff3.0 download tools
Galaxy rsync

GATK – resource bundle.

- dbSNP
- 1000genome
- Mils
GTF downloads
Bedtools gff3 -> gtf
Ensembl needs chr
UCSC needs gtf + refgene then my script to get good GTF
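
Because the Ensembl downloads lack the "chr" prefix used by the UCSC-style files, the prefix has to be added to both the FASTA headers and the GTF seqname column. A minimal sketch follows. The function names are hypothetical, this is not the GTF script mentioned above, and it should be skipped for genomes such as TreeShrew where the notes say not to add chr.

```shell
# Add a "chr" prefix to FASTA header lines (lines starting with ">").
# Sequence lines pass through unchanged.
add_chr_fasta() {
    sed -e 's/^>/>chr/'
}

# Add a "chr" prefix to the first (seqname) column of a tab-separated GTF.
# Comment lines starting with "#" pass through unchanged.
# NOTE: this turns Ensembl "MT" into "chrMT", not UCSC "chrM"; renaming the
# mitochondrial contig is left out of this sketch.
add_chr_gtf() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
               { $1 = "chr" $1; print }'
}
```

Both functions are filters, so they can be used as `add_chr_fasta < in.fa > out.fa` and `add_chr_gtf < in.gtf > out.gtf` (file names are placeholders).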

data sources

galaxy data-cache

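Galaxy docs on builds.txt: search docs, http://wiki.galaxyproject.org/Admin/Config?action=fullsearch&context=180&value=builds.txt&fullsearch=Text
.2bit to .fa conversion for datacache: http://www.mail-archive.com/galaxy-dev@lists.bx.psu.edu/msg07034.html and https://bitbucket.org/gvl/loc_files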
directory structure
- XXX = {mm9, hg19, etc}
  - bowtie_index/XXX.*
    - cs/XXX.*
  - bwa_index/XXX.*
  - chrom/chr*.fa
  - download/(date)/md5sum*
  - liftover/XXXToYYY.over.chain
  - picard_index/XXX.{dict,fai}
  - sam_index/XXX.fai (identical to .fai in picard_index!)
  - seq/XXX.2bit
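
A quick way to see which parts of this layout exist for a given build is to test for each subdirectory. The sketch below assumes that each build lives in its own directory named after the dbkey (BASE/XXX/...), which the outline suggests but does not state. The name missing_index_dirs is made up for illustration.

```shell
# Print every subdirectory from the layout above that is missing under
# BASE/DBKEY. Only directory presence is checked, not the files inside.
# ASSUMPTION: builds are stored as BASE/DBKEY/<subdir>.
missing_index_dirs() {
    base=$1
    dbkey=$2
    for sub in bowtie_index bowtie_index/cs bwa_index chrom download \
               liftover picard_index sam_index seq; do
        [ -d "$base/$dbkey/$sub" ] || printf '%s\n' "$dbkey/$sub"
    done
}
```

Called as `missing_index_dirs /lustre/project/public_datasets/ngs/galaxy-datacache/indexes mm9`, it prints nothing when all nine directories are present.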

