PublicDatasetsNgs: Difference between revisions

From Cheaha

Revision as of 18:58, 2 October 2012

Public Genomes and indexes for NGS

CCTS BMI, in conjunction with anyone who wishes to participate, maintains a public directory of genomic sequences and the indexes built from them. The indexes support a variety of NGS alignment and analysis programs, both from the command line and from Galaxy.

cheaha:/lustre/projects/public_datasets

Directory Structure

  • GIT:/??/scripts
  • /lustre/projects/public_datasets/ngs
    • curator: ngs-ccts group

Notes on what CCTS BMI wants to populate

# list available dbkeys
rsync --list-only --dirs -lptgoDzvP datacache.g2.bx.psu.edu::indexes
# list available .loc files
rsync --list-only --dirs -lptgoDzvP datacache.g2.bx.psu.edu::location
# list mm9 dir/file tree
rsync --list-only -azvP rsync://datacache.g2.bx.psu.edu/indexes/mm9 /lustre/project/public_datasets/ngs/galaxy-datacache/indexes
# pull down mm9
rsync -azvP rsync://datacache.g2.bx.psu.edu/indexes/mm9 /lustre/project/public_datasets/ngs/galaxy-datacache/indexes
# pull down all loc files
rsync -azvP rsync://datacache.g2.bx.psu.edu/location/ /lustre/project/public_datasets/ngs/galaxy-datacache/location/
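
The per-dbkey pull above could be repeated for several genomes with a small loop. The following is only a sketch: the function name mirror_cmd is hypothetical, and the extra dbkeys (mm10, hg19) are taken from the notes below and are assumed, not confirmed, to exist on the datacache server. Check them with the listing command first. The script only prints the rsync commands so they can be reviewed before anything is transferred.

```shell
#!/bin/sh
# Sketch: print one rsync command per dbkey (nothing is transferred here).
# Source URL and destination are copied from the commands above.
DATACACHE_URL="rsync://datacache.g2.bx.psu.edu/indexes"
DEST="/lustre/project/public_datasets/ngs/galaxy-datacache/indexes"

# mirror_cmd DBKEY: hypothetical helper that builds the pull command.
mirror_cmd() {
    printf 'rsync -azvP %s/%s %s\n' "$DATACACHE_URL" "$1" "$DEST"
}

# mm10 and hg19 are assumed to be valid dbkeys on the server.
for dbkey in mm9 mm10 hg19; do
    mirror_cmd "$dbkey"
done
```

Once the printed commands look right, a single one can be run with, for example, mirror_cmd mm9 | sh.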


  • UCSC
    • Hg19 – chromFa.tar.gz
    • Mm9, Mm10
    • Rn4, Rn5
    • Zv9=DanRer7
  • Ensembl (need chr); gtf + fasta
    • TreeShrew (don’t add chr!)
    • SacSer2, SacSer3 (so it will match SNPeff – ignore SM/RM – alphabetic order)
    • GRCh37
    • Celegans (rel66) WS220.release66 (ce10.66)
  • NCBI
    • VacvWR
    • Mycoplasma genomes
    • HCMV
  • Why UCSC?
    • Download all chrom, including UNKNOWN chroms
    • unknown chroms missing from Illumina
  • Ensembl missing “chr” prefix
  • GATK chromosome order
    • "new" GATK order
      • [dkcrossm] It seems that GATK has reversed its policy again on the chromosome order for humans (it doesn’t say anything about other species’ chromosome orders, so that is still left up in the air). I’ve copied their section on human chromosome order for your reference:
      • If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT. The order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
    • old Order GATK Human
      • Mitochondria, 1…26, X, Y, Unknown by chrom + num order
  • BWA .62 version
  • Bowtie 2 (though not in galaxy)
  • SNPeff 2.1, SNPeff 3.0 download tools
  • Galaxy rsync
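
The "new" GATK ordering quoted above (1 through 22, then X, Y, MT) can be checked against a list of contig names before running GATK. This is a minimal sketch, not part of GATK: check_order is a hypothetical helper, it assumes b37-style names without a "chr" prefix, and it ignores any contig outside the canonical list (for example unplaced contigs).

```shell
#!/bin/sh
# check_order: read contig names on stdin, one per line (for example the
# first column of a .fai index), and report whether the canonical contigs
# appear in the order 1..22, X, Y, MT. Other contigs are ignored.
check_order() {
    awk '
        BEGIN {
            for (i = 1; i <= 22; i++) rank[i] = i
            rank["X"] = 23; rank["Y"] = 24; rank["MT"] = 25
            last = 0; ok = 1
        }
        ($1 in rank) {
            if (rank[$1] < last) ok = 0   # a contig appeared too late
            last = rank[$1]
        }
        END { print (ok ? "ordered" : "misordered"); exit !ok }
    '
}

# Demo: a karyotypically sorted list passes.
printf '%s\n' 1 2 3 10 X Y MT | check_order   # prints "ordered"
```

A lexicographically sorted list (1, 10, 11, ..., 2) is reported as misordered and the function returns a non-zero exit status. With a real reference, the names could be fed in with something like cut -f1 reference.fa.fai | check_order, where reference.fa.fai is a placeholder file name.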

  • GATK resource bundle
    • dbSNP
    • 1000 Genomes
    • Mills
  • GTF downloads
  • Bedtools gff3 -> gtf
  • Ensembl needs “chr” prefix
  • UCSC needs GTF + refGene, then my script to get a good GTF
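
Adding the “chr” prefix to Ensembl files, as the notes call for, could look roughly like the sketch below. The helper names add_chr_fasta and add_chr_gtf are hypothetical. The sketch only prepends "chr" to every sequence name; it does not rename anything else, so the Ensembl mitochondrial contig MT would become chrMT rather than UCSC's chrM. It should be applied only where the notes ask for it (not, for example, to TreeShrew).

```shell
#!/bin/sh
# add_chr_fasta: prefix every FASTA header with "chr" (">1 ..." -> ">chr1 ...").
# Reads stdin, writes stdout; sequence lines pass through unchanged.
add_chr_fasta() {
    sed 's/^>/>chr/'
}

# add_chr_gtf: prefix the seqname (first, tab-separated) column with "chr".
# Lines starting with "#" are treated as comments and left untouched.
add_chr_gtf() {
    awk 'BEGIN { FS = OFS = "\t" } /^#/ { print; next } { $1 = "chr" $1; print }'
}
```

Usage would be along the lines of add_chr_fasta < genome.fa > genome.chr.fa and add_chr_gtf < genes.gtf > genes.chr.gtf, where the file names are placeholders.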