Q&A added during the NGS variant analysis training2

From BITS wiki
Jump to: navigation, search

Please find here questions raised during the session with some element of answer found with Dr G

[ Main_Page | NGS_data_analysis ]



qanda.png

Which mapper should I use for DNA sequencing?

  • Bowtie2 or BWA? what is best as a matter of accuracy and speed
  • Do I want speed in the first place or do I want maximal accuracy at the cost of few more hours (days?) ?
  • A nice Sequanswer page to follow this 'hot' discussion http://seqanswers.com/forums/showthread.php?t=15200[1]
  • Heng Li's benchmarks (cited in the former link) comparing mappers on hard data http://lh3lh3.users.sourceforge.net/alnROC.shtml[2].
  • Finally: An online short publication by William Langdon about the same issue (link[3]) and his conclusion: (quoted: BWA finds more matches than the other three tools (Table 1, column “% matched”). However the difference be- tween BWA and Bowtie2 is only 0.2% and BWA takes more than three times as long. The fastest program is Bowtie but it is almost the same speed as Bowtie2GP and find 5-6% fewer matches than the other tools. Bowtie2GP and Bowtie2 produce very similar matches but Bowtie2GP is 26% faster.)

Handicon.png In order to make our own mind, we applied Bowtie2 to the reads previously analyzed with BWA. Our results are presented in NGS_Exercise.3c.


Technical.png corrigendum: Bowtie does not map RNASeq reads across introns as was said during the first day session: tophat does it by using bowtie in weird ways. Read more in Biostar[4](my mistake, sorry!)

How can I run IGV from the webstart?

Download it from the Broad site (http://www.broadinstitute.org/software/igv/download) as a '.jnlp' applet and run it with your locally installed Java (get the file from the BITS server: http://data.bits.vib.be/pub/trainingen/NGSDNA2013/varia/igv_lm.jnlp)

How can I convert 1-based VCF format to 0-based BED?

A VCF to bed converter is available in BEDOPS: https://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/vcf2bed.html

more information about 0/1-base systems

Picture showing the difference between 0-based (top) and 1-based (bottom) coordinates


interbase.png

ATG is (3-6] in 0-based and [4-6] in 1-based open coordinates

Interesting discussions about advantages of each system at BioStar <http://www.biostars.org/p/6373/> and <http://www.biostars.org/p/84686/>

BEDOPS command details

install BEDOPS from: http://bedops.readthedocs.org/en/latest/content/installation.html

vcf2bed --help
Usage:
  /usr/local/bin/vcf2bed [ --help ] [ --do-not-sort | --max-mem <value> ] < foo.vcf

Options:
  --help              Print this help message and exit
  --do-not-sort       Do not sort converted data with BEDOPS sort-bed
  --max-mem <value>   Sets aside <value> memory for sorting BED output. For example,
                      <value> can be 8G, 8000M or 8000000000 to specify 8 GB of memory
                      (default: 2G).

About:
  This script converts 1-based, closed [a, b] VCF v4 data from standard input
  into 0-based, half-open [a-1, b) extended BED, sent to standard output.

  This conversion script relies on the VCF v4 format, with its
  specifications outlined here by the 1000 Genomes project:

  http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

  -- The 'meta-information' (starting with '##') and 'header'
     lines (starting with '#') are discarded.

  -- The header line must be tab-delimited. The eight, fixed mandatory
     columns are converted to BED data as follows:

     - Data in the #CHROM column are mapped to the first column of the BED output
     - The POS column is mapped to the second and third BED columns
     - The ID and QUAL columns are mapped to the fourth and fifth BED columns, respectively
     - The REF, ALT, FILTER and INFO are mapped to the sixth through ninth BED columns, respectively

  -- If present, genotype data in FORMAT and subsequence sample IDs
     are placed into tenth and subsequent columns

  -- Data rows must also be tab-delimited.

  -- Any missing data or non-standard delimiters may cause
     problems. It may be useful to validate the VCF v4 input
     before conversion.

Example:
  $ /usr/local/bin/vcf2bed < foo.vcf > sorted-foo.vcf.bed

  We make no assumptions about sort order from converted output. Apply
  the usage case displayed to pass data to the BEDOPS sort-bed application,
  which generates lexicographically-sorted BED data as output.

  If you want to skip sorting, use the --do-not-sort option:

  $ /usr/local/bin/vcf2bed --do-not-sort < foo.vcf > unsorted-foo.vcf.bed

How can I generate mappability tracks for other genomes?

The tool used to create the human (alignability=mappability) tacks is called gem-library and can be run for any other genome. Please start with this page and install what you need accordingly (<http://algorithms.cnag.cat/wiki/FAQ:Generating_mappability_tracks>).

Alignability/Mappability UCSC info

Alignability: The CRG Alignability tracks display how uniquely k-mer sequences align to a region of the genome. To generate the data, the GEM-mappability program has been employed. The method is equivalent to mapping sliding windows of k-mers (where k has been set to 36, 40, 50, 75 or 100 nts to produce these tracks) back to the genome using the GEM mapper aligner (up to 2 mismatches were allowed in this case). For each window, a mapability score was computed (S = 1/(number of matches found in the genome): S=1 means one match in the genome, S=0.5 is two matches in the genome, and so on). The CRG Alignability tracks were generated independently of the ENCODE project, in the framework of the GEM (GEnome Multitool) project.

A tutorial page How_to_create_a_mappability_track was added as how-to in order to guide you through the use of GEM.

How are multi-allelic gene regions (MHC TCR Ig ...) handled in the most recent genome builds (hg19/mm10)

A question was posted to SeqAnswer and already triggered an initial answer.

seqanswer thread

I remember that former versions of the reference genome (at least hg18) used to lack some genes due to compression of the sequences to one prototype in the case of closely located repeated 'genes'.

I am wondering to which extend this is still true in the hg19/B37 builds and how many genes/regions might suffer from this artifactual simplification. Please correct me if this is wrong as I am not 100% certain.

This question aims at informing and maybe warning people performing NGS that their reference might not be the true genome in given analyzed cell types.

Thanks to Immunologists or specialists aware of the status of MHC regions and other similar hypervariable loci (Ig genes, TCR...) for their lights.

Cheers,

Please follow the link to access the information. http://seqanswers.com/forums/showthread.php?p=127968#post127968

How can I generate a consensus sequence from a reference fasta file and a VCF variant file?

A post in the samtools list by Colin Hercus recently covered this

"You can use novoutil iupac to generate a new consensus sequence from an original fasta and a vcf file. There's options for selecting variant quality and whether to include indel calls. It can also lift over coordinates on a bed file when applying inserts.

It's free to use and part of Novoalign download at http://www.novocraft.com."



References:
  1. http://seqanswers.com/forums/showthread.php?t=15200
  2. http://lh3lh3.users.sourceforge.net/alnROC.shtml
  3. http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/Langdon_2013_GECCOlb.pdf
  4. https://www.biostars.org/p/63477/

[ Main_Page | NGS_data_analysis ]