Q&A added during the NGS variant analysis training2
Please find here questions raised during the session with some element of answer found with Dr G
[ Main_Page | NGS_data_analysis ]

Contents
- 1 Which mapper should I use for DNA sequencing?
- 2 How can I run IGV from the webstart?
- 3 How can I convert 1-based VCF format to 0-based BED?
- 4 How can I generate mappability tracks for other genomes?
- 5 How are multi-allelic gene regions (MHC TCR Ig ...) handled in the most recent genome builds (hg19/mm10)
- 6 How can I generate a consensus sequence from a reference fasta file and a VCF variant file?
Which mapper should I use for DNA sequencing?
- Bowtie2 or BWA? what is best as a matter of accuracy and speed
- Do I want speed in the first place or do I want maximal accuracy at the cost of few more hours (days?) ?
- A nice Sequanswer page to follow this 'hot' discussion http://seqanswers.com/forums/showthread.php?t=15200[1]
- Heng Li's benchmarks (cited in the former link) comparing mappers on hard data http://lh3lh3.users.sourceforge.net/alnROC.shtml[2].
- Finally: An online short publication by William Langdon about the same issue (link[3]) and his conclusion: (quoted: BWA finds more matches than the other three tools (Table 1, column “% matched”). However the difference be- tween BWA and Bowtie2 is only 0.2% and BWA takes more than three times as long. The fastest program is Bowtie but it is almost the same speed as Bowtie2GP and find 5-6% fewer matches than the other tools. Bowtie2GP and Bowtie2 produce very similar matches but Bowtie2GP is 26% faster.)
In order to make our own mind, we applied Bowtie2 to the reads previously analyzed with BWA. Our results are presented in NGS_Exercise.3c.
corrigendum: Bowtie does not map RNASeq reads across introns as was said during the first day session: tophat does it by using bowtie in weird ways. Read more in Biostar[4](my mistake, sorry!)
How can I run IGV from the webstart?
Download it from the Broad site (http://www.broadinstitute.org/software/igv/download) as a '.jnlp' applet and run it with your locally installed Java (get the file from the BITS server: http://data.bits.vib.be/pub/trainingen/NGSDNA2013/varia/igv_lm.jnlp)
How can I convert 1-based VCF format to 0-based BED?
A VCF to bed converter is available in BEDOPS: https://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/vcf2bed.html
more information about 0/1-base systems

ATG is (3-6] in 0-based and [4-6] in 1-based open coordinates
Interesting discussions about advantages of each system at BioStar <http://www.biostars.org/p/6373/> and <http://www.biostars.org/p/84686/>BEDOPS command details
vcf2bed --help
Usage:
/usr/local/bin/vcf2bed [ --help ] [ --do-not-sort | --max-mem <value> ] < foo.vcf
Options:
--help Print this help message and exit
--do-not-sort Do not sort converted data with BEDOPS sort-bed
--max-mem <value> Sets aside <value> memory for sorting BED output. For example,
<value> can be 8G, 8000M or 8000000000 to specify 8 GB of memory
(default: 2G).
About:
This script converts 1-based, closed [a, b] VCF v4 data from standard input
into 0-based, half-open [a-1, b) extended BED, sent to standard output.
This conversion script relies on the VCF v4 format, with its
specifications outlined here by the 1000 Genomes project:
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
-- The 'meta-information' (starting with '##') and 'header'
lines (starting with '#') are discarded.
-- The header line must be tab-delimited. The eight, fixed mandatory
columns are converted to BED data as follows:
- Data in the #CHROM column are mapped to the first column of the BED output
- The POS column is mapped to the second and third BED columns
- The ID and QUAL columns are mapped to the fourth and fifth BED columns, respectively
- The REF, ALT, FILTER and INFO are mapped to the sixth through ninth BED columns, respectively
-- If present, genotype data in FORMAT and subsequence sample IDs
are placed into tenth and subsequent columns
-- Data rows must also be tab-delimited.
-- Any missing data or non-standard delimiters may cause
problems. It may be useful to validate the VCF v4 input
before conversion.
Example:
$ /usr/local/bin/vcf2bed < foo.vcf > sorted-foo.vcf.bed
We make no assumptions about sort order from converted output. Apply
the usage case displayed to pass data to the BEDOPS sort-bed application,
which generates lexicographically-sorted BED data as output.
If you want to skip sorting, use the --do-not-sort option:
$ /usr/local/bin/vcf2bed --do-not-sort < foo.vcf > unsorted-foo.vcf.bed
How can I generate mappability tracks for other genomes?
The tool used to create the human (alignability=mappability) tacks is called gem-library and can be run for any other genome. Please start with this page and install what you need accordingly (<http://algorithms.cnag.cat/wiki/FAQ:Generating_mappability_tracks>).
Alignability/Mappability UCSC info
A tutorial page How_to_create_a_mappability_track was added as how-to in order to guide you through the use of GEM.
How are multi-allelic gene regions (MHC TCR Ig ...) handled in the most recent genome builds (hg19/mm10)
A question was posted to SeqAnswer and already triggered an initial answer.
seqanswer thread
I am wondering to which extend this is still true in the hg19/B37 builds and how many genes/regions might suffer from this artifactual simplification. Please correct me if this is wrong as I am not 100% certain.
This question aims at informing and maybe warning people performing NGS that their reference might not be the true genome in given analyzed cell types.
Thanks to Immunologists or specialists aware of the status of MHC regions and other similar hypervariable loci (Ig genes, TCR...) for their lights.
Cheers,Please follow the link to access the information. http://seqanswers.com/forums/showthread.php?p=127968#post127968
How can I generate a consensus sequence from a reference fasta file and a VCF variant file?
A post in the samtools list by Colin Hercus recently covered this
"You can use novoutil iupac to generate a new consensus sequence from an original fasta and a vcf file. There's options for selecting variant quality and whether to include indel calls. It can also lift over coordinates on a bed file when applying inserts.
It's free to use and part of Novoalign download at http://www.novocraft.com."
References:
- ↑ http://seqanswers.com/forums/showthread.php?t=15200
- ↑ http://lh3lh3.users.sourceforge.net/alnROC.shtml
- ↑ http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/Langdon_2013_GECCOlb.pdf
- ↑ https://www.biostars.org/p/63477/
[ Main_Page | NGS_data_analysis ]