Create a mappability track
[ Main_Page | NGS_data_analysis ]
Create a custom 'GEM-library' Alignability (mappability) track
(©Derrien et al in 2012, see reference below)
this page relates to Create a GC content track
Contents
Introduction
Results presented here are produced thanks to the method and tools developed by Derrien et al in 2012 [2]. We recommend users to fully read this technical paper in order to assimilate subtleties associated with mappability issues and identify limits of short read alignment.
We will create here five kmer alignability track for 16 nuclear chromosomes and the mitochondrial genome of S. cerevisiae strain S288C (sacCer3 obtained from the UCSC FTP repository). This tutorial was mainly built from the developer's web material [3].
Any other genome for which the chromosome sequences can be assembles in fasta format and gene annotations are available (optional but important for visualisation) can be processed similarly
Updated script can be found HERE
Download reference data
Download the yeast reference genome data from the UCSC table repository
- UCSC data can be obtained from their FTP server http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3[4]
- other files are built from the table browser with interaction in your browser
We use below Kent tools (unix version from [5], also available for macOSX[6]) used by the UCSC team to produce the Genome browser system. Some of these tools should be present on your machine in order to repeat the code below.
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} # download the 2bit file wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.2bit # extract the fasta data with Kent's 'twoBitToFa' twoBitToFa sacCer3.2bit SacCer3.fa # also create a two-column list of chromosomes names and sizes with Kent's 'faSize' faSize sacCer3.fa -detailed > sacCer3.sizes
download additional SacCer3 annotations
Get the SGD gene table as well as the GC distribution across the whole genome from the UCSC-table (http://genome.ucsc.edu/cgi-bin/hgTables[7]).
the chromosome names should be identical between the 2bit-fasta and the gene table in order for IGV to display them together
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} # get the sgdGene table for other purposes (this is not a BED file) wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/database/sgdGene.txt.gz # also get it in the BED-format ' SacCer3_sgdGenes.bed' using the Table browser (see screen-shot below) # get and adapt GC% data in variable steps wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/gc5Base/sacCer3.gc5Base.varStep.txt.gz # the file: sacCer3.gc5Base.varStep.txt.gz is the raw data used to encode # the gc5Base track on sacCer3. This file was produced from the sacCer3.2bit # sequence with the kent source tree utility hgGcPercent (see below)
create a 5base step file from the variable step with Kent's hgGcPercent
We can produce the GC% wig file directly from the 2bit genome and name it slightly differently to the downloaded version.
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} hgGcPercent -wigOut \ -doGaps \ -file=stdout \ -win=5 \ sacCer3 \ sacCer3.2bit | \ gzip > sacCer3.gc5Base.varStep.wig.gz
create a bigwig version for rapid visualization
The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} wigToBigWig <(bgzip -cd sacCer3.gc5Base.varStep.wig.gz) \ sacCer3.sizes sacCer3.gc5Base.bw
Install and run the GEM library tools
Tools were obtained and installed from : https://sourceforge.net/projects/gemlibrary/files/gem-library/[8] (the core-2 version was used on the local machine!)
A new version is on the make as of Nov2018 - please read more at https://github.com/smarco/gem3-mapper
The new version Gem3-mapper is not compatible yet with this workflow, you need to use the old version for that and refer to this post https://evodify.com/gem-mappability/
- The current link for the older version is: https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/
create a 'gem' index
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} pref="sacCer3" reference="sacCer3.fa" idxpref="sacCer3_index" thr=8; # use 8 cores gem-indexer -T ${thr} -c dna -i ${reference} -o ${idxpref}
(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com> (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com> For the terms of use, run the program with the option --show-license. ************************************************************************ * WARNING: this is a beta version, provided for testing purposes only; * * check for updates at <http://gemlibrary.sourceforge.net>. * ************************************************************************ Creating sequence and location files... done. Computing DNA BWT (likely to take long)... done. Generating index (likely to take long)... done. Cleaning up... done. -rw-rw-r-- 1 splaisan splaisan 35M Dec 18 14:15 SacCer3_index.gem -rw-rw-r-- 1 splaisan splaisan 1.9K Dec 18 14:15 SacCer3_index.log
compute Alignability for several K-mer lengths
The created index can now be used to compute mapability across the full genome. The alignability / mappability is computed for kmers between 50 and 250 bases by steps of 50 bases and the results are immediately converted to WIG and BigWIG format for IGV visualization.
create for 5 different read sizes
We selected increasing sizes corresponding to read material available on public repositories like SRA and showing the effect of read size on the uniqueness of subsequent short read alignment.
our tracks are based on single-end reads, paired reads are obviously more easily uniquely mapped due to their increased span size that can reach-out from difficult regions
The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage
# create a folder to store all results basefolder="~/Desktop/alignability/sacCer3_alignability" mkdir –p ${basefolder} && cd ${basefolder} pref="sacCer3" idxpref="sacCer3_index" thr=8; # use 8 cores for kmer in $(seq 50 50 250); do # compute mappability data gem-mappability -T ${thr} -I ${idxpref}.gem -l ${kmer} -o ${pref}_${kmer} # convert results to wig and bigwig gem-2-wig -I ${idxpref}.gem -i ${pref}_${kmer}.mappability -o ${pref}_${kmer} wigToBigWig ${pref}_${kmer}.wig ${pref}.sizes ${pref}_${kmer}.bw done
Visualize results in IGV
A screen-shot is provided to illustrate the results using IGV and the Broad sacCer3 loaded genome. Other yeast genomes available in IGV include:
Other strains are available from:
- http://genomevolution.org/wiki/index.php/SGRP:_Sanger_Institute_Yeast_Genomes[10]
- (e.g. Y55) http://genomevolution.org/CoGe/services/JBrowse/service.pl/sequence/9136[11]
Reads for yeast genome can be obtained from the Sanger institute at:
Legend:
- top track represents gene density across the genome for SGD genes
- next 5 tacks show alignability / mappability for increasing kmer length.
- bottom track shows GC% across the genome in 5base bins (scale set between 30 and 50)
Conclusion
Building a custom mappability track is easy and can be done on a laptop for a small genome for which users have the fasta sequence (therefore also for 'contigs' or 'scaffolds' in the case of unfinished genomes). We present here code and results for the current yeast strain that can be adapted for other genomes or strains. The analysis of alignabilty/mappability for paired reads is much more complex and is not produced here, please read the publication for more information.
Feed back on this page is welcome and can be sent to bits@vib.be
Related reading
A recent paper evaluates the effect of read length on the unicity of mapping (BMC Bioinformatics 2014, 15:2 doi:10.1186/1471-2105-15-2 [13])
download exercise files
Download demo files here
References:
- ↑ https://www.biorxiv.org/content/10.1101/611160v1
- ↑
Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guigó, Paolo Ribeca
Fast computation and applications of genome mappability.
PLoS One: 2012, 7(1);e30377
[PubMed:22276185] ##WORLDCAT## [DOI] (I p) - ↑ http://algorithms.cnag.cat/wiki/The_GEM_library
- ↑ http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3
- ↑ http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64
- ↑ http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386"
- ↑ http://genome.ucsc.edu/cgi-bin/hgTables
- ↑ https://sourceforge.net/projects/gemlibrary/files/gem-library/
- ↑ https://github.com/smarco/gem3-mapper
- ↑ http://genomevolution.org/wiki/index.php/SGRP:_Sanger_Institute_Yeast_Genomes
- ↑ http://genomevolution.org/CoGe/services/JBrowse/service.pl/sequence/9136
- ↑ http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html
- ↑
Wentian Li, Jan Freudenberg, Pedro Miramontes
Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.
BMC Bioinformatics: 2014, 15;2
[PubMed:24386976] ##WORLDCAT## [DOI] (I e)
[ Main_Page | NGS_data_analysis ]