Create a mappability track

From BITS wiki
Jump to: navigation, search


[ Main_Page | NGS_data_analysis ]


Create a custom 'GEM-library' Alignability (mappability) track


Derrien_fig3.png

(©Derrien et al in 2012, see reference below)

Handicon.png Another tool GenMap [1] can also produce mappability tracks and is faster than gem

Handicon.png this page relates to Create a GC content track

Introduction

Results presented here are produced thanks to the method and tools developed by Derrien et al in 2012 [2]. We recommend users to fully read this technical paper in order to assimilate subtleties associated with mappability issues and identify limits of short read alignment.

We will create here five kmer alignability track for 16 nuclear chromosomes and the mitochondrial genome of S. cerevisiae strain S288C (sacCer3 obtained from the UCSC FTP repository). This tutorial was mainly built from the developer's web material [3].

Handicon.png Any other genome for which the chromosome sequences can be assembles in fasta format and gene annotations are available (optional but important for visualisation) can be processed similarly

Handicon.png Updated script can be found HERE

Download reference data

Download the yeast reference genome data from the UCSC table repository

We use below Kent tools (unix version from [5], also available for macOSX[6]) used by the UCSC team to produce the Genome browser system. Some of these tools should be present on your machine in order to repeat the code below.

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
# download the 2bit file
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.2bit
 
# extract the fasta data with Kent's 'twoBitToFa'
twoBitToFa sacCer3.2bit SacCer3.fa
 
# also create a two-column list of chromosomes names and sizes with Kent's 'faSize'
faSize sacCer3.fa -detailed > sacCer3.sizes

download additional SacCer3 annotations

Get the SGD gene table as well as the GC distribution across the whole genome from the UCSC-table (http://genome.ucsc.edu/cgi-bin/hgTables[7]).

Handicon.png the chromosome names should be identical between the 2bit-fasta and the gene table in order for IGV to display them together

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
# get the sgdGene table for other purposes (this is not a BED file)
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/database/sgdGene.txt.gz
 
# also get it in the BED-format ' SacCer3_sgdGenes.bed' using the Table browser (see screen-shot below)
 
# get and adapt GC% data in variable steps
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/gc5Base/sacCer3.gc5Base.varStep.txt.gz
 
# the file: sacCer3.gc5Base.varStep.txt.gz is the raw data used to encode
# the gc5Base track on sacCer3.  This file was produced from the sacCer3.2bit
# sequence with the kent source tree utility hgGcPercent (see below)


sgdgenes_table.png

create a 5base step file from the variable step with Kent's hgGcPercent

We can produce the GC% wig file directly from the 2bit genome and name it slightly differently to the downloaded version.

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
hgGcPercent -wigOut \
 	-doGaps \
 	-file=stdout \
 	-win=5 \
 	sacCer3 \
 	sacCer3.2bit | \
 	gzip > sacCer3.gc5Base.varStep.wig.gz

create a bigwig version for rapid visualization

Technical.png The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
wigToBigWig <(bgzip -cd sacCer3.gc5Base.varStep.wig.gz) \
	sacCer3.sizes sacCer3.gc5Base.bw

Install and run the GEM library tools

Tools were obtained and installed from : https://sourceforge.net/projects/gemlibrary/files/gem-library/[8] (the core-2 version was used on the local machine!)

Technical.png A new version is on the make as of Nov2018 - please read more at https://github.com/smarco/gem3-mapper

[9]

Technical.png The new version Gem3-mapper is not compatible yet with this workflow, you need to use the old version for that and refer to this post https://evodify.com/gem-mappability/

Cli tools.png there is not yet a version for apple OSX and the source is not available to build locally

create a 'gem' index

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
pref="sacCer3"
reference="sacCer3.fa"
idxpref="sacCer3_index"
thr=8; # use 8 cores
 
gem-indexer -T ${thr} -c dna -i ${reference} -o ${idxpref}
(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com>
 (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
*          check for updates at <http://gemlibrary.sourceforge.net>.   *
************************************************************************
Creating sequence and location files... done.
Computing DNA BWT (likely to take long)... done.
Generating index (likely to take long)... done.
Cleaning up... done.
 
-rw-rw-r--  1 splaisan splaisan  35M Dec 18 14:15 SacCer3_index.gem
-rw-rw-r--  1 splaisan splaisan 1.9K Dec 18 14:15 SacCer3_index.log

compute Alignability for several K-mer lengths

The created index can now be used to compute mapability across the full genome. The alignability / mappability is computed for kmers between 50 and 250 bases by steps of 50 bases and the results are immediately converted to WIG and BigWIG format for IGV visualization.

create for 5 different read sizes

We selected increasing sizes corresponding to read material available on public repositories like SRA and showing the effect of read size on the uniqueness of subsequent short read alignment.

Cli tools.png our tracks are based on single-end reads, paired reads are obviously more easily uniquely mapped due to their increased span size that can reach-out from difficult regions

Technical.png The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
pref="sacCer3"
idxpref="sacCer3_index"
thr=8; # use 8 cores
 
for kmer in $(seq 50 50 250); do
 
  # compute mappability data
  gem-mappability -T ${thr} -I ${idxpref}.gem -l ${kmer} -o ${pref}_${kmer}
 
  # convert results to wig and bigwig
  gem-2-wig -I ${idxpref}.gem -i ${pref}_${kmer}.mappability -o ${pref}_${kmer}
  wigToBigWig ${pref}_${kmer}.wig ${pref}.sizes ${pref}_${kmer}.bw
 
done

Visualize results in IGV

A screen-shot is provided to illustrate the results using IGV and the Broad sacCer3 loaded genome. Other yeast genomes available in IGV include:

Other strains are available from:

Reads for yeast genome can be obtained from the Sanger institute at:


sacCer3_mappability.png

Legend:

  • top track represents gene density across the genome for SGD genes
  • next 5 tacks show alignability / mappability for increasing kmer length.
  • bottom track shows GC% across the genome in 5base bins (scale set between 30 and 50)

Conclusion

Building a custom mappability track is easy and can be done on a laptop for a small genome for which users have the fasta sequence (therefore also for 'contigs' or 'scaffolds' in the case of unfinished genomes). We present here code and results for the current yeast strain that can be adapted for other genomes or strains. The analysis of alignabilty/mappability for paired reads is much more complex and is not produced here, please read the publication for more information.

Feed back on this page is welcome and can be sent to bits@vib.be

Related reading

A recent paper evaluates the effect of read length on the unicity of mapping (BMC Bioinformatics 2014, 15:2 doi:10.1186/1471-2105-15-2 [13])

download exercise files

Download demo files here

Use the right application to open the files present in mappability-files

References:
  1. https://www.biorxiv.org/content/10.1101/611160v1
  2. Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guigó, Paolo Ribeca
    Fast computation and applications of genome mappability.
    PLoS One: 2012, 7(1);e30377
    [PubMed:22276185] ##WORLDCAT## [DOI] (I p)

  3. http://algorithms.cnag.cat/wiki/The_GEM_library
  4. http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3
  5. http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64
  6. http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386"
  7. http://genome.ucsc.edu/cgi-bin/hgTables
  8. https://sourceforge.net/projects/gemlibrary/files/gem-library/
  9. https://github.com/smarco/gem3-mapper
  10. http://genomevolution.org/wiki/index.php/SGRP:_Sanger_Institute_Yeast_Genomes
  11. http://genomevolution.org/CoGe/services/JBrowse/service.pl/sequence/9136
  12. http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html
  13. Wentian Li, Jan Freudenberg, Pedro Miramontes
    Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.
    BMC Bioinformatics: 2014, 15;2
    [PubMed:24386976] ##WORLDCAT## [DOI] (I e)


[ Main_Page | NGS_data_analysis ]