Create a custom 'GEM-library' Alignability (mappability) track

(©Derrien et al in 2012, see reference below)

Another tool GenMap ^[1] can also produce mappability tracks and is faster than gem

this page relates to Create a GC content track

Introduction

Results presented here are produced thanks to the method and tools developed by Derrien et al in 2012 ^[2]. We recommend users to fully read this technical paper in order to assimilate subtleties associated with mappability issues and identify limits of short read alignment.

We will create here five kmer alignability track for 16 nuclear chromosomes and the mitochondrial genome of S. cerevisiae strain S288C (sacCer3 obtained from the UCSC FTP repository). This tutorial was mainly built from the developer's web material ^[3].

Any other genome for which the chromosome sequences can be assembles in fasta format and gene annotations are available (optional but important for visualisation) can be processed similarly

Updated script can be found HERE

Download reference data

Download the yeast reference genome data from the UCSC table repository

UCSC data can be obtained from their FTP server http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3^[4]
other files are built from the table browser with interaction in your browser

We use below Kent tools (unix version from ^[5], also available for macOSX^[6]) used by the UCSC team to produce the Genome browser system. Some of these tools should be present on your machine in order to repeat the code below.

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
# download the 2bit file
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.2bit
 
# extract the fasta data with Kent's 'twoBitToFa'
twoBitToFa sacCer3.2bit SacCer3.fa
 
# also create a two-column list of chromosomes names and sizes with Kent's 'faSize'
faSize sacCer3.fa -detailed > sacCer3.sizes

download additional SacCer3 annotations

Get the SGD gene table as well as the GC distribution across the whole genome from the UCSC-table (http://genome.ucsc.edu/cgi-bin/hgTables^[7]).

the chromosome names should be identical between the 2bit-fasta and the gene table in order for IGV to display them together

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
# get the sgdGene table for other purposes (this is not a BED file)
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/database/sgdGene.txt.gz
 
# also get it in the BED-format ' SacCer3_sgdGenes.bed' using the Table browser (see screen-shot below)
 
# get and adapt GC% data in variable steps
wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/gc5Base/sacCer3.gc5Base.varStep.txt.gz
 
# the file: sacCer3.gc5Base.varStep.txt.gz is the raw data used to encode
# the gc5Base track on sacCer3.  This file was produced from the sacCer3.2bit
# sequence with the kent source tree utility hgGcPercent (see below)

create a 5base step file from the variable step with Kent's hgGcPercent

We can produce the GC% wig file directly from the 2bit genome and name it slightly differently to the downloaded version.

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
hgGcPercent -wigOut \
 	-doGaps \
 	-file=stdout \
 	-win=5 \
 	sacCer3 \
 	sacCer3.2bit | \
 	gzip > sacCer3.gc5Base.varStep.wig.gz

create a bigwig version for rapid visualization

The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
wigToBigWig <(bgzip -cd sacCer3.gc5Base.varStep.wig.gz) \
	sacCer3.sizes sacCer3.gc5Base.bw

Install and run the GEM library tools

Tools were obtained and installed from : https://sourceforge.net/projects/gemlibrary/files/gem-library/^[8] (the core-2 version was used on the local machine!)

A new version is on the make as of Nov2018 - please read more at https://github.com/smarco/gem3-mapper

^[9]

The new version Gem3-mapper is not compatible yet with this workflow, you need to use the old version for that and refer to this post https://evodify.com/gem-mappability/

The current link for the older version is: https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/

there is not yet a version for apple OSX and the source is not available to build locally

create a 'gem' index

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
pref="sacCer3"
reference="sacCer3.fa"
idxpref="sacCer3_index"
thr=8; # use 8 cores
 
gem-indexer -T ${thr} -c dna -i ${reference} -o ${idxpref}

(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com>
 (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
*          check for updates at <http://gemlibrary.sourceforge.net>.   *
************************************************************************
Creating sequence and location files... done.
Computing DNA BWT (likely to take long)... done.
Generating index (likely to take long)... done.
Cleaning up... done.
 
-rw-rw-r--  1 splaisan splaisan  35M Dec 18 14:15 SacCer3_index.gem
-rw-rw-r--  1 splaisan splaisan 1.9K Dec 18 14:15 SacCer3_index.log

compute Alignability for several K-mer lengths

The created index can now be used to compute mapability across the full genome. The alignability / mappability is computed for kmers between 50 and 250 bases by steps of 50 bases and the results are immediately converted to WIG and BigWIG format for IGV visualization.

create for 5 different read sizes

We selected increasing sizes corresponding to read material available on public repositories like SRA and showing the effect of read size on the uniqueness of subsequent short read alignment.

our tracks are based on single-end reads, paired reads are obviously more easily uniquely mapped due to their increased span size that can reach-out from difficult regions

The wigToBigWig command is somehow sensitive to complex fasta headers. If you Fasta sequences have long names including spaces and ':', you should better simplify it and keep only one word like '1' or 'chr1' to avoid problems at this stage

# create a folder to store all results
basefolder="~/Desktop/alignability/sacCer3_alignability"
mkdir –p ${basefolder} && cd ${basefolder}
 
pref="sacCer3"
idxpref="sacCer3_index"
thr=8; # use 8 cores
 
for kmer in $(seq 50 50 250); do
 
  # compute mappability data
  gem-mappability -T ${thr} -I ${idxpref}.gem -l ${kmer} -o ${pref}_${kmer}
 
  # convert results to wig and bigwig
  gem-2-wig -I ${idxpref}.gem -i ${pref}_${kmer}.mappability -o ${pref}_${kmer}
  wigToBigWig ${pref}_${kmer}.wig ${pref}.sizes ${pref}_${kmer}.bw
 
done

Visualize results in IGV

A screen-shot is provided to illustrate the results using IGV and the Broad sacCer3 loaded genome. Other yeast genomes available in IGV include:

Other strains are available from:

Reads for yeast genome can be obtained from the Sanger institute at:

http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html^[12]

Legend:

top track represents gene density across the genome for SGD genes
next 5 tacks show alignability / mappability for increasing kmer length.
bottom track shows GC% across the genome in 5base bins (scale set between 30 and 50)

Conclusion

Building a custom mappability track is easy and can be done on a laptop for a small genome for which users have the fasta sequence (therefore also for 'contigs' or 'scaffolds' in the case of unfinished genomes). We present here code and results for the current yeast strain that can be adapted for other genomes or strains. The analysis of alignabilty/mappability for paired reads is much more complex and is not produced here, please read the publication for more information.

Feed back on this page is welcome and can be sent to bits@vib.be

download exercise files

Download demo files here

Use the right application to open the files present in mappability-files

References:

↑ https://www.biorxiv.org/content/10.1101/611160v1
↑
Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guigó, Paolo Ribeca
Fast computation and applications of genome mappability.
PLoS One: 2012, 7(1);e30377
[PubMed:22276185] ##WORLDCAT## [DOI] (I p)
↑ http://algorithms.cnag.cat/wiki/The_GEM_library
↑ http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3
↑ http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64
↑ http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386"
↑ http://genome.ucsc.edu/cgi-bin/hgTables
↑ https://sourceforge.net/projects/gemlibrary/files/gem-library/
↑ https://github.com/smarco/gem3-mapper
↑ http://genomevolution.org/wiki/index.php/SGRP:_Sanger_Institute_Yeast_Genomes
↑ http://genomevolution.org/CoGe/services/JBrowse/service.pl/sequence/9136
↑ http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html
↑
Wentian Li, Jan Freudenberg, Pedro Miramontes
Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.
BMC Bioinformatics: 2014, 15;2
[PubMed:24386976] ##WORLDCAT## [DOI] (I e)

[ Main_Page | NGS_data_analysis ]

[1] ttps://www.biorxiv.org/content/10.1101/611160v1

[2] 
Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guigó, Paolo Ribeca
Fast computation and applications of genome mappability.
PLoS One: 2012, 7(1);e30377
[PubMed:22276185] ##WORLDCAT## [DOI] (I p)

[3] ttp://algorithms.cnag.cat/wiki/The_GEM_library

[4] ttp://hgdownload.soe.ucsc.edu/goldenPath/sacCer3

[5] ttp://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64

[6] ttp://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386"

[7] ttp://genome.ucsc.edu/cgi-bin/hgTables

[8] ttps://sourceforge.net/projects/gemlibrary/files/gem-library/

[9] ttps://github.com/smarco/gem3-mapper

[10] ttp://genomevolution.org/wiki/index.php/SGRP:_Sanger_Institute_Yeast_Genomes

[11] ttp://genomevolution.org/CoGe/services/JBrowse/service.pl/sequence/9136

[12] ttp://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html

[13] 
Wentian Li, Jan Freudenberg, Pedro Miramontes
Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.
BMC Bioinformatics: 2014, 15;2
[PubMed:24386976] ##WORLDCAT## [DOI] (I e)

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

Create a mappability track

Contents

Introduction

Download reference data

Download the yeast reference genome data from the UCSC table repository

download additional SacCer3 annotations

create a 5base step file from the variable step with Kent's hgGcPercent

create a bigwig version for rapid visualization

Install and run the GEM library tools

create a 'gem' index

compute Alignability for several K-mer lengths

create for 5 different read sizes

Visualize results in IGV

Conclusion

Related reading

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox