Compute read coverage from the mapping results

1 Introduction
2 about the notion of mapability and corresponding data
- 2.1 Can I compute mappability for my own reference genome (if not already available from UCSC)?
3 compute 'per base' genome coverage with samtools
4 compute 'per base' genome coverage with bedTools genomecov
- 4.1 MS-Excel can be used to print the coverage depth distribution for chr21
- 4.2 a nicer [R] script to plot the coverage depth distribution for chr21
5 compute the smoothed coverage in 10kb interval
6 count how much of chr21 non-gap regions was not sequenced with all reads
- 6.1 How can we compute how much is not due to centromers?
- 6.2 What is the length distribution of the no-coverage regions?
7 Note about coverage modes in RNA-Seq experiments
8 download exercise files

Introduction

After aligning reads to a reference genome, one should first perform a coverage analysis.

The coverage analysis results in two measurements

The coverage depth is the count of reads that align to the reference genome at each given position; it includes reads that also align to other position (unless filtered accordingly).
The coverage width reports the count of nucleotides in the reference genome that align to at least one NGS read; a number of reference positions are undefined (telomeres, centromeres) while other locations are highly degenerated (repeats or oligonucleotides); both types cannot be aligned to and will be absent from the coverage depth.

Knowing both the distribution of coverage depths and the genome positions not sequenced is important when doing NGS DNA variant analysis. It is obvious that some parts of the genomes (exons!) will not be sequenced and therefore not analyzed for the presence of variants. These regions may bear variations but will not be called and will seem to be invariable in the end result file (no-calls)

A basic coverage analysis allows detecting regions of higher copy number (either true or apparent) in the genome as well as regions with less or no coverage (possibly deleted in the sequenced genome).

Although a true CNV analysis requires paired genome data like tumor and normal samples from the same individual, a 'basic' coverage analysis is possible with a single sample using read mapping information but will be affected by the 'mappability' effects explained above and should therefore be 'normalized' as compared to other related genomes or using local averaging methods (loess normalization / sliding window averaging).

Taking all reads mapped to the reference genome and piling them up (build a multiple sequence alignment from the mapped reads) allows computing the genome coverage. This is an important metric that will for instance indicate regions of about 50% additional coverage as compared to the median value (potential duplicated loci for one allele), regions of about-half average coverage (potential haploid deletions) and finally regions of no-coverage (potential homozygous deletions). All these results should of course be taken with a lot of salt as artifacts and mapping issues can also lead to similar results (see illustration below for the first exon of the DSCAM gene).

Various tools allow computing read coverage, we will only detail below a selection that allows specific measurements and are simple to use.

about the notion of mapability and corresponding data

Not all regions of the human genome get equal chance to be mapped by reads. This is an important and often underestimated issue with genome sequencing that led to the definition of terms like mapability.

Read mapability or alignability is defined as the probability of any given region to be efficiently sequenced by NGS sequencing. Mapability is not constant across the reference genome and is subject to various effects associated with sequence content (GC, oligomers, N-regions) but also to the existence of larger repeated loci. Excessive piling up of reads is indicative of poorly mapped region

where do I get mapability data

Both links below allow retrieving data in bigwig format (.bw)

A 100bps fragment is reported with a value of 1 if unique in the genome, of 0.5 for fragments mapping to 2 alternate locations, and of 1/n for n alternate mapping fragments.

web-page:<http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability>^[1]
direct-link: <http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/>^[2]

http://genome.ucsc.edu/ENCODE/terms.html.
 
For a filterable list of these files along with additional information such as
release status, restriction dates, track description, methods, and metadata,
please use the 'Downloadable Files' user interface at: 
 
 
     http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability
 
 
In addition to data files, this directory may also contain some or all of the
following:
 
    * files.txt - a file listing each data file and its metadata.
 
    * supplemental - a directory containing any additional files provided by 
      the submitting laboratory.
 
    * obsolete files - WARNING - Revoked and replaced data files may be present 
      in this directory. Please make every effort to determine that you are 
      using the latest version of the data. To do so access the recommended 
      'Downloadable Files' user interface via the URL provided above or check 
      files.txt.  If the data file is not listed in files.txt or has a tag of 
      objStatus=revoked or replaced, then the file is obsolete.
 
    * releaseN directories - WARNING - These are only present on the 
      genome-preview site. Please note that data files on genome-preview have 
      not been verified or released to the public. They are subject to change 
      without notice and should be used with caution.
      Name                                            Last modified      Size  Description
      Parent Directory                                                     -   
      files.txt                                       24-Feb-2012 14:19  3.9K  
      md5sum.txt                                      26-Oct-2011 11:22  748   
      wgEncodeCrgMapabilityAlign24mer.bigWig          27-Apr-2010 06:44  5.0G  
      wgEncodeCrgMapabilityAlign36mer.bigWig          27-Apr-2010 07:47  1.3G  
      wgEncodeCrgMapabilityAlign40mer.bigWig          27-Apr-2010 08:49  1.2G  
      wgEncodeCrgMapabilityAlign50mer.bigWig          27-Apr-2010 09:44  706M  
      wgEncodeCrgMapabilityAlign75mer.bigWig          27-Apr-2010 10:26  281M  
      wgEncodeCrgMapabilityAlign100mer.bigWig         27-Apr-2010 04:25   99M  
      wgEncodeDacMapabilityConsensusExcludable.bed.gz 04-May-2011 14:08  4.6K  
      wgEncodeDukeMapabilityRegionsExcludable.bed.gz  28-Mar-2011 16:16   17K  
      wgEncodeDukeMapabilityUniqueness20bp.bigWig     15-Mar-2011 17:51  1.4G  
      wgEncodeDukeMapabilityUniqueness35bp.bigWig     11-May-2011 13:39  929M

Can I compute mappability for my own reference genome (if not already available from UCSC)?

Yes you can! using the GEM-library software and the fasta sequence of your genome. We provide a tutorial to do so in a separate page Create_a_mappability_track on our Wiki.

compute 'per base' genome coverage with samtools

Samtools is the main toolbox dedicated to SAM/BAM formatted files and is written in C++ and Perl. Samtools is complemented by similar tools including the popular Picard tools (written in Java) or the Broad Institute GATK (not used here due to license limitations).

REM: Samtools assumes diploid genomes, this might become an issue if you work on aneuploid species (plants) for which alternative tools exist. For haploid genomes (eg. bacteria, viruses) the consequences are less important and samtools can be used with some post hoc filtering.

#! /usr/bin/env bash
## script: samtools_coverage.sh
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-09
 
# required:
# samtools Version: 0.1.19-44428cd
 
# Samtools in recent versions has replaces pileup with mpileup. 
# Keep this in mind when you read online documentation as functionalities
#   have significantly changed with mpileup.
 
# full chromosome
# infolder=hg19_bwa-mapping/full_chromosome
# outfolder=samtools_genomeCvg/full_chromosome
 
ref=ref/HiSeq_UCSC_hg19.fa
genomeint=ref/hg19.100k.bed
infolder=Final_Results/hg19_bwa-mapping
outfolder=samtools_genomeCvg
mkdir -p ${outfolder}
 
# process bam data by mapping methods, compress, and save
for b in ${infolder}/*_mdup_chr21.bam; do
 
# using 'expr': another way to extract string subsets
name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$')
 
samtools depth ${b} | \
	bgzip -c > ${outfolder}/depth_${name}.txt.gz
 
done

The resulting data is not ready for visualisation and would require more reformatting.

samtools coverage first lines for bwa_aln on the 10pc sample

chr21   9467391 1
chr21   9467392 1
chr21   9467393 1
chr21   9467394 1
chr21   9467395 1
chr21   9467396 1
chr21   9467397 2
chr21   9467398 2
chr21   9467399 2
chr21   9467400 2
...

compute 'per base' genome coverage with bedTools genomecov

Another tool to compute global coverage per base is bedtools genomecov, part of the very complete bedtools suite. More info about this command here^[3]

#! /usr/bin/env bash
 
# script: bedtools_genomecoverage.sh
 
# required:
# bedtools Version: v2.17.0 or more
# IGVtools
# Summary: Compute the coverage of a feature file across a genome.
#
# Usage: bedtools genomecov [OPTIONS] -ibam <bam> -g <genome>
 
# we use here the genome size file generated above from the UCSC SQL query
genome=ref/hg19.genome
 
# sample or full data
infolder=Final_Results/hg19_bwa-mapping
 
outfolder="bedtools_genomeCvg"
mkdir -p ${outfolder}
 
for b in ${infolder}/*_mdup_chr21.bam; do
 
# using 'expr': another way to extract string subsets
name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$')
 
echo
echo "# processing ${name}"
 
# computing a coverage summary used to plot the coverage distribution with R
# we only keep chr21 lines obviously
bedtools genomecov -ibam ${b} \
	-g ${genome} | grep "^chr21" > ${outfolder}/${name}-histo.txt
 
# compute full genome coverage for IGV track
# -bga: Report depth in BedGraph format like with -bg. 
# However with this option, regions with zero coverage are also reported. 
# This allows one to quickly extract all regions of a genome with 0 coverage 
# by applying: “grep -w 0$” to the output (see below).
# the output is fairly large and can be reduced with '''igvtools toTDF'''
bedtools genomecov -bga \
	-ibam ${b} \
	-g ${genome} > ${outfolder}/${name}.bg
 
# convert .bg to .igv and compress with igvtools on the fly
(gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "genomecov", $4}' \
	${outfolder}/${name}.bg > ${outfolder}/${name}-cvg.igv) && \
	(igvtools toTDF -z 4 -f min,max,mean ${outfolder}/${name}-cvg.igv \
	${outfolder}/${name}-cvg.tdf hg19)
 
# extract 0-coverage regions from the bg file to a second file with '1' as score
grep -w 0$ ${outfolder}/${name}.bg | \
	( gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "no-coverage", 1}' \
	> ${outfolder}/nocvg-${name}-cvg.igv ) && \
	( igvtools toTDF -z 4 -f min,max,mean ${outfolder}/nocvg-${name}-cvg.igv \
	${outfolder}/nocvg-${name}-cvg.tdf hg19 )
 
done

files resulting from the above script for the full data

-rw-rw-r--  1 bits bits  56K May  6 10:46 nocvg-PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.igv
-rw-rw-r--  1 bits bits  66K May  6 10:46 nocvg-PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.tdf
-rw-rw-r--  1 bits bits  63K May  6 10:51 nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.igv
-rw-rw-r--  1 bits bits  72K May  6 10:51 nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.tdf
-rw-rw-r--  1 bits bits 424M May  6 10:44 PE_NA18507_GAIIx_100_chr21_aln-pe.bg
-rw-rw-r--  1 bits bits 581M May  6 10:45 PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.igv
-rw-rw-r--  1 bits bits  54M May  6 10:46 PE_NA18507_GAIIx_100_chr21_aln-pe-cvg.tdf
-rw-rw-r--  1 bits bits  30K May  6 10:42 PE_NA18507_GAIIx_100_chr21_aln-pe-histo.txt
-rw-rw-r--  1 bits bits 425M May  6 10:49 PE_NA18507_GAIIx_100_chr21_mem-pe.bg
-rw-rw-r--  1 bits bits 582M May  6 10:49 PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.igv
-rw-rw-r--  1 bits bits  54M May  6 10:50 PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.tdf
-rw-rw-r--  1 bits bits  46K May  6 10:47 PE_NA18507_GAIIx_100_chr21_mem-pe-histo.txt

the first part of the first script will output a text histogram table for each chromosome with the count of positions having each possible coverage value (eg PE_NA18507_GAIIx_100_chr21_mem-histo.txt). Such histogram can be used to plot the coverage curve often seen in NGS reports (see next section using excel and [R])
the second output PE_NA18507_GAIIx_100_chr21_mem-cvg.bg is a bedgraph formatted file that we can convert to IGV text format and then TDF binary format and visualize in IGV.
the third part will extract regions of no coverage from the .bg data to identify mapping issues or loci absent from the sequenced genome (homozygous deletions).

The TDF file is a binary form of the IGV format that is lighter for file transfer and faster to load in IGV. By contrast it is not readable anymore and not applicable to downstream text parsing (give and take)

Coverage can be visualized in IGV directly from the BAM data (IGV will compute a coverage track from it), or by loading the TDF file obtained above in IGV side by side with other data. Advantage of the TDF data is that it is visible in full chromosome view while the BAM data only becomes visible at high zoom level.

view bedtools TDF coverage data in IGV

Legend: the 5'-end of the NCAM2 gene (chr21:22,369,000-22,376,000)) with both coverage from the bedtool command and from the BWA-mem reads. Note that the coverage plots are fully concordant.

no-coverage regions in the 5'-end of the TPTE gene (chr21:10,988,000-10,993,000)

Legend: The figure shows both coverage from the bedtool command and from the BWA-mem reads as well as two no-coverage regions extracted from the first file. Note that the no-coverage is not due to the presence of undefined regions in this case as shown by the corresponding empty hg19_gap.bed track.

gap regions close to the chr21 centromer (chr21:10,310,800-11,582,800)

The hg19 assembly contains the following principal types of gaps:

Clone - gaps between clones in the same map contig. These may be bridged or not.
Contig - non-bridged gaps between map contigs.
Centromere - non-bridged gaps from centromeres.
Telomere - non-bridged gaps from telomeres.
Heterochromatin - non-bridged gaps from large blocks of heterochromatin.
Short Arm - non-bridged long gaps on the short arm of the chromosome.

Legend: The figure shows both no-coverage and gaps (regions with only N's) from the hg19_gap.bed track. Different gap types are shown and labelled accordingly.

The IGV inspection of genome coverage data often allows the identification of sequencing artefacts ('coverage gaps or extreme coverage levels'). The gaps can be real when a portion of the chromosome is absent from the DNA sample. Other gaps reveal sequencing biases due to GC content (as shown below) or other mappability issues (see NGS_Exercise.8 for more examples). Regions of apparent extreme coverage can reveal genome duplications or CNV (many copies of a unique reference sequence in the DNA sample lead to many more reads than what is expected for a diploid genome)

When opening the aln TDF file, a very high coverage (>400x) is obtained in the BAGE region (chr21:10,669,922-11,221,719). This phenomenon is reported in Genecard (<http://www.genecards.org/cgi-bin/carddisp.pl?gene=BAGE4>).

Miscellaneous: The ancestral BAGE gene was generated by juxtacentromeric reshuffling of the MLL3 gene. 
The BAGE family was expanded by juxtacentromeric movement and/or acrocentric exchanges. 
BAGE family is composed of expressed genes that map to the juxtacentromeric regions of chromosomes 13 and 21 
and of unexpressed gene fragments that scattered in the juxtacentromeric regions of several chromosomes, 
including chromosomes 9, 13, 18 and 21

BAGE4 locus with abnormally high coverage

The high copy-number of that region in the NA18507 genome led to a large excess of reads mapping to the single represented locus (called reference genome compression). The artefact is confirmed by the presence of a very large number of variants across the lower BAM track that reflect the multiple sequence differences between BAGE gene copies (note that these variants will likely be ignored during calling by setting a max value for coverage).

MS-Excel can be used to print the coverage depth distribution for chr21

The content of the second and third columns of the PE_NA18507_GAIIx_100_chr21_mem-histo.txt file are used as data to build a scatter plot using MS-Excel.

excel coverage plot drawn from PE_NA18507_GAIIx_100_chr21_mem-histo.txt

See below for a plot generated using [R] from the sam edata

a nicer [R] script to plot the coverage depth distribution for chr21

Copy the code below into a blank RStudio script page or load 'plot_coverage.R' from the 'scripts' folder

'plot_coverage.R' script

# script: plot_coverage.R
# required to add the table to the plot
# install once with 'install.packages("plotrix")'
library(plotrix)
 
setwd("~/NGS_DNASeq-training/bedtools_genomeCvg")
 
for (met in c("aln", "mem")) {
  data <- paste("NA18507_GAIIx_100_chr21_", met, "-histo.txt", sep="")
 
  chr21 <- read.delim(data, header=F)
  colnames(chr21) <- c("chr", "depth", "count", "chrlen", "cov%")
 
  # get depth pic (avoid coverage depth=0)
  # pic <- chr21$depth[which.max(chr21$count[chr21$depth>0])]+1
 
  # find X-coordinates for 5%, Q1, Q3, 95%
  sub <- chr21[chr21$depth>0,]
  vec <- rep(sub$depth, sub$count)
  Q005 <- quantile(vec, 0.05)
  Q1 <- summary(vec)[2]
  Q2 <- summary(vec)[3]
  Q3 <- summary(vec)[5]
  Q95 <- quantile(vec, 0.95)
 
  # create table for plot
  leg.table <- data.frame(depth=c(Q005, Q1, Q2, Q3, Q95))
 
  # create pictures with coverage distribution
  filename <- paste("sample_", met, "_coverage-plot.png", sep="")
  #filename <- paste(met, "_coverage-plot.png", sep="")
  title <- paste("coverage depth (", met, ")", sep="")
 
  png(filename = filename,
      width = 250, height = 250, units = "px", pointsize = 12,
      bg = "white")
 
  # limits for all and sample datasets
  #xmax=80
  xmax=10
  #ymax=1500000
  ymax=10000000
  plot(chr21$count ~ chr21$depth, 
       xlim=c(0, xmax),
       ylim=c(0, ymax),
       xlab=title,
       ylab="bp-count", 
       type="l",
       col="blue",
       lwd=3,
       cex=0.5)
 
  # add vertical line at pic
  #  abline(v=pic-0.5, col="grey", lwd=2)
  abline(v=Q005, col="grey", lwd=2)
  abline(v=Q1, col="grey", lwd=2)
  abline(v=Q2, col="red", lwd=2)
  abline(v=Q3, col="grey", lwd=2)
  abline(v=Q95, col="grey", lwd=2)
 
  # show most of the options
  addtable2plot(Q95, 
                (max(chr21$count)/20), 
                leg.table,
                bg="white",
                bty="o", 
                display.rownames=TRUE,
                hlines=TRUE, 
                vlines=TRUE,
                cex=0.75)
 
  dev.off()
}

coverage depth distribution plots for all mappings in aln and mem BAM files

coverage depth from all BWA-aln/mem reads	10pc results

NA18507 was sequenced with a median coverage depth of >40x which is a very decent value. The large number of bases with '0' coverage are located in the short arm of chr21, centromere, and telomeres where repeat frequency is highest and N-bases are very frequents (see IGV figures above).

compute the smoothed coverage in 10kb interval

create 10kb & 100kb bins for the 'hg19' reference genome

To normalize for local effects and obtain smoothed results, a stepping window of 10/100kb (arbitrary choice) is built using bedtools makewindows and used later with bedtools commands

#! /usr/bin/env bash
# script: bedtools_makewindows.sh
 
# required:
# bedtools Version: v2.17.0 or more
 
# Divide the human genome into stepping windows of 10kb, with no overlap
# Usage: bedtools makewindows [OPTIONS] [-g <genome> OR -b <bed>]
# [ -w <window_size> OR -n <number of windows>, -s <step_size> ]
 
# 10kb stepping
bedtools makewindows -g ref/hg19.genome \
        -w 10000 > ref/hg19.10k.bed
 
echo "# preview 10k windows (first 3 lines)"
head -3 ref/hg19.10k.bed
 
# chr1    0      10000
# chr1    10000  20000
# chr1    20000  30000
 
# 100kb stepping
bedtools makewindows -g ref/hg19.genome \
        -w 100000 > ref/hg19.100k.bed
 
echo "# preview 100k windows (first 3 lines)"
head -3 ref/hg19.100k.bed
 
# chr1    0       100000
# chr1    100000  200000
# chr1    200000  300000

compute the average coverage with bedTools coverage in 10kb interval

If you wish instead to report a smoother local average value of read coverage, you can use the bedtools makewindows data generated earlier and summarize BAM coverage over the defined intervals with bedtools coverage. The output results are IGV | TDF formatted files that we can visualize in IGV to identify problematic genomic regions.

#! /usr/bin/env bash
# scripts: bedtools_10k-coverage.sh
 
# Required: bedtools v2.17.0 or higher
# IGVTools
 
# compute %coverage in 10k bins
# Tool:    bedtools coverage (aka coverageBed)
# Summary: Returns the depth and breadth of coverage of features from A
#	 on the intervals in B.
#
# Usage:   bedtools coverage [OPTIONS] -abam <bam> -b <hg19.10k.bed>
# Default Output:  
#         After each entry in B, reports: 
#         4) The number of features in A that overlapped the B interval.
#         5) The number of bases in B that had non-zero coverage.
#         6) The length of the entry in B.
#         7) The fraction of bases in B that had non-zero coverage.
 
# we use here the genome size file generated above from the UCSC SQL query
genome=ref/hg19.genome
genomeint=ref/hg19.10k.bed
infolder=Final_Results/hg19_bwa-mapping
outfolder="bedtools_genomeCvg.10k"
mkdir -p ${outfolder}
 
# process chr21 BAM files
for b in ${infolder}/*_mdup_chr21.bam; do
 
# using 'expr': another way to extract string subsets
name=$(expr "$b" : '.*shuffled_\(.*\)_mdup_chr21.bam$')
 
echo
echo "#processing ${name}"
 
outfile=${name}_10k-cvg
 
bedtools coverage -abam ${b} \
  -b ${genomeint} | sort -k 1V,1 -k 2n,2 -k 3n,3 \
  > ${outfolder}/${outfile}.txt
 
# transform to igv for visualization
# convert .txt to .igv and compress with igvtools on the fly
gawk 'BEGIN{FS="\t"; OFS="\t"}{print $1, $2, $3, "percent.cvg", $7}' \
  ${outfolder}/${outfile}.txt > ${outfolder}/${outfile}.igv
 
igvtools toTDF -z 4 -f min,max,mean ${outfolder}/${outfile}.igv \
  ${outfolder}/${outfile}.tdf hg19
 
done

files resulting from the above script and full data

-rw-rw-r--  1 bits bits  14M May  6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.igv
-rw-rw-r--  1 bits bits 3.2M May  6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.tdf
-rw-rw-r--  1 bits bits  13M May  6 11:46 PE_NA18507_GAIIx_100_chr21_aln-pe_10k-cvg.txt
-rw-rw-r--  1 bits bits  14M May  6 11:48 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.igv
-rw-rw-r--  1 bits bits 3.2M May  6 11:49 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.tdf
-rw-rw-r--  1 bits bits  13M May  6 11:48 PE_NA18507_GAIIx_100_chr21_mem-pe_10k-cvg.txt

partial coverage in the ANKRD30BP2 gene region (chr21:14,445,900-14,480,100)

Legend: partial coverage in the ANKRD30BP2 ankyrin repeat domain 30B pseudogene 2 (chr21:14,445,900-14,480,100). regions of no-coverage and coverage alternate and are summarized in the 10kb window to the average value. This visualisation allows smoothing of signal and identifies regions of sufficient width showing incomplete coverage.

bedtools allows many more computations and reads from BED, VCF, BAM and more formats given a little bit of gawk magic.

create a Circos plot with the 10kb smoothed results

This section is provided as is and will not be executed during the training.
Feel free to recycle the code and make it yours.

Circos plots are quite popular due to their attractive aesthetics but are not ideal for detailed inspection. We provide this starter code (very basic plot indeed!) to introduce you to the world of Circos. Up to you to explore this further if interested. You will find all necessary information on about to install, configure, and use Circos in Martin Krzywinski pages <http://circos.ca>^[4]. Some likely missing files required to run the code below were copied to the exercise page on our server in the 'missing_etc-files.zip' archive . To reuse the script on your own machine, please extract the contained files to your own 'path-to-circos/etc' folder.

Perl script used to create the Circos data, configuration file and finally plot

#!/usr/bin/perl
use strict;
use POSIX qw/strftime ceil/;
use File::Basename;
# required: install with 'sudo cpan File::Basename'
use Statistics::Descriptive;
 
# script: plot_with_circos.sh
# create Circos plot from bedtools_genomeCvg.igv  data
# parse a bedtools 10k-cvg.igv file
# chr1	0	10000	percent.cvg	0.0000000
# chr1	10000	20000	percent.cvg	0.0146000
# chr1	20000	30000	percent.cvg	0.0000000
# chr1	30000	40000	percent.cvg	0.0000000
# chr1	40000	50000	percent.cvg	0.0000000
# chr1	50000	60000	percent.cvg	0.0000000
#
# Stéphane Plaisance VIB-BITS Nov-04-2013 v1.0
 
@ARGV == 2 || die ("usage: plot_with_circos.sh <bedtools_genomeCvg.igv> <target-chr>\n");
 
my ($infile, $onechr) = @ARGV;
# convert in circos naming
my $circhr = $onechr;
$circhr =~ s/chr/hs/;
print stderr "chromosome:",$circhr,"\n";
 
my $name = basename($infile, ".igv");
my $conffile=$name."_coverage-circos.conf";
my $plotfile=$name."_coverage-plot.png";
my $datafile=$name."_coverage-data.txt";
my $stat = Statistics::Descriptive::Full->new();
 
# one pass through data to compute min mid and max
my $z = OpenArchiveFile ($infile) || die "Error reading from $infile: $!\n";
my $cnt = 0;
print STDERR "# parsing data (\'.\'=10000 rows): ";
while (my $line = <$z>) {
	my ($chr, undef, undef, undef, $val) = split(/\t/, $line);
	# autosomes only
    next if $chr != $onechr;
	# exclude no-cvg
    next if $val==0;
    $cnt++;
    $cnt =~ /00000$/ && print STDERR ".";
    $stat->add_data(100*$val); 
	}
undef $z;
 
my $count = $stat->count();
my $min = $stat->quantile(1);
my $med = $stat->quantile(3);
my $max = 1.1*$stat->quantile(4);
 
print STDERR "# analyzed:$count\n# min\tmed\tmax\n# $min\t$med\t$max\n";
 
# call sub to create .conf file
print STDERR ". creating conf and data files\n";
createConf();
 
# extract column from quantile file and create Circos data file
my $z = OpenArchiveFile ($infile) || die "Error reading from $infile: $!\n";
open(DATA, "> $datafile") || die "Error: cannot create output :$!\n";
 
while (my $line = <$z>) {
	my ($chr, $start, $end, undef, $val) = split(/\t/, $line);
	# autosomes only
	next if $chr != $onechr;
	# rename 'chr' to 'hs' for circos
	$chr =~ s/chr/hs/;
	print DATA join ("\t", $chr, $start, $end, 100*$val)."\n";
	}
 
close data;
 
undef $z;
 
print STDERR ". creating Circos plot\n";
system ("circos -conf $conffile");
 
#===============
#==== SUBS =====
#===============
 
sub OpenArchiveFile {
my $infile = shift;
my $FH;
 
if ($infile =~ /.bz2$/) {
	open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile");
	} elsif ($infile =~ /.gz$/) {
		open ($FH, "zcat $infile |") or die ("$!: can't open file $infile");
		} elsif ($infile =~ /.tsv|.txt|.bedg|.igv$/) {
			open ($FH, "cat $infile |") or die ("$!: can't open file $infile");
			} else {
				die ("$!: do not recognise file type $infile");
				# if this happens add, the file type with correct opening proc
	}
return $FH;
}
 
# generate conf file paired to the data
sub createConf {
open(CONF, "> $conffile") || die "Error: cannot create output file :$!\n";
 
my $circos_root = $ENV{BIOTOOLS}."/circos";
my $circos_etc = $ENV{BIOTOOLS}."/circos/etc";
my $circos_data = $ENV{BIOTOOLS}."/circos/data";
 
print CONF <<"EOT";
<<include $circos_etc/colors_fonts_patterns.conf>>
 
<<include $circos_etc/ideogram.conf>>
<<include $circos_etc/ticks.conf>>
 
<image>
file* = $plotfile
<<include $circos_etc/image.conf>>
</image>
 
karyotype   = $circos_data/karyotype/karyotype.human.hg19.txt
 
#chromosomes_units = 1000000
#chromosomes_display_default = yes
 
chromosomes_units           = 1000000
chromosomes_display_default = no
chromosomes                 = $circhr
chromosomes_breaks          = -$circhr:120-140
 
<plots>
 
# tall histogram immediately inside the ideogram circle
# background* parameters define a colored backdrop for this histogram
# axes* define y-axes
 
<plot>
 
type      = histogram
file      = $datafile
 
r1        = 0.9r
r0        = 0.7r
max       = $max
min       = 0
 
stroke_type = outline
thickness   = 4
color       = vdgrey
extend_bin  = no
 
<backgrounds>
<background>
color = vvlgrey
</background>
</backgrounds>
 
<axes>
<axis>
spacing   = 0.1r
color     = lgrey
thickness = 2
</axis>
</axes>
 
<rules>
 
<rule>
use       = no
condition = var(value) < 0
show      = no
</rule>
 
<rule>
#condition  = var(value) < 0
condition  = 1
fill_color = eval(sprintf("spectral-9-div-%d",remap_int(var(value),-1,1,1,9)))
</rule>
 
</rules>
 
</plot>
 
</plots>
 
<<include etc/housekeeping.conf>>
EOT
 
close CONF;
}

chr21 coverage plot for the BWA mem mappings

Legend: Read coverage for chr21 obtained from the BWA mem data

PE_NA18507_GAIIx_100_chr21_mem_10k-cvg_coverage-plot.png

CircosAPI is a separate package helping build circos configuration files and can be obtained at <http://kylase.github.io/CircosAPI/>^[5] together with a nice tutorial ^[6]

For a good tutorial to setup circos & circos_api as a virtual machine on your non-unix computer, see here^[7]

count how much of chr21 non-gap regions was not sequenced with all reads

This little exercise shows classic use of grep, wc, and awk; your best friends in many situations.

## script: count_no-coverage.sh
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-09
 
# compute for full bwa_mem chr21 data
infolder=bedtools_genomeCvg/full_chromosome
outfolder=bedtools_genomeCvg/full_chromosome
mkdir -p ${outfolder}
 
infile=nocvg-PE_NA18507_GAIIx_100_chr21_mem.igv
 
outfile=chr21_${infile%%.igv}.bed
 
# create subset for chr21
grep "^chr21" ${infolder}/${infile} > ${outfolder}/${outfile}
 
len=$( wc -l ${outfolder}/${outfile} | cut -d " " -f 1 );
wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${outfolder}/${outfile} );
 
echo -e "count\twidth\n${len}\t${wid}" | column -t

count  width
1693   14094487

The answer suggests that 1693 regions of chr21 are not covered by reads for a total width of more than 14Mbases. This seems a lot and of course most of this is due to telomeres and centromeres (hashed in grey in the pictogram below).

^[8]

How can we compute how much is not due to centromers?

Bedtools subtract allows excluding gaps from the result to obtain the sample specific no-coverage (see more about this command at: <http://bedtools.readthedocs.org/en/latest/content/tools/subtract.html>^[9]).

the bedtols subtract command

Tool:    bedtools subtract (aka subtractBed)
Version: v2.17.0-111-gab7bbbe
Summary: Removes the portion(s) of an interval that is overlapped
	 by another feature(s).
 
Usage:   bedtools subtract [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
 
Options: 
	-f	Minimum overlap required as a fraction of A.
		- Default is 1E-9 (i.e., 1bp).
		- (FLOAT) (e.g. 0.50)
 
	-s	Require same strandedness.  That is, only subtract hits in B
		that overlap A on the _same_ strand.
		- By default, overlaps are subtracted without respect to strand.
 
	-S	Force strandedness.  That is, only subtract hits in B that
		overlap A on the _opposite_ strand.
		- By default, overlaps are subtracted without respect to strand.
 
	-A	Remove entire feature if any overlap.  That is, by default,
		only subtract the portion of A that overlaps B. Here, if
		any overlap is found (or -f amount), the entire feature is removed.
 
	-N	Same as -A except when used with -f, the amount is the sum
		of all features (not any single feature).

The gap file was obtained from the UCSC table browser as described below and is stored under ref/hg19_gaps.bed.

how do I get gaps from the UCSC table browser?

For each reference build, a file can be created (BED) with the name column indicating the gap type from the UCSC table browser http://genome.ucsc.edu/cgi-bin/hgTables?^[10]

## script: correct_no-coverage.sh
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-09
 
# compute for full bwa_mem chr21 data
infolder=bedtools_genomeCvg/full_chromosome
outfolder=bedtools_genomeCvg/full_chromosome
 
outfile=chr21_${infile%%.igv}.bed
mkdir -p ${outfolder}
 
infile=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed
 
gaps=ref/hg19_gaps.bed
outfile=chr21_nogap_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed
 
bedtools subtract -a ${infolder}/${infile} -b ${gaps} > ${outfolder}/${outfile}
 
len=$( wc -l ${infolder}/${outfile} | cut -d " " -f 1 );
wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${infolder}/${outfile} );
 
echo -e "count\twidth\n${len}\t${wid}" | column -t

count  width
1693   1073302

The region count is unchanged but the cumulated width dropped to 1Mbase

The large number of small no-coverage regions found is partly due to the fact that we work here with only 10% of the reads and have a limited sequencing depth. A possible next step would be to retrieve these sequences in order to analyze their base content or try to amplify them by conventional PCR for manual re-sequencing. Some of this is detailed in a supplementary exercise: Advanced usage of BEDTOOLS to retrieve unsequenced regions of chr21 for Sanger re-sequencing that you can read but do not need to do now.

What is the length distribution of the no-coverage regions?

We can also compute such basic descriptive statistics using R (and make plots) or more simply by applying awk on the BED files obtained above and the nice little qstats to compute simple stats. ^[11]. Note that a similar tool called stats is part of the filo package developed by Aaron Qinlan ^[12]

## script: gap_sizes.sh
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-09
 
# BED data obtained from the former scripts
infolder=bedtools_genomeCvg/full_chromosome
outfolder=bedtools_genomeCvg/full_chromosome
mkdir -p ${outfolder}
 
all=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed
nongap=chr21_nogap_nocvg-PE_NA18507_GAIIx_100_chr21_mem-pe-cvg.bed
 
# we compute the no-coverage length with end-start
awk 'BEGIN{FS="\t";OFS="\t";sum=0}{print $3-$2}' ${infolder}/${all} | qstats
 
Min.     1
1st Qu.  6
Median   44
Mean     8325.15
3rd Qu.  118
Max.     9.43619e+06
Range    9.43619e+06
Std Dev. 241707
Length   1693
 
awk 'BEGIN{FS="\t";OFS="\t";sum=0}{print $3-$2}' ${infolder}/${nongap} | qstats
 
Min.     1
1st Qu.  6
Median   44
Mean     633.965
3rd Qu.  118
Max.     61261
Range    61260
Std Dev. 3280.6
Length   1693

The results confirm that we mainly removed very large regions (centromeres) and that the remaining no-coverage blocks are relatively short with
* a median width of 44bps
* 75% of the fragments shorter or equal to 118bps.

Note about coverage modes in RNA-Seq experiments

In the case of spliced reads (RNA-Seq), coverage can be computed in various ways illustrated in the graph below. While this is not the aim of this training session, it is a worth mentioning detail. Please follow links below to get more info about HTSeq tools.

HTSeq^[13] is originally a python library but it can also be used directly in terminal to compute coverage from the command htseq-count.

HTSeq illustration for coverage count modes

download exercise files

Download exercise files here

Use the right application to open the files present in ex4-files

References:

[1] ttp://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability

[2] ttp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/

[3] ttp://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

[4] ttp://circos.ca

[5] ttp://kylase.github.io/CircosAPI/

[6] ttp://kylase.github.io/CircosAPI/tutorials/

[7] ttp://kylase.github.io/CircosAPI/circos-on-linux-virtual-machine/

[8] ttp://www.ncbi.nlm.nih.gov/mapview/maps.cgi?TAXID=9606&CHR=21&MAPS=ideogr,genec[0.00%3A3200.00]&QSTR=SOD1%20OR%20APS1&QUERY=uid(313,6305,61194,9026314)&rsize=3200&GOTO=3053.71human%253A21%253AISCN

[9] ttp://bedtools.readthedocs.org/en/latest/content/tools/subtract.html

[10] ttp://genome.ucsc.edu/cgi-bin/hgTables?hgsid=352776079&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_gap&hgta_ctDesc=table+browser+query+on+gap&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbDownBases=200&hgta_doMainPage=cancel

[11] ttps://github.com/tonyfischetti/qstats

[12] ttps://github.com/arq5x/filo

[13] ttp://www-huber.embl.de/users/anders/HTSeq

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

NGS Exercise.4

Contents

Introduction

about the notion of mapability and corresponding data

Can I compute mappability for my own reference genome (if not already available from UCSC)?

compute 'per base' genome coverage with samtools

compute 'per base' genome coverage with bedTools genomecov

MS-Excel can be used to print the coverage depth distribution for chr21

a nicer [R] script to plot the coverage depth distribution for chr21

compute the smoothed coverage in 10kb interval

create 10kb & 100kb bins for the 'hg19' reference genome

compute the average coverage with bedTools coverage in 10kb interval

create a Circos plot with the 10kb smoothed results

count how much of chr21 non-gap regions was not sequenced with all reads

How can we compute how much is not due to centromers?

What is the length distribution of the no-coverage regions?

Note about coverage modes in RNA-Seq experiments

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox

NGS Exercise.4

Contents

Introduction

about the notion of mapability and corresponding data

Can I compute mappability for my own reference genome (if not already available from UCSC)?

compute 'per base' genome coverage with samtools

compute 'per base' genome coverage with bedTools genomecov

MS-Excel can be used to print the coverage depth distribution for chr21

a nicer [R] script to plot the coverage depth distribution for chr21

compute the smoothed coverage in 10kb interval

create 10kb & 100kb *bins* for the 'hg19' reference genome

compute the average coverage with bedTools coverage in 10kb interval

create a Circos plot with the 10kb smoothed results

count how much of chr21 non-gap regions was not sequenced with all reads

How can we compute how much is not due to centromers?

What is the length distribution of the no-coverage regions?

Note about coverage modes in RNA-Seq experiments

download exercise files

Navigation menu

Search

create 10kb & 100kb bins for the 'hg19' reference genome